
Evaluating AI Systems: Testing Agents at Scale

University of Central Florida
Valorum Data

Computational Analysis of Social Complexity

Fall 2025, Spencer Lyon

Prerequisites

  • Pydantic AI Agents and Tools
  • Python programming fundamentals
  • Basic understanding of testing concepts

Outcomes

  • Understand why systematic evaluation is critical for AI systems
  • Identify when and what to evaluate in AI agents
  • Implement deterministic and LLM-based evaluators
  • Design evaluation datasets using code-first approaches
  • Analyze and compare evaluation results across experiments
  • Connect evaluation practices to production deployment concerns


Introduction: The AI Testing Problem

Why Testing AI is Different

  • Traditional software testing: deterministic inputs → deterministic outputs
    • If add(2, 3) returns 5 once, it always will
    • Clear pass/fail criteria
  • AI systems: same input → potentially different outputs
    • Ask an agent “What’s the capital of France?” twice, might get:
      • “The capital of France is Paris.”
      • “Paris is France’s capital city.”
      • “France’s capital is Paris, known for the Eiffel Tower.”
    • All correct, but different!
  • Key challenge: How do we test something that’s non-deterministic?
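
To make the challenge concrete, here is a minimal sketch (with made-up responses) of why exact-match assertions break down for LLM outputs and why we instead check properties of the output:

# Three hypothetical, equally correct answers to the same question
responses = [
    "The capital of France is Paris.",
    "Paris is France's capital city.",
    "France's capital is Paris, known for the Eiffel Tower.",
]

# Traditional exact-match testing would fail on two of the three:
# assert responses[1] == "The capital of France is Paris."

# A property-based check is robust to wording differences
assert all("paris" in r.lower() for r in responses)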

Motivating Scenario: The Support Bot Problem

  • You’ve built a customer support agent using Pydantic AI
  • It handles questions about product returns, shipping, and account issues
  • In development, it seems to work great on the examples you tried
  • You deploy to production and...
    • Sometimes gives outdated return policy information
    • Occasionally “hallucinates” shipping partners that don’t exist
    • Has trouble with edge cases you didn’t think to test
  • Question: How could systematic evaluation have caught these issues?

What We’ll Learn

  • We’ll explore Pydantic Evals, a framework for testing AI systems
  • Three main components:
    1. Datasets: Collections of test scenarios
    2. Evaluators: Scoring mechanisms to check outputs
    3. Experiments: Runs that combine datasets and evaluators
  • Think of it like unit testing for AI:
    • Cases + Evaluators = individual unit tests
    • Datasets = test suites
    • Experiments = running your entire test suite

Setup

Install and Import Dependencies

We’ll need Pydantic AI and the evals module:

# Install pydantic-ai if needed
# %pip install pydantic-ai[anthropic,evals] pydantic-evals
from pydantic_ai import Agent
from pydantic_evals import Dataset, Case
from pydantic_evals.evaluators import (
    Evaluator,
    EvaluatorContext,
    EvaluationReason,
    EqualsExpected,
    Contains,
    IsInstance,
    MaxDuration,
    LLMJudge,
    HasMatchingSpan,
)
from pydantic import BaseModel
import os

Configure API Keys

Make sure you have your Anthropic API key set:

from dotenv import load_dotenv
import nest_asyncio

load_dotenv()
nest_asyncio.apply()
# Your API key should be set in environment
# os.environ["ANTHROPIC_API_KEY"] = "your-key-here"

# Verify it's set
assert "ANTHROPIC_API_KEY" in os.environ, "Please set ANTHROPIC_API_KEY environment variable"

Core Concepts: The Evaluation Framework

Structure: Cases, Datasets, and Experiments

Cases: Individual test scenarios

  • Like a single unit test
  • Contains:
    • inputs: Data you pass to the agent
    • expected_output: (Optional) What you expect back
    • metadata: (Optional) Context about this test
    • evaluators: (Optional) Case-specific checks
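
As a minimal sketch (the values here are illustrative), a single case might look like the following; case-specific evaluators are covered in the Evaluators section below:

# One test scenario: inputs, an expected output, and optional metadata
example_case = Case(
    name="refund_policy_question",
    inputs="Can I get a refund after 30 days?",
    expected_output="return",
    metadata={"difficulty": "easy", "category": "returns"},
)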

Datasets: Collections of cases

  • Like a test suite
  • Groups related test scenarios
  • Can have dataset-level evaluators that apply to all cases

Experiments: Evaluation runs

  • Like running pytest in Python or Pkg.test() in Julia
  • Executes your task function on all cases
  • Applies evaluators to score outputs
  • Generates reports with results

Example: Simple Support Bot Dataset

Let’s build intuition with a concrete example.

Suppose we want to test our support bot’s ability to identify the user’s intent:

# Define test cases
intent_dataset = Dataset[str, str](
    name="Intent Classification Tests",
    cases=[
        Case(
            name="return_request",
            inputs="I want to return my order",
            expected_output="return"
        ),
        Case(
            name="shipping_status",
            inputs="Where is my package?",
            expected_output="shipping"
        ),
        Case(
            name="account_question",
            inputs="How do I reset my password?",
            expected_output="account"
        )
    ]
)

print(f"Dataset: {intent_dataset.name}")
print(f"Number of cases: {len(intent_dataset.cases)}")
Dataset: Intent Classification Tests
Number of cases: 3

Note the pattern:

  • Each case tests one scenario
  • We specify what we expect
  • Cases are typed: Dataset[str, str] means string inputs → string outputs
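
The type parameters are not limited to strings. As a hedged sketch (the SupportReply model below is invented for illustration), a dataset can be typed with a Pydantic model when the agent returns structured output:

# Hypothetical structured output type
class SupportReply(BaseModel):
    intent: str
    reply: str

# String inputs -> SupportReply outputs
structured_dataset = Dataset[str, SupportReply](
    name="Structured Reply Tests",
    cases=[
        Case(
            name="return_request_structured",
            inputs="I want to return my order",
            expected_output=SupportReply(
                intent="return",
                reply="You can start a return from your orders page.",
            ),
        ),
    ],
)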

Exercise 1: Design Your Own Cases

Scenario: You’re building a sentiment analysis agent for product reviews.

Task: Following the pattern above, create a Dataset[str, str] with 3-5 test cases for an agent that should classify product reviews as:

  • “positive”
  • “negative”
  • “neutral”

Think about:

  • What review texts represent typical positive/negative/neutral cases?
  • What are edge cases? (e.g., mixed sentiments, sarcasm)
  • Make sure each Case has:
    • A descriptive name
    • inputs (the review text)
    • expected_output (the sentiment label)
# TODO: Your code here
# Create a sentiment analysis dataset following the pattern above

sentiment_dataset = Dataset[str, str](
    name="Product Sentiment Analysis",
    cases=[
        # TODO: Add your test cases here
        # Case(
        #     name="clearly_positive",
        #     inputs="This product exceeded my expectations! ...",
        #     expected_output="positive"
        # ),
    ]
)

print(f"Created dataset with {len(sentiment_dataset.cases)} cases")
Created dataset with 0 cases

Evaluators: How to Score Outputs

Two Types of Evaluation

Deterministic Evaluators: Code-based checks

  • Exact matches
  • Type checking
  • Format validation (email, phone number, URL)
  • PII detection
  • Regular expression matching

Non-Deterministic Evaluators: Subjective assessment

  • LLM as judge
  • Human evaluation
  • Quality metrics (accuracy, relevance, helpfulness)
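
Most of the deterministic checks listed above map to either a built-in evaluator (shown next) or a few lines of custom code. As one rough sketch of the latter, here is a PII check written with the custom-evaluator pattern covered later in this lecture; the regexes are illustrative and far from exhaustive:

import re

class NoPII(Evaluator):
    """Flag outputs that appear to contain an email address or US-style phone number."""

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        text = str(ctx.output)
        has_email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text) is not None
        has_phone = re.search(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", text) is not None
        leaked = has_email or has_phone
        return EvaluationReason(
            value=not leaked,
            explanation="No PII detected" if not leaked else "Possible email/phone number in output",
        )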

Built-in Evaluators

Pydantic Evals provides several ready-made evaluators:

1. Exact Matching

# Check if output equals expected value
evaluator = EqualsExpected()

print(f"Evaluator: {evaluator}")
Evaluator: EqualsExpected()

2. Type Checking

# Ensure output is correct type
evaluator = IsInstance('str')

print(f"Evaluator: {evaluator}")
Evaluator: IsInstance(type_name='str')

3. Membership/Contains

# Check if key phrase appears in output
evaluator = Contains('return policy')

print(f"Evaluator: {evaluator}")
Evaluator: Contains(value='return policy')

4. Performance Constraints

# Ensure agent responds quickly enough
evaluator = MaxDuration(seconds=2.0)  # 2 seconds max

print(f"Evaluator: {evaluator}")
Evaluator: MaxDuration(seconds=2.0)

Example: Adding Evaluators to Dataset

intent_dataset_with_evals = Dataset[str, str](
    name="Intent Classification Tests",
    cases=[
        Case(
            name="return_request",
            inputs="I want to return my order",
            expected_output="return"
        ),
        Case(
            name="shipping_status",
            inputs="Where is my package?",
            expected_output="shipping"
        ),
        Case(
            name="account_question",
            inputs="How do I reset my password?",
            expected_output="account"
        ),
    ],
    evaluators=[
        EqualsExpected(),  # Applied to all cases
        MaxDuration(seconds=2.0)  # Response time check
    ]
)

print(f"Dataset has {len(intent_dataset_with_evals.evaluators)} evaluators")
print(f"Dataset has {len(intent_dataset_with_evals.cases)} cases")
Dataset has 2 evaluators
Dataset has 3 cases

LLM as Judge: When Correctness is Subjective

  • Sometimes there’s no single “correct” answer
  • Example: “Write a friendly response to this complaint”
    • Many valid responses exist
    • Hard to check with deterministic rules
  • Solution: Use another LLM to evaluate

LLMJudge Evaluator:

judge = LLMJudge(
    rubric=(
        "Score from 0-10 on friendliness and helpfulness. "
        "Friendly responses should acknowledge the customer's frustration. "
        "Helpful responses should offer concrete next steps."
    ),
    model='anthropic:claude-haiku-4-5'
)

print("Created LLMJudge evaluator")
Created LLMJudge evaluator

How it works:

  1. Your agent generates an output
  2. LLMJudge sends that output + rubric to an LLM
  3. LLM scores the output based on the rubric
  4. Score is recorded in the evaluation report
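
In practice, you attach the judge to a dataset like any other evaluator. A minimal sketch (the case and rubric are illustrative, and we assume the include_input option so the judge also sees the original message):

# Illustrative: judge a free-form support reply rather than an exact label
tone_dataset = Dataset[str, str](
    name="Complaint Response Tone",
    cases=[
        Case(
            name="late_delivery_complaint",
            inputs="My order is a week late and nobody has responded to my emails!",
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric=(
                "The response acknowledges the customer's frustration "
                "and offers a concrete next step."
            ),
            model='anthropic:claude-haiku-4-5',
            include_input=True,  # assumed option: pass the input to the judge as well
        ),
    ],
)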

Custom Evaluators: Domain-Specific Checks

You can create custom evaluators by subclassing Evaluator:

class ContainsURL(Evaluator):
    """Check if output contains a valid URL."""

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        output = ctx.output
        # Simple URL detection
        has_url = 'http://' in str(output) or 'https://' in str(output)

        return EvaluationReason(
            value=has_url,
            explanation="URL found" if has_url else "No URL found"
        )

# Test it
url_checker = ContainsURL()
print("Created custom ContainsURL evaluator")
Created custom ContainsURL evaluator

Key points:

  • Implement evaluate method (can be sync or async)
  • Access inputs/outputs through EvaluatorContext
  • Return EvaluationReason with value and explanation

Exercise 2: Design Evaluators

For the sentiment analysis agent from Exercise 1:

  1. What deterministic evaluators would you use?
    • Think about checking exact label matches, valid sentiment values
  2. What subjective aspects might need LLM evaluation?
    • Example: When a review has mixed sentiment, is the chosen label reasonable?
    • What would your LLMJudge rubric say?
  3. Design one custom evaluator for a domain-specific check
    • Example: “Output should be all lowercase” or “Response time should be fast”
# TODO: Your code here
# Create evaluators for sentiment analysis

# Example deterministic evaluator
# sentiment_dataset.evaluators.append(EqualsExpected())

# Example custom evaluator
# class ValidSentiment(Evaluator):
#     """Check if output is a valid sentiment label."""
#     def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
#         valid_sentiments = {"positive", "negative", "neutral"}
#         is_valid = ctx.output.lower() in valid_sentiments
#         return EvaluationReason(
#             value=is_valid,
#             explanation=f"Output is {'valid' if is_valid else 'invalid'} sentiment"
#         )

Running Evaluations: From Code to Reports

The Evaluation Loop

Step 1: Define your task function

# Create a simple intent classifier agent
support_agent = Agent(
    'anthropic:claude-haiku-4-5',
    system_prompt=(
        "You are a customer support intent classifier. "
        "Classify user messages into one of: return, shipping, account. "
        "Respond with ONLY the classification label, nothing else."
    )
)

async def classify_intent(inputs: str) -> str:
    """Our task: classify customer message intent."""
    result = await support_agent.run(inputs)
    return result.output.lower().strip()

print("Created intent classification agent")
Created intent classification agent

Step 2: Run evaluation

# Run experiment
report = await intent_dataset_with_evals.evaluate(
    task=classify_intent,
    max_concurrency=3,  # Run 3 cases in parallel
    progress=True  # Show progress bar
)

print("\nEvaluation complete!")
print(f"Evaluated {len(report.cases)} cases")

Step 3: Analyze results

# Print formatted report
report.print()

print("\n" + "="*50)
print("Detailed results:")
print("="*50)

# Or get the data programmatically
for case in report.cases:
    print(f"\nCase: {case.name}")
    print(f"  Input: {case.inputs}")
    print(f"  Output: {case.output}")
    print(f"  Expected: {case.expected_output}")
    print(f"  Assertions: {case.assertions}")
    print(f"  All passed: {all(case.assertions.values())}")

Visualizing Results with Logfire

While the printed reports are useful, Pydantic Logfire provides a web UI for visualizing and analyzing evaluation results over time.

Why use Logfire?

  • Interactive dashboards for evaluation metrics
  • Trace exploration for debugging failed cases
  • Track trends across multiple evaluation runs
  • Team collaboration and sharing results

Let’s configure Logfire integration:

import logfire

# Configure Logfire
# This will automatically send evaluation results to Logfire if a token is present
logfire.configure(
    send_to_logfire='if-token-present',
)

print("Logfire configured!")
print("Future evaluations will automatically appear in Logfire web UI")
Logfire configured!
Future evaluations will automatically appear in Logfire web UI
Logfire project URL: https://logfire-us.pydantic.dev/sglyon/cap-6318-example

Note: With Logfire configured, all subsequent evaluations will automatically send their results to the Logfire web UI. You can then:

  • View evaluation reports in an interactive dashboard
  • Explore individual traces and spans
  • Compare results across multiple runs
  • Set up alerts for failing evaluations

Visit logfire.pydantic.dev to view your evaluation results.

Comparing Experiments: Tracking Improvements

Let’s modify our agent and compare results:

# Create an improved agent with better prompt
improved_agent = Agent(
    'anthropic:claude-haiku-4-5',
    system_prompt=(
        "You are a customer support intent classifier. "
        "Classify user messages into exactly one of these categories:\n"
        "- return: for return/refund requests\n"
        "- shipping: for delivery/tracking questions\n"
        "- account: for login/password/profile issues\n\n"
        "Respond with ONLY the classification label in lowercase, nothing else."
    )
)

async def improved_classify_intent(inputs: str) -> str:
    """Improved task function."""
    result = await improved_agent.run(inputs)
    return result.output.lower().strip()

# Run evaluation with improved agent
improved_report = await intent_dataset_with_evals.evaluate(
    task=improved_classify_intent,
    max_concurrency=3,
    progress=True
)
# Compare against baseline
improved_report.print(baseline=report)

Exercise 3: Run Your First Evaluation

Using the sentiment analysis agent you designed in Exercise 1:

  1. Create an agent that performs sentiment classification
  2. Use your dataset from Exercise 1 (add more cases if needed)
  3. Add 2-3 evaluators (mix of built-in and custom from Exercise 2)
  4. Run the evaluation using await dataset.evaluate(task=your_function)
  5. Interpret the results:
    • Which cases passed/failed?
    • What patterns do you notice?
    • What would you improve about the agent or the test cases?
# TODO: Your code here

Advanced Topics: Span-Based Evaluation and Dataset Generation

Span-Based Evaluation: Evaluating the Process, Not Just the Output

The Problem: Sometimes the final answer is correct, but the how matters

  • Example: Math problem solving
    • Output: “42” ✓ Correct!
    • But did the agent:
      • Use the right formula?
      • Show its work?
      • Make calculation errors that happened to cancel out?

Spans: Execution traces from OpenTelemetry

  • Capture what the agent did internally
  • Tool calls made
  • LLM requests and responses
  • Intermediate reasoning steps

HasMatchingSpan Evaluator:

# Ensure agent called a specific tool
span_evaluator = HasMatchingSpan(
    query={'name_contains': 'calculator_tool'},
    evaluation_name='used_calculator'
)

print("Created span-based evaluator")
print("This evaluator checks that a span named 'calculator_tool' was called")

Why this matters:

  • Catches “lucky guesses” where agent gets answer right for wrong reasons
  • Validates agent is following intended reasoning process
  • Useful for multi-step tasks where process correctness matters
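
A common pattern is to pair an output check with a process check, so a case passes only when the answer is right and was produced the intended way. A rough sketch reusing the query shape from above (the tool name is illustrative):

# Pass only if the answer is correct AND the calculator tool was actually used
math_dataset = Dataset[str, str](
    name="Math With Tools",
    cases=[
        Case(
            name="simple_multiplication",
            inputs="What is 6 * 7?",
            expected_output="42",
        ),
    ],
    evaluators=[
        EqualsExpected(),  # checks the final answer
        HasMatchingSpan(   # checks the process that produced it
            query={'name_contains': 'calculator_tool'},
            evaluation_name='used_calculator',
        ),
    ],
)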

Generating Datasets with LLMs

The Challenge: Creating comprehensive test datasets is tedious

  • Need diverse inputs covering edge cases
  • Need correct expected outputs
  • Manual creation doesn’t scale

Solution: Use an LLM to generate test cases

from pydantic_evals.generation import generate_dataset

# Generate dataset for sentiment analysis
generated_dataset = await generate_dataset(
    dataset_type=Dataset[str, str],  # Input and output types
    n_examples=5,  # Generate 5 test cases
    extra_instructions=(
        "Create diverse restaurant reviews with clear sentiment. "
        "Input should be a restaurant review. "
        "Output should be the sentiment: positive, negative, or neutral."
    ),
    model='anthropic:claude-haiku-4-5'
)

print(f"Generated {len(generated_dataset.cases)} test cases")
for i, case in enumerate(generated_dataset.cases[:3]):
    print(f"\nCase {i+1}:")
    print(f"  Input: {case.inputs}")
    print(f"  Expected: {case.expected_output}")

How it works:

  1. You specify the dataset type with input/output types (e.g., Dataset[str, str])
  2. LLM generates diverse test scenarios based on your instructions
  3. Returns properly structured Dataset object with generated cases
  4. Can optionally save to file for version control (using path parameter)

Best practices:

  • Review generated cases before using
  • Mix generated and hand-crafted cases
  • Regenerate periodically to expand coverage
  • Use extra_instructions to guide the LLM toward specific edge cases

Evaluating RAG Systems: A Two-Stage Challenge

What is RAG?

  • RAG = Retrieval-Augmented Generation
  • Common pattern for AI agents that need to answer questions about documents/data
  • Two stages:
    1. Retrieval: Find relevant documents/passages from knowledge base
    2. Generation: Use retrieved context to generate answer

Why RAG Evaluation is Different

  • Traditional evaluation: just check the final answer
  • RAG evaluation: need to check both stages
    • Is the retrieval finding the right documents?
    • Is the generation using those documents correctly?
  • Failure can happen at either stage (or both!)

Example Failure Modes:

  • ✓ Retrieval works, ✗ Generation fails: Found right docs, but hallucinated answer
  • ✗ Retrieval fails, ✓ Generation works: Couldn’t find relevant docs, so generated plausible but wrong answer
  • ✗ Both fail: Retrieved irrelevant docs and made up information

RAG Evaluation Metrics

Retrieval Metrics (Is the retrieval working?)

class PrecisionAtK(Evaluator):
    """Check if retrieved documents are relevant."""

    def __init__(self, k: int = 5):
        self.k = k

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        # Assume ctx.metadata contains retrieved doc IDs and ground truth
        retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
        relevant_docs = ctx.metadata.get('relevant_doc_ids', [])

        relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
        precision = len(relevant_retrieved) / self.k if self.k > 0 else 0

        return EvaluationReason(
            value=precision,
            explanation=f"Retrieved {len(relevant_retrieved)}/{self.k} relevant docs"
        )

class RecallAtK(Evaluator):
    """Check if all relevant documents were found."""

    def __init__(self, k: int = 5):
        self.k = k

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
        relevant_docs = ctx.metadata.get('relevant_doc_ids', [])

        relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
        recall = len(relevant_retrieved) / len(relevant_docs) if len(relevant_docs) > 0 else 0

        return EvaluationReason(
            value=recall,
            explanation=f"Found {len(relevant_retrieved)}/{len(relevant_docs)} relevant docs"
        )

print("Created RAG retrieval evaluators")

Generation Metrics (Is the generation working?)

class Faithfulness(Evaluator):
    """Check if answer is grounded in retrieved context."""

    async def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        answer = ctx.output
        context = ctx.metadata.get('retrieved_context', '')

        # Use LLM to judge faithfulness
        judge = Agent('anthropic:claude-haiku-4-5')
        result = await judge.run(
            f"Context: {context}\n\n"
            f"Answer: {answer}\n\n"
            f"Is the answer fully supported by the context? "
            f"Respond with YES, NO, or PARTIAL and explain why."
        )

        assessment = result.output
        is_faithful = assessment.startswith('YES')

        return EvaluationReason(
            value=is_faithful,
            explanation=assessment
        )

print("Created Faithfulness evaluator")

Best Practices for RAG Evaluation

1. Evaluate Stages Independently

Create separate datasets for retrieval and generation to pinpoint where failures occur.

# Example: Retrieval evaluation dataset
retrieval_dataset = Dataset(
    name="Retrieval Quality",
    cases=[
        Case(
            inputs={"query": "What are the return policies?"},
            metadata={
                "relevant_doc_ids": ["doc_42", "doc_87"],  # Ground truth
                "retrieved_doc_ids": ["doc_42", "doc_87", "doc_13", "doc_99", "doc_5"],  # Simulated retrieval
            }
        ),
    ],
    evaluators=[
        PrecisionAtK(k=5),
        RecallAtK(k=5),
    ]
)

print("Created retrieval evaluation dataset")

Why separate?

  • Pinpoints where failures occur
  • Can optimize retrieval and generation independently
  • Clearer diagnosis: “Our retrieval is great but generation hallucinates” vs “Both need work”

Exercise: Design RAG Evaluation

Scenario: You’re building a RAG system that answers questions about a company’s internal documentation.

Tasks:

  1. Identify failure modes: What are 3 ways this RAG system could fail?
  2. Design retrieval tests: What cases would test if retrieval is working?
    • What queries should always retrieve specific documents?
    • What edge cases might break retrieval?
  3. Design generation tests: Assuming perfect retrieval, how do you test generation?
    • What makes a “good” answer?
    • How do you detect hallucinations?
  4. Create evaluation pipeline: Sketch code for evaluating both stages
    • What metrics would you track?
    • How would you report results?
# TODO: Your code here

Integration and Best Practices

When to Use Each Evaluation Type

Deterministic Evaluators when:

  • Clear right/wrong answers exist
  • Output format matters (structured data)
  • Security/safety constraints (no PII leakage)
  • Performance requirements (latency, cost)

LLM as Judge when:

  • Multiple valid answers exist
  • Quality is subjective (helpfulness, tone)
  • Semantic equivalence matters (“Paris” vs “The capital of France is Paris”)

Span-Based Evaluation when:

  • Process correctness matters, not just output
  • Multi-step reasoning needs validation
  • Tool usage patterns are important
  • Debugging complex agent behaviors

Tips for Effective Evaluation

Start Small, Grow Gradually:

  • Begin with 5-10 cases covering main scenarios
  • Add cases as you discover failures
  • Prioritize cases that would impact users most

Balance Coverage and Maintainability:

  • Don’t try to test everything
  • Focus on high-risk or high-value scenarios
  • Remove redundant cases

Make Evaluators Specific and Clear:

  • Good rubric: “Score 0-10 on factual accuracy. Check claims against provided context.”
  • Bad rubric: “Score the quality of the response.”

Version Control Your Datasets:

  • Store datasets as YAML/JSON in git
  • Track changes over time
  • Share across team
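
A minimal sketch of what this can look like with Pydantic Evals, assuming the to_file / from_file helpers (the filename is illustrative):

# Write the dataset to YAML so it can be committed to git and reviewed in PRs
intent_dataset_with_evals.to_file("intent_cases.yaml")

# Later (e.g. in CI), load it back with matching type parameters
loaded = Dataset[str, str].from_file("intent_cases.yaml")
print(f"Loaded {len(loaded.cases)} cases from file")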

Automate Where Possible:

  • Run evals in CI/CD pipeline
  • Block deploys if pass rate drops
  • Generate alerts for regressions
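
As a sketch of what a CI gate could look like, assuming the report and case attributes used earlier in this lecture (the threshold is arbitrary):

# Hypothetical CI gate: fail the build if the pass rate drops below a threshold
async def ci_gate(dataset, task, min_pass_rate: float = 0.9) -> None:
    report = await dataset.evaluate(task=task, progress=False)
    passed = sum(
        all(a.value for a in case.assertions.values())
        for case in report.cases
    )
    pass_rate = passed / len(report.cases)
    print(f"Pass rate: {pass_rate:.0%}")
    assert pass_rate >= min_pass_rate, (
        f"Pass rate {pass_rate:.0%} is below the required {min_pass_rate:.0%}"
    )

# Example usage in a CI job:
# await ci_gate(intent_dataset_with_evals, classify_intent)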

Exercise: Evaluation Strategy Design

Choose one scenario:

  1. E-commerce support agent: Handles returns, shipping, account questions
  2. Code review agent: Reviews pull requests, suggests improvements
  3. Data analysis agent: Answers questions about datasets using pandas

For your chosen scenario:

  1. Design an evaluation strategy:
    • What’s in your test dataset? (10+ cases)
    • What evaluators would you use?
    • How would you measure success?
  2. Describe your development workflow:
    • When do you run evals?
    • What metrics do you track?
    • How do you decide when to deploy?
  3. Plan for production:
    • How do you handle failures?
    • When do you update your dataset?
    • What triggers re-evaluation?

Connections to Course Themes

Game Theory and Evaluation

  • Agent alignment: Evaluation as mechanism design

    • You design rubrics (rules) to incentivize desired behaviors
    • LLM as Judge is like a referee in a game
    • Pass/fail thresholds create strategic constraints
  • Adversarial evaluation: Red team vs Blue team

    • Attackers try to make agent fail (jailbreaking, prompt injection)
    • Defenders build evals that catch these attacks
    • Nash equilibrium between robustness and capability

Network Effects in AI Systems

  • Evaluation datasets as networks:

    • Cases can have dependencies (one builds on another)
    • Failures can cascade (if base functionality breaks, many cases fail)
    • Coverage metrics: are there “clusters” of untested scenarios?
  • Agent-to-agent evaluation:

    • Multi-agent systems need coordinated evaluation
    • Agent A’s outputs become Agent B’s inputs
    • Network of evals reflects agent interaction topology

Emergence in Complex AI Systems

  • Emergent behaviors in multi-step agents:

    • Simple agent + simple tools → complex behaviors
    • Can’t predict all outcomes from components
    • Evaluation discovers emergent capabilities (and failures)
  • Evaluation as an ABM simulation:

    • Each test case is like running the simulation once
    • Aggregate results reveal patterns
    • Edge cases show boundary conditions of agent “behavior space”

Summary and Key Takeaways

What We Learned

  1. Why Evals Matter

    • AI systems are non-deterministic
    • Systematic testing catches issues before production
    • Evals enable confident iteration and deployment
  2. Core Framework: Pydantic Evals

    • Cases: individual test scenarios
    • Datasets: collections of cases
    • Evaluators: scoring mechanisms (deterministic, LLM, custom)
    • Experiments: runs that generate reports
  3. Evaluation Strategies

    • Deterministic checks for clear criteria
    • LLM as Judge for subjective quality
    • Span-based evaluation for process correctness
    • Custom evaluators for domain-specific needs
    • RAG-specific metrics for retrieval + generation
  4. Best Practices

    • Start small, iterate based on failures
    • Balance coverage with maintainability
    • Version control datasets and track metrics
    • Integrate into development and deployment workflows

Looking Forward

  • Evaluations are “an emerging art/science”
  • No single “right” approach exists
  • Adapt techniques to your domain and constraints
  • Key principle: Test systematically, deploy confidently

Final Exercise: Reflection

Think about an AI agent you might build:

  1. What are the top 3 risks or failure modes?
  2. How would you design evals to catch those?
  3. What would “success” look like quantitatively?