
Evaluating AI Systems: Testing Agents at Scale

University of Central Florida
Valorum Data

Computational Analysis of Social Complexity

Fall 2025, Spencer Lyon

Prerequisites

  • Pydantic AI Agents and Tools
  • Python programming fundamentals
  • Basic understanding of testing concepts

Outcomes

  • Understand why systematic evaluation is critical for AI systems
  • Identify when and what to evaluate in AI agents
  • Implement deterministic and LLM-based evaluators
  • Design evaluation datasets using code-first approaches
  • Analyze and compare evaluation results across experiments
  • Connect evaluation practices to production deployment concerns


Introduction: The AI Testing Problem

Why Testing AI is Different

  • Traditional software testing: deterministic inputs → deterministic outputs
    • If add(2, 3) returns 5 once, it always will
    • Clear pass/fail criteria
  • AI systems: same input → potentially different outputs
    • Ask an agent “What’s the capital of France?” twice, might get:
      • “The capital of France is Paris.”
      • “Paris is France’s capital city.”
      • “France’s capital is Paris, known for the Eiffel Tower.”
    • All correct, but different!
  • Key challenge: How do we test something that’s non-deterministic?
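
To make the challenge concrete, here is a minimal sketch (with made-up responses) of why exact-match assertions break down for LLM outputs and why we instead check properties of the output:

# Three hypothetical, equally correct answers to the same question
responses = [
    "The capital of France is Paris.",
    "Paris is France's capital city.",
    "France's capital is Paris, known for the Eiffel Tower.",
]

# Traditional exact-match testing would fail on two of the three:
# assert responses[1] == "The capital of France is Paris."

# A property-based check is robust to wording differences
assert all("paris" in r.lower() for r in responses)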

Motivating Scenario: The Support Bot Problem

  • You’ve built a customer support agent using Pydantic AI
  • It handles questions about product returns, shipping, and account issues
  • In development, it seems to work great on the examples you tried
  • You deploy to production and...
    • Sometimes gives outdated return policy information
    • Occasionally “hallucinates” shipping partners that don’t exist
    • Has trouble with edge cases you didn’t think to test
  • Question: How could systematic evaluation have caught these issues?

What We’ll Learn

  • We’ll explore Pydantic Evals, a framework for testing AI systems
  • Three main components:
    1. Datasets: Collections of test scenarios
    2. Evaluators: Scoring mechanisms to check outputs
    3. Experiments: Runs that combine datasets and evaluators
  • Think of it like unit testing for AI:
    • Cases + Evaluators = individual unit tests
    • Datasets = test suites
    • Experiments = running your entire test suite

Setup

Install and Import Dependencies

We’ll need Pydantic AI and the evals module:

# Install pydantic-ai if needed
# %pip install pydantic-ai[anthropic,evals] pydantic-evals
from pydantic_ai import Agent
from pydantic_evals import Dataset, Case
from pydantic_evals.evaluators import (
    Evaluator,
    EvaluatorContext,
    EvaluationReason,
    EqualsExpected,
    Contains,
    IsInstance,
    MaxDuration,
    LLMJudge,
    HasMatchingSpan,
)
from pydantic import BaseModel
import os

Configure API Keys

Make sure you have your Anthropic API key set:

from dotenv import load_dotenv
import nest_asyncio

load_dotenv()
nest_asyncio.apply()
# Your API key should be set in environment
# os.environ["ANTHROPIC_API_KEY"] = "your-key-here"

# Verify it's set
assert "ANTHROPIC_API_KEY" in os.environ, "Please set ANTHROPIC_API_KEY environment variable"

Core Concepts: The Evaluation Framework

Structure: Cases, Datasets, and Experiments

Cases: Individual test scenarios

  • Like a single unit test
  • Contains:
    • inputs: Data you pass to the agent
    • expected_output: (Optional) What you expect back
    • metadata: (Optional) Context about this test
    • evaluators: (Optional) Case-specific checks
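
As a minimal sketch (the values here are illustrative), a single case might look like the following; case-specific evaluators are covered in the Evaluators section below:

# One test scenario: inputs, an expected output, and optional metadata
example_case = Case(
    name="refund_policy_question",
    inputs="Can I get a refund after 30 days?",
    expected_output="return",
    metadata={"difficulty": "easy", "category": "returns"},
)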

Datasets: Collections of cases

  • Like a test suite
  • Groups related test scenarios
  • Can have dataset-level evaluators that apply to all cases

Experiments: Evaluation runs

  • Like running pytest in Python or Pkg.test() in Julia
  • Executes your task function on all cases
  • Applies evaluators to score outputs
  • Generates reports with results

Example: Simple Support Bot Dataset

Let’s build intuition with a concrete example.

Suppose we want to test our support bot’s ability to identify the user’s intent:

# Define test cases
intent_dataset = Dataset[str, str](
    name="Intent Classification Tests",
    cases=[
        Case(
            name="return_request",
            inputs="I want to return my order",
            expected_output="return"
        ),
        Case(
            name="shipping_status",
            inputs="Where is my package?",
            expected_output="shipping"
        ),
        Case(
            name="account_question",
            inputs="How do I reset my password?",
            expected_output="account"
        )
    ]
)

print(f"Dataset: {intent_dataset.name}")
print(f"Number of cases: {len(intent_dataset.cases)}")
Dataset: Intent Classification Tests
Number of cases: 3

Note the pattern:

  • Each case tests one scenario
  • We specify what we expect
  • Cases are typed: Dataset[str, str] means string inputs → string outputs
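
The type parameters are not limited to strings. As a hedged sketch (the SupportReply model below is invented for illustration), a dataset can be typed with a Pydantic model when the agent returns structured output:

# Hypothetical structured output type
class SupportReply(BaseModel):
    intent: str
    reply: str

# String inputs -> SupportReply outputs
structured_dataset = Dataset[str, SupportReply](
    name="Structured Reply Tests",
    cases=[
        Case(
            name="return_request_structured",
            inputs="I want to return my order",
            expected_output=SupportReply(
                intent="return",
                reply="You can start a return from your orders page.",
            ),
        ),
    ],
)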

Exercise 1: Design Your Own Cases

Scenario: You’re building a sentiment analysis agent for product reviews.

Task: Following the pattern above, create a Dataset[str, str] with 3-5 test cases for an agent that should classify product reviews as:

  • “positive”
  • “negative”
  • “neutral”

Think about:

  • What review texts represent typical positive/negative/neutral cases?
  • What are edge cases? (e.g., mixed sentiments, sarcasm)
  • Make sure each Case has:
    • A descriptive name
    • inputs (the review text)
    • expected_output (the sentiment label)
# TODO: Your code here
# Create a sentiment analysis dataset following the pattern above

sentiment_dataset = Dataset[str, str](
    name="Product Sentiment Analysis",
    cases=[
        # TODO: Add your test cases here
        # Case(
        #     name="clearly_positive",
        #     inputs="This product exceeded my expectations! ...",
        #     expected_output="positive"
        # ),
    ]
)

print(f"Created dataset with {len(sentiment_dataset.cases)} cases")
Created dataset with 0 cases

Evaluators: How to Score Outputs

Two Types of Evaluation

Deterministic Evaluators: Code-based checks

  • Exact matches
  • Type checking
  • Format validation (email, phone number, URL)
  • PII detection
  • Regular expression matching

Non-Deterministic Evaluators: Subjective assessment

  • LLM as judge
  • Human evaluation
  • Quality metrics (accuracy, relevance, helpfulness)
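
Most of the deterministic checks listed above map to either a built-in evaluator (shown next) or a few lines of custom code. As one rough sketch of the latter, here is a PII check written with the custom-evaluator pattern covered later in this lecture; the regexes are illustrative and far from exhaustive:

import re

class NoPII(Evaluator):
    """Flag outputs that appear to contain an email address or US-style phone number."""

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        text = str(ctx.output)
        has_email = re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text) is not None
        has_phone = re.search(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b", text) is not None
        leaked = has_email or has_phone
        return EvaluationReason(
            value=not leaked,
            explanation="No PII detected" if not leaked else "Possible email/phone number in output",
        )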

Built-in Evaluators

Pydantic Evals provides several ready-made evaluators:

1. Exact Matching

# Check if output equals expected value
evaluator = EqualsExpected()

print(f"Evaluator: {evaluator}")
Evaluator: EqualsExpected()

2. Type Checking

# Ensure output is correct type
evaluator = IsInstance('str')

print(f"Evaluator: {evaluator}")
Evaluator: IsInstance(type_name='str')

3. Membership/Contains

# Check if key phrase appears in output
evaluator = Contains('return policy')

print(f"Evaluator: {evaluator}")
Evaluator: Contains(value='return policy')

4. Performance Constraints

# Ensure agent responds quickly enough
evaluator = MaxDuration(seconds=2.0)  # 2 seconds max

print(f"Evaluator: {evaluator}")
Evaluator: MaxDuration(seconds=2.0)

Example: Adding Evaluators to Dataset

intent_dataset_with_evals = Dataset[str, str](
    name="Intent Classification Tests",
    cases=[
        Case(
            name="return_request",
            inputs="I want to return my order",
            expected_output="return"
        ),
        Case(
            name="shipping_status",
            inputs="Where is my package?",
            expected_output="shipping"
        ),
        Case(
            name="account_question",
            inputs="How do I reset my password?",
            expected_output="account"
        ),
    ],
    evaluators=[
        EqualsExpected(),  # Applied to all cases
        MaxDuration(seconds=2.0)  # Response time check
    ]
)

print(f"Dataset has {len(intent_dataset_with_evals.evaluators)} evaluators")
print(f"Dataset has {len(intent_dataset_with_evals.cases)} cases")
Dataset has 2 evaluators
Dataset has 3 cases

LLM as Judge: When Correctness is Subjective

  • Sometimes there’s no single “correct” answer
  • Example: “Write a friendly response to this complaint”
    • Many valid responses exist
    • Hard to check with deterministic rules
  • Solution: Use another LLM to evaluate

LLMJudge Evaluator:

judge = LLMJudge(
    rubric=(
        "Score from 0-10 on friendliness and helpfulness. "
        "Friendly responses should acknowledge the customer's frustration. "
        "Helpful responses should offer concrete next steps."
    ),
    model='anthropic:claude-haiku-4-5'
)

print("Created LLMJudge evaluator")
Created LLMJudge evaluator

How it works:

  1. Your agent generates an output
  2. LLMJudge sends that output + rubric to an LLM
  3. LLM scores the output based on the rubric
  4. Score is recorded in the evaluation report
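
In practice, you attach the judge to a dataset like any other evaluator. A minimal sketch (the case and rubric are illustrative, and we assume the include_input option so the judge also sees the original message):

# Illustrative: judge a free-form support reply rather than an exact label
tone_dataset = Dataset[str, str](
    name="Complaint Response Tone",
    cases=[
        Case(
            name="late_delivery_complaint",
            inputs="My order is a week late and nobody has responded to my emails!",
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric=(
                "The response acknowledges the customer's frustration "
                "and offers a concrete next step."
            ),
            model='anthropic:claude-haiku-4-5',
            include_input=True,  # assumed option: pass the input to the judge as well
        ),
    ],
)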

Custom Evaluators: Domain-Specific Checks

You can create custom evaluators by subclassing Evaluator:

class ContainsURL(Evaluator):
    """Check if output contains a valid URL."""

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        output = ctx.output
        # Simple URL detection
        has_url = 'http://' in str(output) or 'https://' in str(output)

        return EvaluationReason(
            value=has_url,
            explanation="URL found" if has_url else "No URL found"
        )

# Test it
url_checker = ContainsURL()
print("Created custom ContainsURL evaluator")
Created custom ContainsURL evaluator

Key points:

  • Implement evaluate method (can be sync or async)
  • Access inputs/outputs through EvaluatorContext
  • Return EvaluationReason with value and explanation

Exercise 2: Design Evaluators

For the sentiment analysis agent from Exercise 1:

  1. What deterministic evaluators would you use?
    • Think about checking exact label matches, valid sentiment values
  2. What subjective aspects might need LLM evaluation?
    • Example: When a review has mixed sentiment, is the chosen label reasonable?
    • What would your LLMJudge rubric say?
  3. Design one custom evaluator for a domain-specific check
    • Example: “Output should be all lowercase” or “Response time should be fast”
# TODO: Your code here
# Create evaluators for sentiment analysis

# Example deterministic evaluator
# sentiment_dataset.evaluators.append(EqualsExpected())

# Example custom evaluator
# class ValidSentiment(Evaluator):
#     """Check if output is a valid sentiment label."""
#     def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
#         valid_sentiments = {"positive", "negative", "neutral"}
#         is_valid = ctx.output.lower() in valid_sentiments
#         return EvaluationReason(
#             value=is_valid,
#             explanation=f"Output is {'valid' if is_valid else 'invalid'} sentiment"
#         )

Running Evaluations: From Code to Reports

The Evaluation Loop

Step 1: Define your task function

# Create a simple intent classifier agent
support_agent = Agent(
    'anthropic:claude-haiku-4-5',
    system_prompt=(
        "You are a customer support intent classifier. "
        "Classify user messages into one of: return, shipping, account. "
        "Respond with ONLY the classification label, nothing else."
    )
)

async def classify_intent(inputs: str) -> str:
    """Our task: classify customer message intent."""
    result = await support_agent.run(inputs)
    return result.output.lower().strip()

print("Created intent classification agent")
Created intent classification agent

Step 2: Run evaluation

# Run experiment
report = await intent_dataset_with_evals.evaluate(
    task=classify_intent,
    max_concurrency=3,  # Run 3 cases in parallel
    progress=True  # Show progress bar
)

print("\nEvaluation complete!")
print(f"Evaluated {len(report.cases)} cases")

Step 3: Analyze results

# Print formatted report
report.print()

print("\n" + "="*50)
print("Detailed results:")
print("="*50)

# Or get the data programmatically
for case in report.cases:
    print(f"\nCase: {case.name}")
    print(f"  Input: {case.inputs}")
    print(f"  Output: {case.output}")
    print(f"  Expected: {case.expected_output}")
    print(f"  Assertions: {case.assertions}")
    print(f"  All passed: {all(case.assertions.values())}")

Visualizing Results with Logfire

While the printed reports are useful, Pydantic Logfire provides a web UI for visualizing and analyzing evaluation results over time.

Why use Logfire?

  • Interactive dashboards for evaluation metrics
  • Trace exploration for debugging failed cases
  • Track trends across multiple evaluation runs
  • Team collaboration and sharing results

Let’s configure Logfire integration:

import logfire

# Configure Logfire
# This will automatically send evaluation results to Logfire if a token is present
logfire.configure(
    send_to_logfire='if-token-present',
)

print("Logfire configured!")
print("Future evaluations will automatically appear in Logfire web UI")
Logfire configured!
Future evaluations will automatically appear in Logfire web UI
Logfire project URL: https://logfire-us.pydantic.dev/sglyon/cap-6318-example

Note: With Logfire configured, all subsequent evaluations will automatically send their results to the Logfire web UI. You can then:

  • View evaluation reports in an interactive dashboard
  • Explore individual traces and spans
  • Compare results across multiple runs
  • Set up alerts for failing evaluations

Visit logfire.pydantic.dev to view your evaluation results.

Comparing Experiments: Tracking Improvements

Let’s modify our agent and compare results:

# Create an improved agent with better prompt
improved_agent = Agent(
    'anthropic:claude-haiku-4-5',
    system_prompt=(
        "You are a customer support intent classifier. "
        "Classify user messages into exactly one of these categories:\n"
        "- return: for return/refund requests\n"
        "- shipping: for delivery/tracking questions\n"
        "- account: for login/password/profile issues\n\n"
        "Respond with ONLY the classification label in lowercase, nothing else."
    )
)

async def improved_classify_intent(inputs: str) -> str:
    """Improved task function."""
    result = await improved_agent.run(inputs)
    return result.output.lower().strip()

# Run evaluation with improved agent
improved_report = await intent_dataset_with_evals.evaluate(
    task=improved_classify_intent,
    max_concurrency=3,
    progress=True
)
# Compare against baseline
improved_report.print(baseline=report)

Exercise 3: Run Your First Evaluation

Using the sentiment analysis agent you designed in Exercise 1:

  1. Create an agent that performs sentiment classification
  2. Use your dataset from Exercise 1 (add more cases if needed)
  3. Add 2-3 evaluators (mix of built-in and custom from Exercise 2)
  4. Run the evaluation using await dataset.evaluate(task=your_function)
  5. Interpret the results:
    • Which cases passed/failed?
    • What patterns do you notice?
    • What would you improve about the agent or the test cases?
# TODO: Your code here

Advanced Topics: Span-Based Evaluation and Dataset Generation

Span-Based Evaluation: Evaluating the Process, Not Just the Output

The Problem: Sometimes the final answer is correct, but the how matters

  • Example: Math problem solving
    • Output: “42” ✓ Correct!
    • But did the agent:
      • Use the right formula?
      • Show its work?
      • Make calculation errors that happened to cancel out?

Spans: Execution traces from OpenTelemetry

  • Capture what the agent did internally
  • Tool calls made
  • LLM requests and responses
  • Intermediate reasoning steps

HasMatchingSpan Evaluator:

# Ensure agent called a specific tool
span_evaluator = HasMatchingSpan(
    query={'name_contains': 'calculator_tool'},
    evaluation_name='used_calculator'
)

print("Created span-based evaluator")
print("This evaluator checks that a span named 'calculator_tool' was called")

Why this matters:

  • Catches “lucky guesses” where agent gets answer right for wrong reasons
  • Validates agent is following intended reasoning process
  • Useful for multi-step tasks where process correctness matters
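
A common pattern is to pair an output check with a process check, so a case passes only when the answer is right and was produced the intended way. A rough sketch reusing the query shape from above (the tool name is illustrative):

# Pass only if the answer is correct AND the calculator tool was actually used
math_dataset = Dataset[str, str](
    name="Math With Tools",
    cases=[
        Case(
            name="simple_multiplication",
            inputs="What is 6 * 7?",
            expected_output="42",
        ),
    ],
    evaluators=[
        EqualsExpected(),  # checks the final answer
        HasMatchingSpan(   # checks the process that produced it
            query={'name_contains': 'calculator_tool'},
            evaluation_name='used_calculator',
        ),
    ],
)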

Generating Datasets with LLMs

The Challenge: Creating comprehensive test datasets is tedious

  • Need diverse inputs covering edge cases
  • Need correct expected outputs
  • Manual creation doesn’t scale

Solution: Use an LLM to generate test cases

from pydantic_evals.generation import generate_dataset

# Generate dataset for sentiment analysis
generated_dataset = await generate_dataset(
    dataset_type=Dataset[str, str],  # Input and output types
    n_examples=5,  # Generate 5 test cases
    extra_instructions=(
        "Create diverse restaurant reviews with clear sentiment. "
        "Input should be a restaurant review. "
        "Output should be the sentiment: positive, negative, or neutral."
    ),
    model='anthropic:claude-haiku-4-5'
)

print(f"Generated {len(generated_dataset.cases)} test cases")
for i, case in enumerate(generated_dataset.cases[:3]):
    print(f"\nCase {i+1}:")
    print(f"  Input: {case.inputs}")
    print(f"  Expected: {case.expected_output}")

How it works:

  1. You specify the dataset type with input/output types (e.g., Dataset[str, str])
  2. LLM generates diverse test scenarios based on your instructions
  3. Returns properly structured Dataset object with generated cases
  4. Can optionally save to file for version control (using path parameter)

Best practices:

  • Review generated cases before using
  • Mix generated and hand-crafted cases
  • Regenerate periodically to expand coverage
  • Use extra_instructions to guide the LLM toward specific edge cases

Evaluating RAG Systems: A Two-Stage Challenge

What is RAG?

  • RAG = Retrieval-Augmented Generation
  • Common pattern for AI agents that need to answer questions about documents/data
  • Two stages:
    1. Retrieval: Find relevant documents/passages from knowledge base
    2. Generation: Use retrieved context to generate answer

Why RAG Evaluation is Different

  • Traditional evaluation: just check the final answer
  • RAG evaluation: need to check both stages
    • Is the retrieval finding the right documents?
    • Is the generation using those documents correctly?
  • Failure can happen at either stage (or both!)

Example Failure Modes:

  • ✓ Retrieval works, ✗ Generation fails: Found right docs, but hallucinated answer
  • ✗ Retrieval fails, ✓ Generation works: Couldn’t find relevant docs, so generated plausible but wrong answer
  • ✗ Both fail: Retrieved irrelevant docs and made up information

RAG Evaluation Metrics

Retrieval Metrics (Is the retrieval working?)

class PrecisionAtK(Evaluator):
    """Check if retrieved documents are relevant."""

    def __init__(self, k: int = 5):
        self.k = k

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        # Assume ctx.metadata contains retrieved doc IDs and ground truth
        retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
        relevant_docs = ctx.metadata.get('relevant_doc_ids', [])

        relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
        precision = len(relevant_retrieved) / self.k if self.k > 0 else 0

        return EvaluationReason(
            value=precision,
            explanation=f"Retrieved {len(relevant_retrieved)}/{self.k} relevant docs"
        )

class RecallAtK(Evaluator):
    """Check if all relevant documents were found."""

    def __init__(self, k: int = 5):
        self.k = k

    def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
        relevant_docs = ctx.metadata.get('relevant_doc_ids', [])

        relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
        recall = len(relevant_retrieved) / len(relevant_docs) if len(relevant_docs) > 0 else 0

        return EvaluationReason(
            value=recall,
            explanation=f"Found {len(relevant_retrieved)}/{len(relevant_docs)} relevant docs"
        )

print("Created RAG retrieval evaluators")

Generation Metrics (Is the generation working?)

class Faithfulness(Evaluator):
    """Check if answer is grounded in retrieved context."""

    async def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
        answer = ctx.output
        context = ctx.metadata.get('retrieved_context', '')

        # Use LLM to judge faithfulness
        judge = Agent('anthropic:claude-haiku-4-5')
        result = await judge.run(
            f"Context: {context}\n\n"
            f"Answer: {answer}\n\n"
            f"Is the answer fully supported by the context? "
            f"Respond with YES, NO, or PARTIAL and explain why."
        )

        assessment = result.output
        is_faithful = assessment.startswith('YES')

        return EvaluationReason(
            value=is_faithful,
            explanation=assessment
        )

print("Created Faithfulness evaluator")

Best Practices for RAG Evaluation

1. Evaluate Stages Independently

Create separate datasets for retrieval and generation to pinpoint where failures occur.

# Example: Retrieval evaluation dataset
retrieval_dataset = Dataset(
    name="Retrieval Quality",
    cases=[
        Case(
            inputs={"query": "What are the return policies?"},
            metadata={
                "relevant_doc_ids": ["doc_42", "doc_87"],  # Ground truth
                "retrieved_doc_ids": ["doc_42", "doc_87", "doc_13", "doc_99", "doc_5"],  # Simulated retrieval
            }
        ),
    ],
    evaluators=[
        PrecisionAtK(k=5),
        RecallAtK(k=5),
    ]
)

print("Created retrieval evaluation dataset")

Why separate?

  • Pinpoints where failures occur
  • Can optimize retrieval and generation independently
  • Clearer diagnosis: “Our retrieval is great but generation hallucinates” vs “Both need work”

Exercise: Design RAG Evaluation

Scenario: You’re building a RAG system that answers questions about a company’s internal documentation.

Tasks:

  1. Identify failure modes: What are 3 ways this RAG system could fail?
  2. Design retrieval tests: What cases would test if retrieval is working?
    • What queries should always retrieve specific documents?
    • What edge cases might break retrieval?
  3. Design generation tests: Assuming perfect retrieval, how do you test generation?
    • What makes a “good” answer?
    • How do you detect hallucinations?
  4. Create evaluation pipeline: Sketch code for evaluating both stages
    • What metrics would you track?
    • How would you report results?
# TODO: Your code here

Integration and Best Practices

When to Use Each Evaluation Type

Deterministic Evaluators when:

  • Clear right/wrong answers exist
  • Output format matters (structured data)
  • Security/safety constraints (no PII leakage)
  • Performance requirements (latency, cost)

LLM as Judge when:

  • Multiple valid answers exist
  • Quality is subjective (helpfulness, tone)
  • Semantic equivalence matters (“Paris” vs “The capital of France is Paris”)

Span-Based Evaluation when:

  • Process correctness matters, not just output
  • Multi-step reasoning needs validation
  • Tool usage patterns are important
  • Debugging complex agent behaviors

Tips for Effective Evaluation

Start Small, Grow Gradually:

  • Begin with 5-10 cases covering main scenarios
  • Add cases as you discover failures
  • Prioritize cases that would impact users most

Balance Coverage and Maintainability:

  • Don’t try to test everything
  • Focus on high-risk or high-value scenarios
  • Remove redundant cases

Make Evaluators Specific and Clear:

  • Good rubric: “Score 0-10 on factual accuracy. Check claims against provided context.”
  • Bad rubric: “Score the quality of the response.”

Version Control Your Datasets:

  • Store datasets as YAML/JSON in git
  • Track changes over time
  • Share across team
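
A minimal sketch of what this can look like with Pydantic Evals, assuming the to_file / from_file helpers (the filename is illustrative):

# Write the dataset to YAML so it can be committed to git and reviewed in PRs
intent_dataset_with_evals.to_file("intent_cases.yaml")

# Later (e.g. in CI), load it back with matching type parameters
loaded = Dataset[str, str].from_file("intent_cases.yaml")
print(f"Loaded {len(loaded.cases)} cases from file")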

Automate Where Possible:

  • Run evals in CI/CD pipeline
  • Block deploys if pass rate drops
  • Generate alerts for regressions
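
As a sketch of what a CI gate could look like, assuming the report and case attributes used earlier in this lecture (the threshold is arbitrary):

# Hypothetical CI gate: fail the build if the pass rate drops below a threshold
async def ci_gate(dataset, task, min_pass_rate: float = 0.9) -> None:
    report = await dataset.evaluate(task=task, progress=False)
    passed = sum(
        all(a.value for a in case.assertions.values())
        for case in report.cases
    )
    pass_rate = passed / len(report.cases)
    print(f"Pass rate: {pass_rate:.0%}")
    assert pass_rate >= min_pass_rate, (
        f"Pass rate {pass_rate:.0%} is below the required {min_pass_rate:.0%}"
    )

# Example usage in a CI job:
# await ci_gate(intent_dataset_with_evals, classify_intent)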

Exercise: Evaluation Strategy Design

Choose one scenario:

  1. E-commerce support agent: Handles returns, shipping, account questions
  2. Code review agent: Reviews pull requests, suggests improvements
  3. Data analysis agent: Answers questions about datasets using pandas

For your chosen scenario:

  1. Design an evaluation strategy:
    • What’s in your test dataset? (10+ cases)
    • What evaluators would you use?
    • How would you measure success?
  2. Describe your development workflow:
    • When do you run evals?
    • What metrics do you track?
    • How do you decide when to deploy?
  3. Plan for production:
    • How do you handle failures?
    • When do you update your dataset?
    • What triggers re-evaluation?

Connections to Course Themes

Game Theory and Evaluation

  • Agent alignment: Evaluation as mechanism design

    • You design rubrics (rules) to incentivize desired behaviors
    • LLM as Judge is like a referee in a game
    • Pass/fail thresholds create strategic constraints
  • Adversarial evaluation: Red team vs Blue team

    • Attackers try to make agent fail (jailbreaking, prompt injection)
    • Defenders build evals that catch these attacks
    • Nash equilibrium between robustness and capability

Network Effects in AI Systems

  • Evaluation datasets as networks:

    • Cases can have dependencies (one builds on another)
    • Failures can cascade (if base functionality breaks, many cases fail)
    • Coverage metrics: are there “clusters” of untested scenarios?
  • Agent-to-agent evaluation:

    • Multi-agent systems need coordinated evaluation
    • Agent A’s outputs become Agent B’s inputs
    • Network of evals reflects agent interaction topology

Emergence in Complex AI Systems

  • Emergent behaviors in multi-step agents:

    • Simple agent + simple tools → complex behaviors
    • Can’t predict all outcomes from components
    • Evaluation discovers emergent capabilities (and failures)
  • Evaluation as an ABM simulation:

    • Each test case is like running the simulation once
    • Aggregate results reveal patterns
    • Edge cases show boundary conditions of agent “behavior space”

Summary and Key Takeaways

What We Learned

  1. Why Evals Matter

    • AI systems are non-deterministic
    • Systematic testing catches issues before production
    • Evals enable confident iteration and deployment
  2. Core Framework: Pydantic Evals

    • Cases: individual test scenarios
    • Datasets: collections of cases
    • Evaluators: scoring mechanisms (deterministic, LLM, custom)
    • Experiments: runs that generate reports
  3. Evaluation Strategies

    • Deterministic checks for clear criteria
    • LLM as Judge for subjective quality
    • Span-based evaluation for process correctness
    • Custom evaluators for domain-specific needs
    • RAG-specific metrics for retrieval + generation
  4. Best Practices

    • Start small, iterate based on failures
    • Balance coverage with maintainability
    • Version control datasets and track metrics
    • Integrate into development and deployment workflows

Looking Forward

  • Evaluations are “an emerging art/science”
  • No single “right” approach exists
  • Adapt techniques to your domain and constraints
  • Key principle: Test systematically, deploy confidently

Final Exercise: Reflection

Think about an AI agent you might build:

  1. What are the top 3 risks or failure modes?
  2. How would you design evals to catch those?
  3. What would “success” look like quantitatively?