Computational Analysis of Social Complexity
Fall 2025, Spencer Lyon
Prerequisites
- Pydantic AI Agents and Tools
- Python programming fundamentals
- Basic understanding of testing concepts
Outcomes
- Understand why systematic evaluation is critical for AI systems
- Identify when and what to evaluate in AI agents
- Implement deterministic and LLM-based evaluators
- Design evaluation datasets using code-first approaches
- Analyze and compare evaluation results across experiments
- Connect evaluation practices to production deployment concerns
References
- Pydantic AI Evals Documentation
- Pydantic Evals API Reference
- Testing practices from software engineering
Introduction: The AI Testing Problem¶
Why Testing AI is Different¶
- Traditional software testing: deterministic inputs → deterministic outputs
  - If add(2, 3) returns 5 once, it always will
  - Clear pass/fail criteria
- AI systems: same input → potentially different outputs
  - Ask an agent “What’s the capital of France?” twice, might get:
    - “The capital of France is Paris.”
    - “Paris is France’s capital city.”
    - “France’s capital is Paris, known for the Eiffel Tower.”
  - All correct, but different!
- Key challenge: How do we test something that’s non-deterministic?
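To make the contrast concrete, here is a toy sketch (illustrative code only, not part of any library):
# Traditional code: an exact assertion works because the output never changes
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 3) == 5  # passes every single run

# AI output: the wording can change run to run, so exact string matching is brittle.
agent_answer = "Paris is France's capital city."  # one of many valid phrasings
# assert agent_answer == "The capital of France is Paris."  # would fail despite being correct
assert "paris" in agent_answer.lower()  # crude semantic check instead of exact match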
Motivating Scenario: The Support Bot Problem¶
- You’ve built a customer support agent using Pydantic AI
- It handles questions about product returns, shipping, and account issues
- In development, it seems to work great on the examples you tried
- You deploy to production and...
- Sometimes gives outdated return policy information
- Occasionally “hallucinates” shipping partners that don’t exist
- Has trouble with edge cases you didn’t think to test
- Question: How could systematic evaluation have caught these issues?
What We’ll Learn¶
- We’ll explore Pydantic Evals, a framework for testing AI systems
- Three main components:
- Datasets: Collections of test scenarios
- Evaluators: Scoring mechanisms to check outputs
- Experiments: Runs that combine datasets and evaluators
- Think of it like unit testing for AI:
- Cases + Evaluators = individual unit tests
- Datasets = test suites
- Experiments = running your entire test suite
Setup¶
Install and Import Dependencies¶
We’ll need Pydantic AI and the evals module:
# Install pydantic-ai if needed
# %pip install pydantic-ai[anthropic,evals] pydantic-evals
from pydantic_ai import Agent
from pydantic_evals import Dataset, Case
from pydantic_evals.evaluators import (
Evaluator,
EvaluatorContext,
EvaluationReason,
EqualsExpected,
Contains,
IsInstance,
MaxDuration,
LLMJudge,
HasMatchingSpan,
)
from pydantic import BaseModel
import os
Configure API Keys¶
Make sure you have your Anthropic API key set:
from dotenv import load_dotenv
import nest_asyncio
load_dotenv()
nest_asyncio.apply()

# Your API key should be set in environment
# os.environ["ANTHROPIC_API_KEY"] = "your-key-here"
# Verify it's set
assert "ANTHROPIC_API_KEY" in os.environ, "Please set ANTHROPIC_API_KEY environment variable"Core Concepts: The Evaluation Framework¶
Structure: Cases, Datasets, and Experiments¶
Cases: Individual test scenarios
- Like a single unit test
- Contains:
  - inputs: Data you pass to the agent
  - expected_output: (Optional) What you expect back
  - metadata: (Optional) Context about this test
  - evaluators: (Optional) Case-specific checks
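For example, a single Case using all four fields might look like this (the values, and the choice of a case-level Contains check, are illustrative):
# One case exercising every field (hypothetical values)
example_case = Case(
    name="return_request_with_metadata",
    inputs="I want to return my order",
    expected_output="return",
    metadata={"difficulty": "easy", "source": "hand-written"},
    evaluators=[Contains("return")],  # a check applied only to this case
)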
Datasets: Collections of cases
- Like a test suite
- Groups related test scenarios
- Can have dataset-level evaluators that apply to all cases
Experiments: Evaluation runs
- Like running pytest or Julia's test runner
- Executes your task function on all cases
- Applies evaluators to score outputs
- Generates reports with results
Example: Simple Support Bot Dataset¶
Let’s build intuition with a concrete example.
Suppose we want to test our support bot’s ability to identify the user’s intent:
# Define test cases
intent_dataset = Dataset[str, str](
name="Intent Classification Tests",
cases=[
Case(
name="return_request",
inputs="I want to return my order",
expected_output="return"
),
Case(
name="shipping_status",
inputs="Where is my package?",
expected_output="shipping"
),
Case(
name="account_question",
inputs="How do I reset my password?",
expected_output="account"
)
]
)
print(f"Dataset: {intent_dataset.name}")
print(f"Number of cases: {len(intent_dataset.cases)}")Dataset: Intent Classification Tests
Number of cases: 3
Note the pattern:
- Each case tests one scenario
- We specify what we expect
- Cases are typed: Dataset[str, str] means string inputs → string outputs
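Inputs and outputs don't have to be strings. A sketch with a Pydantic model as the input type (the Ticket model and case values are hypothetical):
# Hypothetical structured input type
class Ticket(BaseModel):
    subject: str
    body: str

ticket_dataset = Dataset[Ticket, str](
    name="Structured Ticket Intents",
    cases=[
        Case(
            name="structured_return_request",
            inputs=Ticket(subject="Refund", body="I want to return my order"),
            expected_output="return",
        ),
    ],
)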
Exercise 1: Design Your Own Cases¶
Scenario: You’re building a sentiment analysis agent for product reviews.
Task: Following the pattern above, create a Dataset[str, str] with 3-5 test cases for an agent that should classify product reviews as:
- “positive”
- “negative”
- “neutral”
Think about:
- What review texts represent typical positive/negative/neutral cases?
- What are edge cases? (e.g., mixed sentiments, sarcasm)
- Make sure each Case has:
  - A descriptive name
  - inputs (the review text)
  - expected_output (the sentiment label)
# TODO: Your code here
# Create a sentiment analysis dataset following the pattern above
sentiment_dataset = Dataset[str, str](
name="Product Sentiment Analysis",
cases=[
# TODO: Add your test cases here
# Case(
# name="clearly_positive",
# inputs="This product exceeded my expectations! ...",
# expected_output="positive"
# ),
]
)
print(f"Created dataset with {len(sentiment_dataset.cases)} cases")Created dataset with 0 cases
Evaluators: How to Score Outputs¶
Two Types of Evaluation¶
Deterministic Evaluators: Code-based checks
- Exact matches
- Type checking
- Format validation (email, phone number, URL)
- PII detection
- Regular expression matching
Non-Deterministic Evaluators: Subjective assessment
- LLM as judge
- Human evaluation
- Quality metrics (accuracy, relevance, helpfulness)
1. Exact Match¶
# Check if output equals expected value
evaluator = EqualsExpected()
print(f"Evaluator: {evaluator}")
Evaluator: EqualsExpected()
2. Type Checking¶
# Ensure output is correct type
evaluator = IsInstance('str')
print(f"Evaluator: {evaluator}")Evaluator: IsInstance(type_name='str')
3. Membership/Contains¶
# Check if key phrase appears in output
evaluator = Contains('return policy')
print(f"Evaluator: {evaluator}")Evaluator: Contains(value='return policy')
4. Performance Constraints¶
# Ensure agent responds quickly enough
evaluator = MaxDuration(seconds=2.0) # 2 seconds max
print(f"Evaluator: {evaluator}")Evaluator: MaxDuration(seconds=2.0)
Example: Adding Evaluators to Dataset¶
intent_dataset_with_evals = Dataset[str, str](
name="Intent Classification Tests",
cases=[
Case(
name="return_request",
inputs="I want to return my order",
expected_output="return"
),
Case(
name="shipping_status",
inputs="Where is my package?",
expected_output="shipping"
),
Case(
name="account_question",
inputs="How do I reset my password?",
expected_output="account"
),
],
evaluators=[
EqualsExpected(), # Applied to all cases
MaxDuration(seconds=2.0) # Response time check
]
)
print(f"Dataset has {len(intent_dataset_with_evals.evaluators)} evaluators")
print(f"Dataset has {len(intent_dataset_with_evals.cases)} cases")Dataset has 2 evaluators
Dataset has 3 cases
LLM as Judge: When Correctness is Subjective¶
- Sometimes there’s no single “correct” answer
- Example: “Write a friendly response to this complaint”
- Many valid responses exist
- Hard to check with deterministic rules
- Solution: Use another LLM to evaluate
LLMJudge Evaluator:¶
judge = LLMJudge(
rubric=(
"Score from 0-10 on friendliness and helpfulness. "
"Friendly responses should acknowledge the customer's frustration. "
"Helpful responses should offer concrete next steps."
),
model='anthropic:claude-haiku-4-5'
)
print("Created LLMJudge evaluator")Created LLMJudge evaluator
How it works:
- Your agent generates an output
- LLMJudge sends that output + rubric to an LLM
- LLM scores the output based on the rubric
- Score is recorded in the evaluation report
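A minimal sketch of how the judge plugs into the dataset pattern from earlier (the dataset and case are illustrative; judge is the LLMJudge created above):
# Attach LLMJudge as a dataset-level evaluator; no expected_output is given,
# since many different responses could be acceptable.
complaint_dataset = Dataset[str, str](
    name="Complaint Response Quality",
    cases=[
        Case(
            name="late_delivery_complaint",
            inputs="My order is two weeks late and nobody answered my emails!",
        ),
    ],
    evaluators=[judge],  # the LLMJudge defined above scores every case's output
)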
Custom Evaluators: Domain-Specific Checks¶
You can create custom evaluators by subclassing Evaluator:
class ContainsURL(Evaluator):
"""Check if output contains a valid URL."""
def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
output = ctx.output
# Simple URL detection
has_url = 'http://' in str(output) or 'https://' in str(output)
return EvaluationReason(
value=has_url,
explanation="URL found" if has_url else "No URL found"
)
# Test it
url_checker = ContainsURL()
print("Created custom ContainsURL evaluator")Created custom ContainsURL evaluator
Key points:
- Implement the evaluate method (can be sync or async)
- Access inputs/outputs through EvaluatorContext
- Return EvaluationReason with a value and an explanation
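Based on my reading of the Evaluator API (verify against the pydantic_evals docs for your installed version), an evaluator may also return a plain value such as a float score instead of an EvaluationReason. A hedged sketch:
class ResponseBrevity(Evaluator):
    """Illustrative scorer: shorter responses score closer to 1.0."""
    def evaluate(self, ctx: EvaluatorContext) -> float:
        n_words = len(str(ctx.output).split())
        # Linearly penalize long outputs; clamp at 0 (a demo heuristic, not a standard metric)
        return max(0.0, 1.0 - n_words / 100)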
Exercise 2: Design Evaluators¶
For the sentiment analysis agent from Exercise 1:
- What deterministic evaluators would you use?
- Think about checking exact label matches, valid sentiment values
- What subjective aspects might need LLM evaluation?
- Example: When a review has mixed sentiment, is the chosen label reasonable?
- What would your LLMJudge rubric say?
- Design one custom evaluator for a domain-specific check
- Example: “Output should be all lowercase” or “Response time should be fast”
# TODO: Your code here
# Create evaluators for sentiment analysis
# Example deterministic evaluator
# sentiment_dataset.evaluators.append(EqualsExpected())
# Example custom evaluator
# class ValidSentiment(Evaluator):
# """Check if output is a valid sentiment label."""
# def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
# valid_sentiments = {"positive", "negative", "neutral"}
# is_valid = ctx.output.lower() in valid_sentiments
# return EvaluationReason(
# value=is_valid,
# explanation=f"Output is {'valid' if is_valid else 'invalid'} sentiment"
# )
Running Evaluations: From Code to Reports¶
Step 1: Define the agent and task function¶
# Create a simple intent classifier agent
support_agent = Agent(
'anthropic:claude-haiku-4-5',
system_prompt=(
"You are a customer support intent classifier. "
"Classify user messages into one of: return, shipping, account. "
"Respond with ONLY the classification label, nothing else."
)
)
async def classify_intent(inputs: str) -> str:
"""Our task: classify customer message intent."""
result = await support_agent.run(inputs)
return result.output.lower().strip()
print("Created intent classification agent")Created intent classification agent
Step 2: Run evaluation¶
# Run experiment
report = await intent_dataset_with_evals.evaluate(
task=classify_intent,
max_concurrency=3, # Run 3 cases in parallel
progress=True # Show progress bar
)
print("\nEvaluation complete!")
print(f"Evaluated {len(report.cases)} cases")Step 3: Analyze results¶
# Print formatted report
report.print()
print("\n" + "="*50)
print("Detailed results:")
print("="*50)
# Or get the data programmatically
for case in report.cases:
print(f"\nCase: {case.name}")
print(f" Input: {case.inputs}")
print(f" Output: {case.output}")
print(f" Expected: {case.expected_output}")
print(f" Assertions: {case.assertions}")
print(f" All passed: {all(case.assertions.values())}")Visualizing Results with Logfire¶
While the printed reports are useful, Pydantic Logfire provides a web UI for visualizing and analyzing evaluation results over time.
Why use Logfire?
- Interactive dashboards for evaluation metrics
- Trace exploration for debugging failed cases
- Track trends across multiple evaluation runs
- Team collaboration and sharing results
Let’s configure Logfire integration:
import logfire
# Configure Logfire
# This will automatically send evaluation results to Logfire if token is present
logfire.configure(
    send_to_logfire='if-token-present',
)
print("Logfire configured!")
print("Future evaluations will automatically appear in Logfire web UI")
Logfire configured!
Future evaluations will automatically appear in Logfire web UI
Logfire project URL: https://logfire-us.pydantic.dev/sglyon/cap-6318-example
Note: With Logfire configured, all subsequent evaluations will automatically send their results to the Logfire web UI. You can then:
- View evaluation reports in an interactive dashboard
- Explore individual traces and spans
- Compare results across multiple runs
- Set up alerts for failing evaluations
Visit the Logfire project URL printed above to explore the results interactively.
Understanding Evaluation Results¶
Comparing Experiments: Tracking Improvements¶
Let’s modify our agent and compare results:
# Create an improved agent with better prompt
improved_agent = Agent(
'anthropic:claude-haiku-4-5',
system_prompt=(
"You are a customer support intent classifier. "
"Classify user messages into exactly one of these categories:\n"
"- return: for return/refund requests\n"
"- shipping: for delivery/tracking questions\n"
"- account: for login/password/profile issues\n\n"
"Respond with ONLY the classification label in lowercase, nothing else."
)
)
async def improved_classify_intent(inputs: str) -> str:
"""Improved task function."""
result = await improved_agent.run(inputs)
return result.output.lower().strip()
# Run evaluation with improved agent
improved_report = await intent_dataset_with_evals.evaluate(
task=improved_classify_intent,
max_concurrency=3,
progress=True
)
# Compare against baseline
improved_report.print(baseline=report)
Exercise 3: Run Your First Evaluation¶
Using the sentiment analysis agent you designed in Exercise 1:
- Create an agent that performs sentiment classification
- Use your dataset from Exercise 1 (add more cases if needed)
- Add 2-3 evaluators (mix of built-in and custom from Exercise 2)
- Run the evaluation using await dataset.evaluate(task=your_function)
- Interpret the results:
- Which cases passed/failed?
- What patterns do you notice?
- What would you improve about the agent or the test cases?
# TODO: Your code here
Advanced Topics: Span-Based Evaluation and Dataset Generation¶
Span-Based Evaluation: Evaluating the Process, Not Just the Output¶
The Problem: Sometimes the final answer is correct, but the how matters
- Example: Math problem solving
- Output: “42” ✓ Correct!
- But did the agent:
- Use the right formula?
- Show its work?
- Make calculation errors that happened to cancel out?
Spans: Execution traces from OpenTelemetry
- Capture what the agent did internally
- Tool calls made
- LLM requests and responses
- Intermediate reasoning steps
HasMatchingSpan Evaluator:¶
# Ensure agent called a specific tool
span_evaluator = HasMatchingSpan(
query={'name_contains': 'calculator_tool'},
evaluation_name='used_calculator'
)
print("Created span-based evaluator")
print("This evaluator checks that a span named 'calculator_tool' was called")Why this matters:
- Catches “lucky guesses” where agent gets answer right for wrong reasons
- Validates agent is following intended reasoning process
- Useful for multi-step tasks where process correctness matters
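For context, a hedged sketch of the kind of agent this evaluator might be pointed at (the agent and tool names are hypothetical; exact span names depend on how instrumentation records tool calls, so check your traces first):
# An agent with a calculator tool; when instrumented, its tool calls show up as spans
math_agent = Agent(
    'anthropic:claude-haiku-4-5',
    system_prompt="Answer arithmetic questions. Always use calculator_tool to compute.",
)

@math_agent.tool_plain
def calculator_tool(expression: str) -> str:
    """Evaluate a simple arithmetic expression and return the result as text."""
    return str(eval(expression))  # demo only; never eval untrusted input in real code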
Generating Datasets with LLMs¶
The Challenge: Creating comprehensive test datasets is tedious
- Need diverse inputs covering edge cases
- Need correct expected outputs
- Manual creation doesn’t scale
Solution: Use an LLM to generate test cases
from pydantic_evals.generation import generate_dataset
# Generate dataset for sentiment analysis
generated_dataset = await generate_dataset(
dataset_type=Dataset[str, str], # Input and output types
n_examples=5, # Generate 5 test cases
extra_instructions=(
"Create diverse restaurant reviews with clear sentiment. "
"Input should be a restaurant review. "
"Output should be the sentiment: positive, negative, or neutral."
),
model='anthropic:claude-haiku-4-5'
)
print(f"Generated {len(generated_dataset.cases)} test cases")
for i, case in enumerate(generated_dataset.cases[:3]):
print(f"\nCase {i+1}:")
print(f" Input: {case.inputs}")
print(f" Expected: {case.expected_output}")How it works:
- You specify the dataset type with input/output types (e.g., Dataset[str, str])
- The LLM generates diverse test scenarios based on your instructions
- Returns a properly structured Dataset object with generated cases
- Can optionally save to a file for version control (using the path parameter)
Best practices:
- Review generated cases before using
- Mix generated and hand-crafted cases
- Regenerate periodically to expand coverage
- Use extra_instructions to guide the LLM toward specific edge cases
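To version-control generated cases, a round trip along these lines should work (the file name is illustrative; double-check the exact method names used here, to_file/from_file, against the pydantic_evals version you have installed):
# Save the generated cases to YAML so they can be reviewed and committed to git
generated_dataset.to_file('restaurant_sentiment_cases.yaml')

# Later (e.g., in CI), reload the reviewed dataset and evaluate against it
reloaded_dataset = Dataset[str, str].from_file('restaurant_sentiment_cases.yaml')
print(f"Reloaded {len(reloaded_dataset.cases)} cases")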
Evaluating RAG Systems: A Two-Stage Challenge¶
What is RAG?¶
- RAG = Retrieval-Augmented Generation
- Common pattern for AI agents that need to answer questions about documents/data
- Two stages:
- Retrieval: Find relevant documents/passages from knowledge base
- Generation: Use retrieved context to generate answer
Why RAG Evaluation is Different¶
- Traditional evaluation: just check the final answer
- RAG evaluation: need to check both stages
- Is the retrieval finding the right documents?
- Is the generation using those documents correctly?
- Failure can happen at either stage (or both!)
Example Failure Modes:
- ✓ Retrieval works, ✗ Generation fails: Found right docs, but hallucinated answer
- ✗ Retrieval fails, ✓ Generation works: Couldn’t find relevant docs, so generated plausible but wrong answer
- ✗ Both fail: Retrieved irrelevant docs and made up information
RAG Evaluation Metrics¶
Retrieval Metrics (Is the retrieval working?)
class PrecisionAtK(Evaluator):
"""Check if retrieved documents are relevant."""
def __init__(self, k: int = 5):
self.k = k
def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
# Assume ctx.metadata contains retrieved doc IDs and ground truth
retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
relevant_docs = ctx.metadata.get('relevant_doc_ids', [])
relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
precision = len(relevant_retrieved) / self.k if self.k > 0 else 0
return EvaluationReason(
value=precision,
explanation=f"Retrieved {len(relevant_retrieved)}/{self.k} relevant docs"
)
class RecallAtK(Evaluator):
"""Check if all relevant documents were found."""
def __init__(self, k: int = 5):
self.k = k
def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
retrieved_docs = ctx.metadata.get('retrieved_doc_ids', [])[:self.k]
relevant_docs = ctx.metadata.get('relevant_doc_ids', [])
relevant_retrieved = set(retrieved_docs) & set(relevant_docs)
recall = len(relevant_retrieved) / len(relevant_docs) if len(relevant_docs) > 0 else 0
return EvaluationReason(
value=recall,
explanation=f"Found {len(relevant_retrieved)}/{len(relevant_docs)} relevant docs"
)
print("Created RAG retrieval evaluators")Generation Metrics (Is the generation working?)
class Faithfulness(Evaluator):
"""Check if answer is grounded in retrieved context."""
async def evaluate(self, ctx: EvaluatorContext) -> EvaluationReason:
answer = ctx.output
context = ctx.metadata.get('retrieved_context', '')
# Use LLM to judge faithfulness
judge = Agent('anthropic:claude-haiku-4-5')
result = await judge.run(
f"Context: {context}\n\n"
f"Answer: {answer}\n\n"
f"Is the answer fully supported by the context? "
f"Respond with YES, NO, or PARTIAL and explain why."
)
assessment = result.output  # use .output, consistent with the other agent calls above
is_faithful = assessment.startswith('YES')
return EvaluationReason(
value=is_faithful,
explanation=assessment
)
print("Created Faithfulness evaluator")Best Practices for RAG Evaluation¶
1. Evaluate Stages Independently
Create separate datasets for retrieval and generation to pinpoint where failures occur.
# Example: Retrieval evaluation dataset
retrieval_dataset = Dataset(
name="Retrieval Quality",
cases=[
Case(
inputs={"query": "What are the return policies?"},
metadata={
"relevant_doc_ids": ["doc_42", "doc_87"], # Ground truth
"retrieved_doc_ids": ["doc_42", "doc_87", "doc_13", "doc_99", "doc_5"], # Simulated retrieval
}
),
],
evaluators=[
PrecisionAtK(k=5),
RecallAtK(k=5),
]
)
print("Created retrieval evaluation dataset")Why separate?
- Pinpoints where failures occur
- Can optimize retrieval and generation independently
- Clearer diagnosis: “Our retrieval is great but generation hallucinates” vs “Both need work”
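As a counterpart to the retrieval dataset above, a generation-stage dataset might look like this (the query, context snippet, and dataset name are illustrative):
# Generation-stage evaluation: assume retrieval is perfect and check groundedness
generation_dataset = Dataset(
    name="Generation Faithfulness",
    cases=[
        Case(
            inputs={"query": "What are the return policies?"},
            metadata={
                "retrieved_context": "Items may be returned within 30 days with a receipt.",
            },
        ),
    ],
    evaluators=[
        Faithfulness(),  # the custom LLM-backed evaluator defined earlier
    ],
)
print("Created generation evaluation dataset")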
Exercise: Design RAG Evaluation¶
Scenario: You’re building a RAG system that answers questions about a company’s internal documentation.
Tasks:
- Identify failure modes: What are 3 ways this RAG system could fail?
- Design retrieval tests: What cases would test if retrieval is working?
- What queries should always retrieve specific documents?
- What edge cases might break retrieval?
- Design generation tests: Assuming perfect retrieval, how do you test generation?
- What makes a “good” answer?
- How do you detect hallucinations?
- Create evaluation pipeline: Sketch code for evaluating both stages
- What metrics would you track?
- How would you report results?
# TODO: Your code here
Integration and Best Practices¶
When to Use Each Evaluation Type¶
Deterministic Evaluators when:
- Clear right/wrong answers exist
- Output format matters (structured data)
- Security/safety constraints (no PII leakage)
- Performance requirements (latency, cost)
LLM as Judge when:
- Multiple valid answers exist
- Quality is subjective (helpfulness, tone)
- Semantic equivalence matters (“Paris” vs “The capital of France is Paris”)
Span-Based Evaluation when:
- Process correctness matters, not just output
- Multi-step reasoning needs validation
- Tool usage patterns are important
- Debugging complex agent behaviors
Tips for Effective Evaluation¶
Start Small, Grow Gradually:
- Begin with 5-10 cases covering main scenarios
- Add cases as you discover failures
- Prioritize cases that would impact users most
Balance Coverage and Maintainability:
- Don’t try to test everything
- Focus on high-risk or high-value scenarios
- Remove redundant cases
Make Evaluators Specific and Clear:
- Good rubric: “Score 0-10 on factual accuracy. Check claims against provided context.”
- Bad rubric: “Score the quality of the response.”
Version Control Your Datasets:
- Store datasets as YAML/JSON in git
- Track changes over time
- Share across team
Automate Where Possible:
- Run evals in CI/CD pipeline
- Block deploys if pass rate drops
- Generate alerts for regressions
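A minimal CI-gate sketch building on the intent classifier from earlier (the 90% threshold and script structure are illustrative, not a prescribed workflow):
import asyncio

def ci_eval_gate() -> int:
    """Run the eval suite and return a non-zero exit code if the pass rate drops."""
    report = asyncio.run(intent_dataset_with_evals.evaluate(task=classify_intent))
    passed = sum(all(case.assertions.values()) for case in report.cases)
    pass_rate = passed / len(report.cases)
    print(f"Pass rate: {pass_rate:.0%}")
    return 0 if pass_rate >= 0.9 else 1

# In a CI script: raise SystemExit(ci_eval_gate())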
Exercise: Evaluation Strategy Design¶
Choose one scenario:
- E-commerce support agent: Handles returns, shipping, account questions
- Code review agent: Reviews pull requests, suggests improvements
- Data analysis agent: Answers questions about datasets using pandas
For your chosen scenario:
- Design an evaluation strategy:
- What’s in your test dataset? (10+ cases)
- What evaluators would you use?
- How would you measure success?
- Describe your development workflow:
- When do you run evals?
- What metrics do you track?
- How do you decide when to deploy?
- Plan for production:
- How do you handle failures?
- When do you update your dataset?
- What triggers re-evaluation?
Connections to Course Themes¶
Game Theory and Evaluation¶
Agent alignment: Evaluation as mechanism design
- You design rubrics (rules) to incentivize desired behaviors
- LLM as Judge is like a referee in a game
- Pass/fail thresholds create strategic constraints
Adversarial evaluation: Red team vs Blue team
- Attackers try to make agent fail (jailbreaking, prompt injection)
- Defenders build evals that catch these attacks
- Nash equilibrium between robustness and capability
Network Effects in AI Systems¶
Evaluation datasets as networks:
- Cases can have dependencies (one builds on another)
- Failures can cascade (if base functionality breaks, many cases fail)
- Coverage metrics: are there “clusters” of untested scenarios?
Agent-to-agent evaluation:
- Multi-agent systems need coordinated evaluation
- Agent A’s outputs become Agent B’s inputs
- Network of evals reflects agent interaction topology
Emergence in Complex AI Systems¶
Emergent behaviors in multi-step agents:
- Simple agent + simple tools → complex behaviors
- Can’t predict all outcomes from components
- Evaluation discovers emergent capabilities (and failures)
Evaluation as an ABM simulation:
- Each test case is like running the simulation once
- Aggregate results reveal patterns
- Edge cases show boundary conditions of agent “behavior space”
Summary and Key Takeaways¶
What We Learned¶
Why Evals Matter
- AI systems are non-deterministic
- Systematic testing catches issues before production
- Evals enable confident iteration and deployment
Core Framework: Pydantic Evals
- Cases: individual test scenarios
- Datasets: collections of cases
- Evaluators: scoring mechanisms (deterministic, LLM, custom)
- Experiments: runs that generate reports
Evaluation Strategies
- Deterministic checks for clear criteria
- LLM as Judge for subjective quality
- Span-based evaluation for process correctness
- Custom evaluators for domain-specific needs
- RAG-specific metrics for retrieval + generation
Best Practices
- Start small, iterate based on failures
- Balance coverage with maintainability
- Version control datasets and track metrics
- Integrate into development and deployment workflows
Looking Forward¶
- Evaluations are “an emerging art/science”
- No single “right” approach exists
- Adapt techniques to your domain and constraints
- Key principle: Test systematically, deploy confidently
Final Exercise: Reflection¶
Think about an AI agent you might build:
- What are the top 3 risks or failure modes?
- How would you design evals to catch those?
- What would “success” look like quantitatively?