Evals (beta)¶
View and analyze your evaluation results in Pydantic Logfire's web interface. Evals provide observability into how your AI systems perform across different test cases and experiments over time.
Code-First Evaluation
Evals are created and run using the Pydantic Evals package, which is developed in tandem with Pydantic AI. Logfire serves as an observability layer where you can view and compare results.
To get started, refer to the Pydantic Evals installation guide.
What are Evals?¶
Evals help you systematically test and evaluate AI systems by running them against predefined test cases. Each evaluation experiment appears in Logfire automatically when you run the pydantic_evals.Dataset.evaluate method with Logfire integration enabled.
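For example, here is a minimal sketch of running an experiment with the pydantic_evals package (the task function, case contents, and choice of built-in evaluator are illustrative, not prescribed by Logfire):

```python
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

# Send data to Logfire; requires a write token (e.g. the LOGFIRE_TOKEN env var).
logfire.configure()


async def answer_question(question: str) -> str:
    """Toy task standing in for the AI system under test."""
    return 'Paris' if 'France' in question else "I don't know"


dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
    evaluators=[EqualsExpected()],
)

# Each call to evaluate_sync produces one experiment, which appears in the Evals tab.
report = dataset.evaluate_sync(answer_question)
report.print()
```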
For the data model, examples, and full documentation on creating and running Evals, read the Pydantic Evals docs.
Viewing Experiments¶
The Evals tab shows all experiments for your project that fall within your data retention period. Each experiment represents a single run of a dataset against a task function.
Experiment List¶
Each experiment displays:
- Experiment name - Auto-generated by Logfire (e.g., "gentle-sniff-buses")
- Task name - The function being evaluated
- Span link - Direct link to the detailed trace
- Created timestamp - When the experiment was run
Click on any experiment to view detailed results.
Experiment Details¶
Individual experiment pages show comprehensive results, including:
- Test cases with inputs, expected outputs, and actual outputs
- Assertion results - Pass/fail status for each evaluator
- Performance metrics - Duration, token usage, and custom scores
- Evaluation scores - Detailed scoring from all evaluators (see the evaluator sketch after this list)
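The assertion and score values above come from the evaluators attached to the dataset. As a rough sketch (the class name and pass/fail logic are illustrative), a custom evaluator subclasses Evaluator and returns a bool for a pass/fail assertion, or a number for a score:

```python
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class MentionsExpectedAnswer(Evaluator):
    """Illustrative evaluator for a string-in, string-out task."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # Returning a bool is recorded as a pass/fail assertion in the experiment
        # details; returning a float would be recorded as a score instead.
        if ctx.expected_output is None:
            return False
        return ctx.expected_output.lower() in ctx.output.lower()
```

Attach it with `Dataset(cases=..., evaluators=[MentionsExpectedAnswer()])` and its result appears per case in the experiment details.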
Comparing Experiments¶
Use the experiment comparison view to analyze performance across different runs:
1. Select multiple experiments from the list
2. Click Compare selected
3. View side-by-side results for the same test cases
The comparison view highlights:
- Differences in outputs between experiment runs
- Score variations across evaluators
- Performance changes in metrics like duration and token usage
- Regression detection when comparing baseline vs. current implementations (see the sketch after this list)
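To produce two experiments worth comparing, run the same dataset against a baseline task and a modified one; each call creates a separate experiment. A minimal sketch, with both task implementations invented for illustration:

```python
import logfire
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

logfire.configure()

dataset = Dataset(
    cases=[
        Case(
            name='capital_of_france',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
    ],
    evaluators=[EqualsExpected()],
)


async def answer_baseline(question: str) -> str:
    return 'Paris' if 'France' in question else "I don't know"


async def answer_candidate(question: str) -> str:
    # A hypothetical revised implementation you want to check for regressions.
    return 'Paris'


# Each call creates its own experiment; select both in the Evals tab and
# click "Compare selected" to diff outputs, scores, duration, and token usage.
dataset.evaluate_sync(answer_baseline)
dataset.evaluate_sync(answer_candidate)
```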
Integration with Traces¶
Every evaluation experiment generates detailed OpenTelemetry traces that appear in Logfire:
- Experiment span - Root span containing all evaluation metadata
- Case execution spans - Individual test case runs with full context
- Task function spans - Detailed tracing of your AI system under test
- Evaluator spans - Scoring and assessment execution details
Navigate from experiment results to full trace details using the span links.
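How much detail appears in the task function spans depends on the instrumentation inside your task. As a sketch (the span names and task body are illustrative), spans emitted with Logfire's tracing API while a case runs are nested under that case's execution span:

```python
import logfire

logfire.configure()


async def answer_question(question: str) -> str:
    # Spans opened while the task runs are nested under the case execution span,
    # so they are visible when you follow the span link from the experiment results.
    with logfire.span('retrieve context', question=question):
        context = 'France is a country in Europe. Its capital is Paris.'
    with logfire.span('generate answer'):
        return 'Paris' if 'France' in question and 'Paris' in context else "I don't know"
```

If your task runs a Pydantic AI agent, instrumenting it (for example with logfire.instrument_pydantic_ai()) adds agent and model spans to the same trace.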