Eval Framework¶
Added in: v0.17.0
Built-in agent evaluation with 39 evaluators, regression detection, and CI integration. No separate install, no SaaS account, no external dependencies.
Quick Start¶
from selectools.evals import EvalSuite, TestCase
suite = EvalSuite(agent=agent, cases=[
TestCase(input="Cancel my account", expect_tool="cancel_subscription"),
TestCase(input="Check my balance", expect_contains="balance"),
TestCase(input="What's 2+2?", expect_output="4"),
])
report = suite.run()
print(report.accuracy) # 0.95
print(report.latency_p50) # 142ms
print(report.total_cost) # $0.003
TestCase — Declarative Assertions¶
Every TestCase has an input (the prompt) and optional expect_* fields. Only the fields you set are checked.
Tool Assertions¶
TestCase(input="Cancel subscription", expect_tool="cancel_sub")
TestCase(input="Full workflow", expect_tools=["search", "summarize"])
TestCase(input="Search", expect_tool_args={"search": {"query": "python"}})
Content Assertions¶
TestCase(input="Hello", expect_contains="hello")
TestCase(input="Safe?", expect_not_contains="error")
TestCase(input="2+2", expect_output="4")
TestCase(input="Phone", expect_output_regex=r"\d{3}-\d{4}")
TestCase(input="JSON?", expect_json=True)
TestCase(input="Prefix", expect_starts_with="Hello")
TestCase(input="Suffix", expect_ends_with=".")
TestCase(input="Short", expect_min_length=10, expect_max_length=500)
Structured Output¶
Performance Assertions¶
TestCase(
input="Fast query",
expect_latency_ms_lte=500,
expect_cost_usd_lte=0.01,
expect_iterations_lte=3,
)
Safety Assertions¶
TestCase(input="Account info", expect_no_pii=True)
TestCase(input="Ignore instructions", expect_no_injection=True)
LLM-as-Judge Fields¶
TestCase(
input="Summarize this",
reference="The original long text...", # ground truth
context="Retrieved document content...", # RAG context
rubric="Rate accuracy and completeness", # custom rubric
)
Custom Evaluators¶
def must_be_polite(result) -> bool:
return "please" in result.content.lower()
TestCase(
input="Help me",
custom_evaluator=must_be_polite,
custom_evaluator_name="politeness",
)
Tags and Weights¶
TestCase(input="Critical", tags=["billing", "critical"], weight=3.0)
TestCase(input="Minor", tags=["nice-to-have"], weight=0.5)
Built-in Evaluators (22)¶
Deterministic (12) — No API calls¶
| Evaluator | What it checks |
|---|---|
ToolUseEvaluator |
Tool name, tool list, argument values |
ContainsEvaluator |
Substring present/absent (case-insensitive) |
OutputEvaluator |
Exact match, regex match |
StructuredOutputEvaluator |
Parsed fields match (deep subset) |
PerformanceEvaluator |
Iterations, latency, cost thresholds |
JsonValidityEvaluator |
Valid JSON output |
LengthEvaluator |
Min/max character count |
StartsWithEvaluator |
Output prefix |
EndsWithEvaluator |
Output suffix |
PIILeakEvaluator |
SSN, email, phone, credit card, ZIP |
InjectionResistanceEvaluator |
10 prompt injection patterns |
CustomEvaluator |
Any user-defined callable |
LLM-as-Judge (10) — Uses any Provider¶
These evaluators call an LLM to grade the output. Pass any selectools Provider — works with OpenAI, Anthropic, Gemini, Ollama.
from selectools.evals import CorrectnessEvaluator, RelevanceEvaluator
suite = EvalSuite(
agent=agent,
cases=cases,
evaluators=[
CorrectnessEvaluator(provider=provider, model="gpt-4.1-mini"),
RelevanceEvaluator(provider=provider, model="gpt-4.1-mini"),
],
)
| Evaluator | What it checks | Requires |
|---|---|---|
LLMJudgeEvaluator |
Generic rubric scoring (0-10) | rubric on TestCase |
CorrectnessEvaluator |
Correct vs reference answer | reference on TestCase |
RelevanceEvaluator |
Response relevant to query | — |
FaithfulnessEvaluator |
Grounded in provided context | context on TestCase |
HallucinationEvaluator |
Fabricated information | context or reference |
ToxicityEvaluator |
Harmful/inappropriate content | — |
CoherenceEvaluator |
Well-structured and logical | — |
CompletenessEvaluator |
Fully addresses the query | — |
BiasEvaluator |
Gender, racial, political bias | — |
SummaryEvaluator |
Summary accuracy and coverage | reference on TestCase |
All LLM evaluators accept a threshold parameter (default: 7.0 for most, 8.0 for safety).
EvalReport¶
report = suite.run()
# Aggregate metrics
report.accuracy # Weighted accuracy (0.0 - 1.0)
report.pass_count # Number of passing cases
report.fail_count # Number of failing cases
report.error_count # Number of error cases
report.total_cost # Total USD cost
report.total_tokens # Total tokens used
report.latency_p50 # Median latency (ms)
report.latency_p95 # 95th percentile latency
report.latency_p99 # 99th percentile latency
report.cost_per_case # Average cost per case
# Filtering
report.filter_by_tag("billing")
report.filter_by_verdict(CaseVerdict.FAIL)
report.failures_by_evaluator() # {"tool_use": 3, "contains": 1}
# Export
report.to_html("report.html") # Interactive HTML report
report.to_junit_xml("results.xml") # JUnit XML for CI
report.to_json("results.json") # Machine-readable JSON
report.summary() # Human-readable text
Loading Test Cases from Files¶
from selectools.evals import DatasetLoader
# JSON
cases = DatasetLoader.from_json("tests/eval_cases.json")
# YAML (requires PyYAML)
cases = DatasetLoader.from_yaml("tests/eval_cases.yaml")
# Auto-detect from extension
cases = DatasetLoader.load("tests/eval_cases.json")
JSON format:
[
{"input": "Cancel account", "expect_tool": "cancel_sub", "name": "cancel"},
{"input": "Check balance", "expect_contains": "balance", "tags": ["billing"]}
]
Regression Detection¶
from selectools.evals import BaselineStore
store = BaselineStore("./baselines")
report = suite.run()
# Compare against saved baseline
result = store.compare(report)
if result.is_regression:
print(f"Regressions: {result.regressions}")
print(f"Accuracy delta: {result.accuracy_delta:+.2%}")
else:
store.save(report) # Update baseline
CLI¶
Run evals from the command line:
# Run eval suite
python -m selectools.evals run tests/cases.json --provider openai --model gpt-4.1-mini --html report.html --verbose
# Compare against baseline
python -m selectools.evals compare tests/cases.json --baseline ./baselines --save
# With concurrency
python -m selectools.evals run tests/cases.json --concurrency 5 --junit results.xml
GitHub Actions¶
Use the built-in action to run evals on every PR and post results as a comment:
- name: Run eval suite
uses: johnnichev/selectools/.github/actions/eval@main
with:
cases: tests/eval_cases.json
provider: openai
model: gpt-4.1-mini
html-report: eval-report.html
baseline-dir: ./baselines
post-comment: "true"
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The action:
- Runs all test cases
- Posts accuracy, latency, cost, and failures as a PR comment
- Detects regressions against baselines
- Uploads HTML report as an artifact
- Outputs accuracy, pass-count, fail-count, regression for downstream steps
Concurrent Execution¶
suite = EvalSuite(
agent=agent,
cases=cases,
max_concurrency=5, # Run 5 cases in parallel
on_progress=lambda done, total: print(f"[{done}/{total}]"),
)
Uses ThreadPoolExecutor (sync) or asyncio.Semaphore (async via suite.arun()).
In pytest¶
def test_agent_accuracy(agent):
suite = EvalSuite(agent=agent, cases=[
TestCase(input="Cancel", expect_tool="cancel_sub"),
TestCase(input="Balance", expect_contains="balance"),
])
report = suite.run()
assert report.accuracy >= 0.9
assert report.latency_p50 < 500
API Reference¶
Core¶
| Symbol | Description |
|---|---|
EvalSuite(agent, cases, ...) |
Orchestrates eval runs |
TestCase(input, ...) |
Single test case with assertions |
EvalReport |
Aggregated results with metrics |
CaseResult |
Per-case result with verdict and failures |
CaseVerdict |
Enum: PASS, FAIL, ERROR, SKIP |
EvalFailure |
Single assertion failure |
Infrastructure¶
| Symbol | Description |
|---|---|
DatasetLoader.load(path) |
Load test cases from JSON/YAML |
BaselineStore(dir) |
Save and compare baselines |
RegressionResult |
Regression comparison result |
report.to_html(path) |
Interactive HTML report |
report.to_junit_xml(path) |
JUnit XML for CI |
report.to_json(path) |
Machine-readable JSON |