This example validates a simple MCP server that exposes a fetch tool to retrieve web content. The tests illustrate three styles (decorators, pytest, and legacy assertions) and demonstrate how to combine structural assertions, path constraints, and LLM judges.
## Goals

- Verify the agent calls the `fetch` tool when appropriate
- Check extracted content for known signals (e.g., “Example Domain”)
- Ensure efficient paths (no unnecessary steps)
- Evaluate quality with rubric-based judges (all four goals are combined in the sketch after this list)
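For orientation, here is a minimal decorator-style sketch that exercises all four goals. Treat it as illustrative: the `mcp_eval` import path, the `@task` decorator signature, `agent.generate_str`, `Expect.tools.was_called`, `Expect.content.contains`, and `Expect.performance.max_iterations` are assumptions based on typical mcp-eval usage, not taken from this page; only `Expect.tools.count` and `Expect.judge.llm` appear elsewhere in this example.

```python
from mcp_eval import Expect, task  # import path is an assumption


@task("Fetch https://example.com and summarize the page")  # assumed decorator style
async def test_fetch_and_summarize(agent, session):
    resp = await agent.generate_str("Fetch https://example.com and summarize it")

    # Goal 1: the agent actually called the fetch tool (assumed assertion name).
    await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called")

    # Goal 2: a known signal appears in the extracted content.
    await session.assert_that(
        Expect.content.contains("Example Domain"),
        response=resp,
        name="content_known_signal",
    )

    # Goal 3: efficient path, no unnecessary steps (assumed performance knob).
    await session.assert_that(
        Expect.performance.max_iterations(3), name="efficient_path"
    )

    # Goal 4: rubric-based quality gate (same Expect.judge.llm API shown below).
    await session.assert_that(
        Expect.judge.llm(rubric="Accurately summarizes example.com", min_score=0.8),
        response=resp,
        name="summary_quality",
    )
```

Ordering the deterministic checks before the judge keeps failures cheap to diagnose: structural assertions fail fast before any LLM grading runs.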
## What you’ll learn

- Choosing assertions per outcome type (structural, tool, path, judge)
- Designing resilient tests using immediate vs. deferred checks (see the sketch after this list)
- Reading metrics and span trees to diagnose behavior
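As a concrete example of immediate vs. deferred checks, here is a hedged sketch assuming mcp-eval-style semantics: passing `response=` to `session.assert_that` evaluates the assertion immediately against that response, while omitting it defers evaluation until the session’s final metrics and span tree are available.

```python
resp = await agent.generate_str("Fetch https://example.com and summarize it")

# Immediate: checked right now, against this specific response string.
await session.assert_that(
    Expect.content.contains("Example Domain"),  # assumed assertion name
    response=resp,
    name="immediate_content_check",
)

# Deferred: no response passed, so this is evaluated at session end,
# once the full tool-call history is known (useful for count/path checks).
await session.assert_that(
    Expect.tools.count("fetch", 1),
    name="deferred_fetch_count",
)
```

Deferring count and path assertions avoids false failures when tool calls can still happen after the particular response you are inspecting.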
Why: catches regressions where the agent “hallucinates” content without making tool calls, or switches to an unintended tool.

Tip: combine with `Expect.tools.count("fetch", 1)` to detect duplicate calls.
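A sketch of the pairing this tip describes: `Expect.tools.was_called` is an assumed presence check, while `Expect.tools.count` is quoted from the tip above.

```python
# Presence: the fetch tool was called at all (guards against hallucinated content).
await session.assert_that(Expect.tools.was_called("fetch"), name="fetch_called")

# Cardinality: exactly one call, so duplicate fetches also fail the test.
await session.assert_that(Expect.tools.count("fetch", 1), name="fetch_called_once")
```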
Why: tool responses are often nested structures. Field‑scoped, regex/partial checks are stable across formatting differences and small content changes.
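A sketch of such a field-scoped check, assuming an `Expect.tools.output_matches`-style assertion; the function name and its `field_path`/`match_type` parameters are assumptions for illustration.

```python
# Scope the match to one text field inside the nested tool result, and use a
# regex so minor whitespace/formatting changes don't break the test.
await session.assert_that(
    Expect.tools.output_matches(          # assumed assertion name
        tool_name="fetch",
        expected_output=r"Example\s+Domain",
        match_type="regex",               # assumed: e.g. "contains" also possible
        case_sensitive=False,
        field_path="content[0].text",     # assumed path syntax into the result
    ),
    name="fetch_output_has_signal",
)
```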
Some checks need subjective evaluation (e.g., “good summary”). Use rubric‑based judges:
```python
judge = Expect.judge.llm(
    rubric="Response should demonstrate successful content extraction and provide a meaningful summary",
    min_score=0.8,
    include_input=True,
)
await session.assert_that(judge, response=resp, name="extraction_quality_assessment")
```
Why: judges provide a tunable gate (`min_score`) for non‑deterministic tasks. In CI, keep them few and scoped.