

Evaluate your agent’s reasoning, tool use, recovery, and quality by driving it through realistic tasks.

Define the test agent

  • Global default:

```python
import mcp_eval
from mcp_agent.agents.agent_spec import AgentSpec

mcp_eval.use_agent(
    AgentSpec(name="Fetcher", instruction="You fetch.", server_names=["fetch"])  # see [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)
)
```
  • Per‑test override with with_agent (place above @task):

```python
from mcp_eval.core import with_agent, task
from mcp_agent.agents.agent import Agent

@with_agent(Agent(name="Custom", instruction="Custom", server_names=["fetch"]))  # see [Core](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/core.py)
@task("Custom agent test")
async def test_custom(agent, session):
    resp = await agent.generate_str("Fetch https://example.com")
```
  • Factory for parallel safety:

```python
from mcp_eval.config import use_agent_factory
from mcp_agent.agents.agent import Agent

def make_agent():
    return Agent(name="Isolated", instruction="...", server_names=["fetch"])  # see [Settings](https://github.com/lastmile-ai/mcp-eval/blob/main/src/mcp_eval/config.py)

use_agent_factory(make_agent)
```
More patterns: agent_definition_examples.py.
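Why a factory? Here is a minimal plain-Python sketch of the isolation it buys under parallel execution (FakeAgent is a hypothetical stand-in used only for illustration, not part of mcp-eval): a shared instance accumulates state across tests, while a factory hands each test a fresh object.

```python
class FakeAgent:
    """Hypothetical stand-in for an agent, for illustration only."""
    def __init__(self, name):
        self.name = name
        self.history = []  # mutable per-instance state

# Shared instance: two "tests" touching the same object contaminate each other.
shared = FakeAgent("Shared")
shared.history.append("test-1 message")
shared.history.append("test-2 message")
assert len(shared.history) == 2  # test-2 sees test-1's state

# Factory: each test gets its own isolated instance.
def make_agent():
    return FakeAgent("Isolated")

a, b = make_agent(), make_agent()
a.history.append("test-1 message")
assert len(b.history) == 0  # no leakage between tests
```

The same reasoning applies to any mutable per-agent state (conversation history, connected servers), which is why a factory is the safer default when tests run concurrently.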

What to measure

  • Tool behavior: Expect.tools.was_called, called_with, sequence, output_matches
  • Efficiency and iterations: Expect.performance.max_iterations, Expect.path.efficiency
  • Quality: Expect.judge.llm, Expect.judge.multi_criteria
  • Performance: response times, concurrency (see metrics)

```python
# Efficiency and iteration bounds
await session.assert_that(Expect.performance.max_iterations(3))

# Tool behavior and outputs
await session.assert_that(Expect.tools.was_called("fetch"))
await session.assert_that(Expect.tools.output_matches("fetch", {"isError": False}, match_type="partial"))

# Path and sequence
await session.assert_that(Expect.tools.sequence(["fetch"], allow_other_calls=True))
await session.assert_that(Expect.path.efficiency(expected_tool_sequence=["fetch"], allow_extra_steps=1))
```
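The output_matches assertion above uses match_type="partial". As a plain-Python sketch of what partial matching means (partial_match is a hypothetical helper written for this illustration, not mcp-eval's implementation), every expected key must appear in the actual tool output with an equal value, while extra keys in the output are ignored:

```python
def partial_match(expected, actual):
    """Hypothetical illustration of partial dict matching: each key in
    `expected` must exist in `actual` with an equal value; any extra
    keys in `actual` are ignored."""
    return all(k in actual and actual[k] == v for k, v in expected.items())

# A tool result typically carries more fields than the one under test.
tool_output = {"isError": False, "content": "<html>...</html>", "status": 200}

assert partial_match({"isError": False}, tool_output)       # matches despite extra keys
assert not partial_match({"isError": True}, tool_output)    # value mismatch fails
```

This is why asserting on {"isError": False} alone is enough: the match does not require you to reproduce the full tool output.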

Styles for agent evals

Inspecting spans and metrics

```python
metrics = session.get_metrics()
span_tree = session.get_span_tree()
```
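To give a feel for working with a span tree, here is a plain-Python sketch of recursive traversal. The Span dataclass is entirely hypothetical; the objects returned by session.get_span_tree() may have a different shape, so treat this only as an illustration of the kind of analysis a span tree enables.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Hypothetical span shape, for illustration only."""
    name: str
    duration_ms: float
    children: list = field(default_factory=list)

def total_time(span):
    # Recursively sum this span's duration with all descendants'.
    return span.duration_ms + sum(total_time(c) for c in span.children)

tree = Span("task", 120.0, [Span("fetch", 80.0), Span("judge", 30.0)])
assert total_time(tree) == 230.0
```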