⚙️ Configure with confidence! This comprehensive guide covers every configuration option, from basic setup to advanced customization. You’ll learn exactly how to tune mcp-eval for your specific needs.
Quick configuration finder
What do you need to configure?
- Basic Setup: essential settings to get started
- Servers: MCP server connections
- Agents: agent behavior and models
- Providers: LLM providers and API keys
- Testing: test execution settings
- Reporting: output formats and locations
Configuration overview
mcp-eval uses a layered configuration system that gives you flexibility and control:
File precedence (later overrides earlier)
1. mcp-agent.config.yaml - Base configuration for servers and providers
2. mcp-agent.secrets.yaml - Secure API keys and credentials
3. mcpeval.yaml - mcp-eval specific settings
4. mcpeval.secrets.yaml - mcp-eval specific secrets
5. Environment variables - Runtime overrides
6. Programmatic configuration - Code-level settings
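For example, if both base files set a default model, the value from mcpeval.yaml wins because it is loaded later, and an environment variable such as MCP_EVAL_MODEL would override both files at runtime. A minimal illustration using keys from this guide:

```yaml
# mcp-agent.config.yaml
model: "claude-3-haiku-20240307"

# mcpeval.yaml (loaded later, so this value is the one used)
model: "claude-3-5-sonnet-20241022"
```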
File discovery
mcp-eval searches for configuration files in this order:
```
Current directory:
├── mcpeval.yaml
├── mcpeval.secrets.yaml
├── mcp-agent.config.yaml
├── mcp-agent.secrets.yaml
└── .mcp-eval/
    ├── config.yaml
    └── secrets.yaml

Parent directories (recursive):
└── (same structure)

Home directory:
└── ~/.mcp-eval/
    ├── config.yaml
    └── secrets.yaml
```
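A common pattern is to keep personal defaults in the home-directory file so every project picks them up, while a project-level file found earlier in the search takes effect when present. A minimal sketch of such a user-level file (the keys mirror the examples later in this guide; the exact contents are up to you):

```yaml
# ~/.mcp-eval/config.yaml
provider: "anthropic"
model: "claude-3-haiku-20240307"
```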
Basic configuration
Let’s start with a complete, working configuration:
Complete mcpeval.yaml example
```yaml
# mcpeval.yaml
$schema: ./schema/mcpeval.config.schema.json

# Metadata
name: "My MCP Test Suite"
description: "Comprehensive testing for our MCP servers"

# Default LLM provider settings
provider: "anthropic"
model: "claude-3-5-sonnet-20241022"

# Default agent for tests
default_agent:
  name: "test_agent"
  instruction: "You are a helpful testing assistant. Be precise and thorough."
  server_names: ["calculator", "weather"]

# Judge configuration
judge:
  provider: "anthropic"  # Can differ from main provider
  model: "claude-3-5-sonnet-20241022"
  min_score: 0.8
  max_tokens: 1000
  system_prompt: "You are an expert evaluator. Be fair but strict."

# Metrics collection
metrics:
  collect:
    - "response_time"
    - "tool_coverage"
    - "iteration_count"
    - "token_usage"
    - "cost_estimate"
    - "error_rate"
    - "path_efficiency"

# Reporting configuration
reporting:
  formats: ["json", "markdown", "html"]
  output_dir: "./test-reports"
  include_traces: true
  include_config: true
  timestamp_format: "%Y%m%d_%H%M%S"

# Test execution settings
execution:
  max_concurrency: 5
  timeout_seconds: 300
  retry_failed: true
  retry_count: 3
  retry_delay: 5
  parallel: true
  stop_on_first_failure: false
  verbose: false
  debug: false

# Logging configuration
logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  file: "test-reports/mcp-eval.log"
  console: true
  show_mcp_messages: false  # Set true for debugging

# Cache configuration
cache:
  enabled: true
  ttl: 3600  # 1 hour
  directory: ".mcp-eval-cache"

# Development settings
development:
  mock_llm_responses: false
  save_llm_calls: true
  profile_performance: false
```
Minimal configuration
If you just want to get started quickly:
```yaml
# mcpeval.yaml (minimal)
provider: "anthropic"
model: "claude-3-haiku-20240307"

mcp:
  servers:
    my_server:
      command: "python"
      args: ["server.py"]
```
Server configuration
Configure your MCP servers for testing:
Basic server setup
```yaml
# In mcp-agent.config.yaml or mcpeval.yaml
mcp:
  servers:
    # Simple Python server
    calculator:
      command: "python"
      args: ["servers/calculator.py"]
      env:
        LOG_LEVEL: "DEBUG"

    # Node.js server with npm
    weather:
      command: "npm"
      args: ["run", "start:weather"]
      cwd: "./servers/weather"

    # Pre-built server from package
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]
      env:
        UV_NO_PROGRESS: "1"

    # Docker container server
    database:
      command: "docker"
      args: ["run", "--rm", "-i", "my-mcp-server:latest"]
      startup_timeout: 30  # Wait for container to start
```
Advanced server options
```yaml
mcp:
  servers:
    advanced_server:
      # Transport configuration
      transport: "stdio"  # or "http" for HTTP transport

      # For HTTP transport
      url: "http://localhost:8080"
      headers:
        Authorization: "Bearer ${SERVER_API_KEY}"

      # Command execution
      command: "python"
      args: ["server.py", "--port", "8080"]
      cwd: "/path/to/server"

      # Environment variables
      env:
        DATABASE_URL: "${DATABASE_URL}"
        API_KEY: "${API_KEY}"
        DEBUG: "true"

      # Lifecycle management
      startup_timeout: 10   # Seconds to wait for startup
      shutdown_timeout: 5   # Seconds to wait for shutdown
      restart_on_failure: true
      max_restarts: 3

      # Health checks
      health_check:
        endpoint: "/health"
        interval: 30
        timeout: 5

      # Resource limits
      resources:
        max_memory: "512M"
        max_cpu: "1.0"
```
Importing servers from other sources
```yaml
mcp:
  import:
    # Import from mcp.json (Cursor/VS Code)
    - type: "mcp_json"
      path: ".cursor/mcp.json"

    # Import from DXT manifest
    - type: "dxt"
      path: "~/Desktop/my-manifest.dxt"
```
Agent configuration
Define agents for different testing scenarios:
Agent specifications
```yaml
# In mcp-agent.config.yaml
agents:
  - name: "comprehensive_tester"
    instruction: |
      You are a thorough testing agent. Your job is to:
      1. Test all available tools systematically
      2. Verify outputs are correct
      3. Handle errors gracefully
      4. Report issues clearly
    server_names: ["calculator", "weather", "database"]
    model: "claude-3-5-sonnet-20241022"
    temperature: 0  # Deterministic for testing
    max_tokens: 4000

  - name: "minimal_tester"
    instruction: "Test basic functionality quickly."
    server_names: ["calculator"]
    model: "claude-3-haiku-20240307"  # Cheaper for simple tests

# Subagents for specific tasks
subagents:
  enabled: true
  search_paths:
    - ".claude/agents"
    - ".mcp-agent/agents"
  pattern: "**/*.yaml"
  inline:
    - name: "error_specialist"
      instruction: "Focus on finding and testing error conditions."
      server_names: ["*"]  # Access to all servers
      functions:
        - name: "validate_error"
          description: "Check if error is handled correctly"
```
Agent selection strategies
```yaml
# Use specific agent for different test types
test_strategies:
  unit:
    agent: "minimal_tester"
    timeout: 60
  integration:
    agent: "comprehensive_tester"
    timeout: 300
  stress:
    agent: "stress_tester"
    timeout: 600
    max_iterations: 100
```
Provider configuration
Configure LLM providers and authentication:
Anthropic configuration
```yaml
# In mcpeval.secrets.yaml (keep out of version control!)
anthropic:
  api_key: "sk-ant-api03-..."
  base_url: "https://api.anthropic.com"  # Optional custom endpoint
  default_model: "claude-3-5-sonnet-20241022"

  # Model-specific settings
  models:
    claude-3-5-sonnet-20241022:
      max_tokens: 8192
      temperature: 0.7
      top_p: 0.95
    claude-3-haiku-20240307:
      max_tokens: 4096
      temperature: 0.3  # More deterministic for testing
```
OpenAI configuration
```yaml
# In mcpeval.secrets.yaml
openai:
  api_key: "sk-..."
  organization: "org-..."  # Optional
  base_url: "https://api.openai.com/v1"
  default_model: "gpt-4-turbo-preview"

  models:
    gpt-4-turbo-preview:
      max_tokens: 4096
      temperature: 0.5
      presence_penalty: 0.1
      frequency_penalty: 0.1
```
Environment variable overrides
```bash
# Override configuration via environment
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Custom provider settings
export MCP_EVAL_PROVIDER="anthropic"
export MCP_EVAL_MODEL="claude-3-5-sonnet-20241022"
export MCP_EVAL_TIMEOUT="600"
```
Test execution configuration
Fine-tune how tests are executed:
Execution strategies
```yaml
execution:
  # Concurrency control
  max_concurrency: 5  # Max parallel tests
  max_workers: 10     # Max parallel tool calls

  # Timeout management
  timeout_seconds: 300  # Global timeout
  timeouts:
    unit: 60
    integration: 300
    stress: 600

  # Retry logic
  retry_failed: true
  retry_count: 3
  retry_delay: 5  # Seconds between retries
  retry_backoff: "exponential"  # or "linear"
  retry_on_errors:
    - "RateLimitError"
    - "NetworkError"
    - "TimeoutError"

  # Execution control
  parallel: true
  randomize_order: false  # Run tests in random order
  stop_on_first_failure: false
  fail_fast_threshold: 0.5  # Stop if >50% fail

  # Resource management
  max_memory_mb: 2048
  kill_timeout: 10  # Force kill after this many seconds

  # Test selection
  markers:
    skip: ["slow", "flaky"]  # Skip these markers
    only: []                 # Only run these markers
  patterns:
    include: ["test_*.py", "*_test.py"]
    exclude: ["test_experimental_*.py"]
```
Performance optimization
```yaml
performance:
  # Caching
  cache_llm_responses: true
  cache_ttl: 3600
  cache_size_mb: 100

  # Batching
  batch_size: 10  # Process tests in batches
  batch_timeout: 30

  # Rate limiting
  requests_per_second: 10
  burst_limit: 20

  # Connection pooling
  max_connections: 20
  connection_timeout: 10

  # Memory management
  gc_threshold: 100      # Force garbage collection after N tests
  clear_cache_after: 50  # Clear caches after N tests
```
Reporting configuration
Control how results are reported:
Output formats and locations
```yaml
reporting:
  # Output formats
  formats:
    - "json"      # Machine-readable
    - "markdown"  # Human-readable
    - "html"      # Interactive
    - "junit"     # CI integration
    - "csv"       # Spreadsheet analysis

  # Output configuration
  output_dir: "./test-reports"
  create_subdirs: true  # Organize by date/time

  # Report naming
  filename_template: "{suite}_{timestamp}_{status}"
  timestamp_format: "%Y%m%d_%H%M%S"

  # Content options
  include_traces: true
  include_config: true
  include_environment: true
  include_git_info: true
  include_system_info: true

  # Report detail levels
  verbosity:
    console: "summary"  # minimal, summary, detailed, verbose
    file: "detailed"
    html: "verbose"

  # Filtering
  show_passed: true
  show_failed: true
  show_skipped: false
  max_output_length: 10000  # Truncate long outputs

  # Metrics and analytics
  calculate_statistics: true
  generate_charts: true
  trend_analysis: true

  # Notifications
  notifications:
    slack:
      webhook_url: "${SLACK_WEBHOOK}"
      on_failure: true
      on_success: false
    email:
      smtp_server: "smtp.gmail.com"
      from: "tests@example.com"
      to: ["team@example.com"]
      on_failure: true
```
Custom report templates
```yaml
reporting:
  templates:
    markdown: "templates/custom_report.md.jinja"
    html: "templates/custom_report.html.jinja"
  custom_fields:
    project_name: "My MCP Project"
    team: "Platform Team"
    environment: "staging"
```
Judge configuration
Configure LLM judges for quality evaluation:
```yaml
judge:
  # Provider settings (can differ from main provider)
  provider: "anthropic"
  model: "claude-3-5-sonnet-20241022"

  # Scoring configuration
  min_score: 0.8  # Global minimum score
  score_thresholds:
    critical: 0.95
    high: 0.85
    medium: 0.70
    low: 0.50

  # Judge behavior
  max_tokens: 2000
  temperature: 0.3  # Lower for consistency

  # Judge prompts
  system_prompt: |
    You are an expert quality evaluator for AI responses.
    Be thorough, fair, and consistent in your evaluations.
    Provide clear reasoning for your scores.

  # Evaluation settings
  require_reasoning: true
  require_confidence: true
  use_cot: true  # Chain-of-thought

  # Multi-criteria defaults
  multi_criteria:
    aggregate_method: "weighted"  # weighted, min, harmonic_mean
    require_all_pass: false
    min_criteria_score: 0.7

  # Calibration
  calibration:
    enabled: true
    samples: 100
    adjust_thresholds: true
```
Environment-specific configuration
Different settings for different environments:
Development configuration
```yaml
# mcpeval.dev.yaml
$extends: "./mcpeval.yaml"  # Inherit base config

provider: "anthropic"
model: "claude-3-haiku-20240307"  # Cheaper for dev

execution:
  max_concurrency: 1    # Easier debugging
  timeout_seconds: 600  # More time for debugging
  debug: true

development:
  mock_llm_responses: true  # Use mocked responses
  save_llm_calls: true
  profile_performance: true

logging:
  level: "DEBUG"
  show_mcp_messages: true
```
CI/CD configuration
```yaml
# mcpeval.ci.yaml
$extends: "./mcpeval.yaml"

execution:
  max_concurrency: 10   # Maximize parallelism
  timeout_seconds: 180  # Strict timeouts
  retry_failed: false   # Don't hide flaky tests
  stop_on_first_failure: true

reporting:
  formats: ["junit", "json"]  # CI-friendly formats

ci:
  fail_on_quality_gate: true
  min_pass_rate: 0.95
  max_test_duration: 300
```
Production configuration
```yaml
# mcpeval.prod.yaml
$extends: "./mcpeval.yaml"

provider: "anthropic"
model: "claude-3-5-sonnet-20241022"  # Best model for production

execution:
  max_concurrency: 20
  timeout_seconds: 120
  retry_failed: true
  retry_count: 5

monitoring:
  enabled: true
  metrics_endpoint: "https://metrics.example.com"
  alerting:
    enabled: true
    thresholds:
      error_rate: 0.05
      p95_latency: 5000
```
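If you want tests to pick the right file automatically, one option is to select it in code using the load_config and use_config helpers shown in the next section. A minimal sketch; the MCP_EVAL_ENV variable here is purely illustrative, not a built-in setting:

```python
import os

from mcp_eval.config import load_config, use_config

# Choose an environment-specific config file. MCP_EVAL_ENV is a
# hypothetical variable used only for this example; substitute whatever
# mechanism your project uses to pick the environment.
env = os.environ.get("MCP_EVAL_ENV", "dev")
use_config(load_config(f"mcpeval.{env}.yaml"))
```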
Programmatic configuration
Configure mcp-eval from code:
Basic programmatic setup
```python
from mcp_eval.config import set_settings, MCPEvalSettings, use_agent
from mcp_agent.agents.agent import Agent

# Configure via dictionary
set_settings({
    "provider": "anthropic",
    "model": "claude-3-5-sonnet-20241022",
    "reporting": {
        "output_dir": "./my-reports",
        "formats": ["html", "json"]
    },
    "execution": {
        "timeout_seconds": 120,
        "max_concurrency": 3
    }
})

# Or use typed settings
settings = MCPEvalSettings(
    provider="anthropic",
    model="claude-3-haiku-20240307",
    judge={"min_score": 0.85},
    reporting={"output_dir": "./test-output"}
)
set_settings(settings)

# Configure agent
agent = Agent(
    name="my_test_agent",
    instruction="Test thoroughly",
    server_names=["my_server"]
)
use_agent(agent)
```
Advanced programmatic control
```python
from mcp_eval.config import (
    load_config,
    get_settings,
    use_config,
    ProgrammaticDefaults
)

# Load specific config file
config = load_config("configs/staging.yaml")
use_config(config)

# Modify settings at runtime
current = get_settings()
current.execution.timeout_seconds = 600
current.reporting.formats.append("csv")

# Set programmatic defaults
defaults = ProgrammaticDefaults()
defaults.set_agent_factory(lambda: create_custom_agent())
defaults.set_default_servers(["server1", "server2"])

# Context manager for temporary config
from mcp_eval.config import config_context

with config_context({"provider": "openai", "model": "gpt-4"}):
    # Tests here use OpenAI
    run_tests()
# Back to original config
```
Environment variable reference
Complete list of environment variables:
```bash
# Provider settings
ANTHROPIC_API_KEY="sk-ant-..."
OPENAI_API_KEY="sk-..."
MCP_EVAL_PROVIDER="anthropic"
MCP_EVAL_MODEL="claude-3-5-sonnet-20241022"

# Execution settings
MCP_EVAL_TIMEOUT="300"
MCP_EVAL_MAX_CONCURRENCY="5"
MCP_EVAL_RETRY_COUNT="3"
MCP_EVAL_DEBUG="true"

# Reporting
MCP_EVAL_OUTPUT_DIR="./reports"
MCP_EVAL_REPORT_FORMATS="json,html,markdown"

# Judge settings
MCP_EVAL_JUDGE_MODEL="claude-3-5-sonnet-20241022"
MCP_EVAL_JUDGE_MIN_SCORE="0.8"

# Development
MCP_EVAL_MOCK_LLM="false"
MCP_EVAL_SAVE_TRACES="true"
MCP_EVAL_PROFILE="false"

# Logging
MCP_EVAL_LOG_LEVEL="INFO"
MCP_EVAL_LOG_FILE="mcp-eval.log"
```
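In CI these variables are usually set at the job level. A hedged sketch of a GitHub Actions job; the workflow layout and the omitted install steps are assumptions, only the variable names come from the list above:

```yaml
# .github/workflows/mcp-eval.yml (hypothetical workflow; install/setup steps omitted)
name: mcp-eval
on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      MCP_EVAL_PROVIDER: "anthropic"
      MCP_EVAL_OUTPUT_DIR: "./reports"
      MCP_EVAL_REPORT_FORMATS: "junit,json"
    steps:
      - uses: actions/checkout@v4
      # Install your project and mcp-eval here, then check the configuration
      - run: mcp-eval validate
```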
Configuration validation
Ensure your configuration is correct:
Using the validate command
```bash
# Validate all configuration
mcp-eval validate

# Validate specific aspects
mcp-eval validate --servers
mcp-eval validate --agents
```
Programmatic validation
```python
import sys

from mcp_eval.config import validate_config

# Validate configuration
errors = validate_config("mcpeval.yaml")
if errors:
    print("Configuration errors:")
    for error in errors:
        print(f"  - {error}")
    sys.exit(1)
```
Schema validation
```yaml
# Add schema reference for IDE support
$schema: "./schema/mcpeval.config.schema.json"

# Your configuration here...
```
Best practices
Follow these guidelines for maintainable configuration:
- Keep secrets separate: never commit API keys. Use .secrets.yaml files and add them to .gitignore (see the example after this list).
- Use environment layers: create dev, staging, and prod configs that extend a base configuration.
- Document settings: add comments explaining non-obvious configuration choices.
- Validate regularly: run mcp-eval validate in CI to catch configuration issues early.
- Version control configs: track configuration changes, except for secrets files.
- Use defaults wisely: set sensible defaults but allow overrides for flexibility.
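For the secrets rule, the corresponding .gitignore entries (file names taken from this guide) look like:

```
# Keep credentials out of version control
mcpeval.secrets.yaml
mcp-agent.secrets.yaml
.mcp-eval/secrets.yaml
```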
Troubleshooting configuration
Common configuration issues and solutions:
| Issue | Solution |
|---|---|
| Config not found | Check file name and location; use the --config flag |
| Invalid YAML | Validate syntax with yamllint or an online validator |
| Server won't start | Check command path, permissions, and dependencies |
| API key errors | Verify the key in the secrets file or environment variable |
| Wrong model used | Check precedence: code > env > config file |
| Timeout too short | Increase execution.timeout_seconds |
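For the "Invalid YAML" row, a quick local check looks like this (assuming yamllint is installed, e.g. via pip):

```bash
pip install yamllint
yamllint mcpeval.yaml mcp-agent.config.yaml
```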
Configuration examples
Minimal testing setup
```yaml
# Quick start configuration
provider: "anthropic"
model: "claude-3-haiku-20240307"

mcp:
  servers:
    my_server:
      command: "python"
      args: ["server.py"]
```
Comprehensive testing suite
See the complete example at the beginning of this guide.
Multi-environment setup
```
# Directory structure
configs/
├── base.yaml      # Shared configuration
├── dev.yaml       # Development overrides
├── staging.yaml   # Staging overrides
├── prod.yaml      # Production settings
└── secrets.yaml   # API keys (gitignored)
```
You’re now a configuration expert! With this knowledge, you can tune mcp-eval to work perfectly for your specific testing needs. Remember: start simple and add complexity as needed! 🎯