⚙️ Configure with confidence! This comprehensive guide covers every configuration option, from basic setup to advanced customization. You’ll learn exactly how to tune mcp-eval for your specific needs.
Quick configuration finder
What do you need to configure?
- Basic Setup: essential settings to get started
- Servers: MCP server connections
- Agents: agent behavior and models
- Providers: LLM providers and API keys
- Testing: test execution settings
- Reporting: output formats and locations
Configuration overview
mcp-eval uses a layered configuration system that gives you flexibility and control:
File precedence (later overrides earlier)
1. mcp-agent.config.yaml - Base configuration for servers and providers
2. mcp-agent.secrets.yaml - Secure API keys and credentials
3. mcpeval.yaml - mcp-eval specific settings
4. mcpeval.secrets.yaml - mcp-eval specific secrets
5. Environment variables - Runtime overrides
6. Programmatic configuration - Code-level settings
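For example, if both base files set a default model, the value from mcpeval.yaml wins because it is loaded later, and an environment variable such as MCP_EVAL_MODEL would override both files at runtime. A minimal illustration using keys from this guide:

```yaml
# mcp-agent.config.yaml
model: "claude-3-haiku-20240307"

# mcpeval.yaml (loaded later, so this value is the one used)
model: "claude-3-5-sonnet-20241022"
```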
File discovery
mcp-eval searches for configuration files in this order:
```
Current directory:
├── mcpeval.yaml
├── mcpeval.secrets.yaml
├── mcp-agent.config.yaml
├── mcp-agent.secrets.yaml
└── .mcp-eval/
    ├── config.yaml
    └── secrets.yaml

Parent directories (recursive):
└── (same structure)

Home directory:
└── ~/.mcp-eval/
    ├── config.yaml
    └── secrets.yaml
```
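A common pattern is to keep personal defaults in the home-directory file so every project picks them up, while a project-level file found earlier in the search takes effect when present. A minimal sketch of such a user-level file (the keys mirror the examples later in this guide; the exact contents are up to you):

```yaml
# ~/.mcp-eval/config.yaml
provider: "anthropic"
model: "claude-3-haiku-20240307"
```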
Basic configuration
Let’s start with a complete, working configuration:
Complete mcpeval.yaml example
```yaml
# mcpeval.yaml
$schema: ./schema/mcpeval.config.schema.json

# Metadata
name: "My MCP Test Suite"
description: "Comprehensive testing for our MCP servers"

# Default LLM provider settings
provider: "anthropic"
model: "claude-3-5-sonnet-20241022"

# Default agent for tests
default_agent:
  name: "test_agent"
  instruction: "You are a helpful testing assistant. Be precise and thorough."
  server_names: ["calculator", "weather"]

# Judge configuration
judge:
  provider: "anthropic"  # Can differ from main provider
  model: "claude-3-5-sonnet-20241022"
  min_score: 0.8
  max_tokens: 1000
  system_prompt: "You are an expert evaluator. Be fair but strict."

# Metrics collection
metrics:
  collect:
    - "response_time"
    - "tool_coverage"
    - "iteration_count"
    - "token_usage"
    - "cost_estimate"
    - "error_rate"
    - "path_efficiency"

# Reporting configuration
reporting:
  formats: ["json", "markdown", "html"]
  output_dir: "./test-reports"
  include_traces: true
  include_config: true
  timestamp_format: "%Y%m%d_%H%M%S"

# Test execution settings
execution:
  max_concurrency: 5
  timeout_seconds: 300
  retry_failed: true
  retry_count: 3
  retry_delay: 5
  parallel: true
  stop_on_first_failure: false
  verbose: false
  debug: false

# Logging configuration
logging:
  level: "INFO"  # DEBUG, INFO, WARNING, ERROR
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  file: "test-reports/mcp-eval.log"
  console: true
  show_mcp_messages: false  # Set true for debugging

# Cache configuration
cache:
  enabled: true
  ttl: 3600  # 1 hour
  directory: ".mcp-eval-cache"

# Development settings
development:
  mock_llm_responses: false
  save_llm_calls: true
  profile_performance: false
```
Minimal configuration
If you just want to get started quickly:
```yaml
# mcpeval.yaml (minimal)
provider: "anthropic"
model: "claude-3-haiku-20240307"

mcp:
  servers:
    my_server:
      command: "python"
      args: ["server.py"]
```
Server configuration
Configure your MCP servers for testing:
Basic server setup
```yaml
# In mcp-agent.config.yaml or mcpeval.yaml
mcp:
  servers:
    # Simple Python server
    calculator:
      command: "python"
      args: ["servers/calculator.py"]
      env:
        LOG_LEVEL: "DEBUG"

    # Node.js server with npm
    weather:
      command: "npm"
      args: ["run", "start:weather"]
      cwd: "./servers/weather"

    # Pre-built server from package
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]
      env:
        UV_NO_PROGRESS: "1"

    # Docker container server
    database:
      command: "docker"
      args: ["run", "--rm", "-i", "my-mcp-server:latest"]
      startup_timeout: 30  # Wait for container to start
```
Advanced server options
```yaml
mcp:
  servers:
    advanced_server:
      # Transport configuration
      transport: "stdio"  # or "http" for HTTP transport

      # For HTTP transport
      url: "http://localhost:8080"
      headers:
        Authorization: "Bearer ${SERVER_API_KEY}"

      # Command execution
      command: "python"
      args: ["server.py", "--port", "8080"]
      cwd: "/path/to/server"

      # Environment variables
      env:
        DATABASE_URL: "${DATABASE_URL}"
        API_KEY: "${API_KEY}"
        DEBUG: "true"

      # Lifecycle management
      startup_timeout: 10   # Seconds to wait for startup
      shutdown_timeout: 5   # Seconds to wait for shutdown
      restart_on_failure: true
      max_restarts: 3

      # Health checks
      health_check:
        endpoint: "/health"
        interval: 30
        timeout: 5

      # Resource limits
      resources:
        max_memory: "512M"
        max_cpu: "1.0"
```
Importing servers from other sources
```yaml
mcp:
  import:
    # Import from mcp.json (Cursor/VS Code)
    - type: "mcp_json"
      path: ".cursor/mcp.json"

    # Import from DXT manifest
    - type: "dxt"
      path: "~/Desktop/my-manifest.dxt"
```
Agent configuration
Define agents for different testing scenarios:
Agent specifications
```yaml
# In mcp-agent.config.yaml
agents:
  - name: "comprehensive_tester"
    instruction: |
      You are a thorough testing agent. Your job is to:
      1. Test all available tools systematically
      2. Verify outputs are correct
      3. Handle errors gracefully
      4. Report issues clearly
    server_names: ["calculator", "weather", "database"]
    model: "claude-3-5-sonnet-20241022"
    temperature: 0  # Deterministic for testing
    max_tokens: 4000

  - name: "minimal_tester"
    instruction: "Test basic functionality quickly."
    server_names: ["calculator"]
    model: "claude-3-haiku-20240307"  # Cheaper for simple tests

# Subagents for specific tasks
subagents:
  enabled: true
  search_paths:
    - ".claude/agents"
    - ".mcp-agent/agents"
  pattern: "**/*.yaml"
  inline:
    - name: "error_specialist"
      instruction: "Focus on finding and testing error conditions."
      server_names: ["*"]  # Access to all servers
      functions:
        - name: "validate_error"
          description: "Check if error is handled correctly"
```
Agent selection strategies
```yaml
# Use specific agent for different test types
test_strategies:
  unit:
    agent: "minimal_tester"
    timeout: 60
  integration:
    agent: "comprehensive_tester"
    timeout: 300
  stress:
    agent: "stress_tester"
    timeout: 600
    max_iterations: 100
```
Provider configuration
Configure LLM providers and authentication:
Anthropic configuration
```yaml
# In mcpeval.secrets.yaml (keep out of version control!)
anthropic:
  api_key: "sk-ant-api03-..."
  base_url: "https://api.anthropic.com"  # Optional custom endpoint
  default_model: "claude-3-5-sonnet-20241022"

  # Model-specific settings
  models:
    claude-3-5-sonnet-20241022:
      max_tokens: 8192
      temperature: 0.7
      top_p: 0.95
    claude-3-haiku-20240307:
      max_tokens: 4096
      temperature: 0.3  # More deterministic for testing
```
OpenAI configuration
```yaml
# In mcpeval.secrets.yaml
openai:
  api_key: "sk-..."
  organization: "org-..."  # Optional
  base_url: "https://api.openai.com/v1"
  default_model: "gpt-4-turbo-preview"

  models:
    gpt-4-turbo-preview:
      max_tokens: 4096
      temperature: 0.5
      presence_penalty: 0.1
      frequency_penalty: 0.1
```
Environment variable overrides
```bash
# Override configuration via environment
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

# Custom provider settings
export MCP_EVAL_PROVIDER="anthropic"
export MCP_EVAL_MODEL="claude-3-5-sonnet-20241022"
export MCP_EVAL_TIMEOUT="600"
```
Test execution configuration
Fine-tune how tests are executed:
Execution strategies
```yaml
execution:
  # Concurrency control
  max_concurrency: 5  # Max parallel tests
  max_workers: 10     # Max parallel tool calls

  # Timeout management
  timeout_seconds: 300  # Global timeout
  timeouts:
    unit: 60
    integration: 300
    stress: 600

  # Retry logic
  retry_failed: true
  retry_count: 3
  retry_delay: 5  # Seconds between retries
  retry_backoff: "exponential"  # or "linear"
  retry_on_errors:
    - "RateLimitError"
    - "NetworkError"
    - "TimeoutError"

  # Execution control
  parallel: true
  randomize_order: false  # Run tests in random order
  stop_on_first_failure: false
  fail_fast_threshold: 0.5  # Stop if >50% fail

  # Resource management
  max_memory_mb: 2048
  kill_timeout: 10  # Force kill after this many seconds

  # Test selection
  markers:
    skip: ["slow", "flaky"]  # Skip these markers
    only: []                 # Only run these markers
  patterns:
    include: ["test_*.py", "*_test.py"]
    exclude: ["test_experimental_*.py"]
```
Performance optimization
```yaml
performance:
  # Caching
  cache_llm_responses: true
  cache_ttl: 3600
  cache_size_mb: 100

  # Batching
  batch_size: 10  # Process tests in batches
  batch_timeout: 30

  # Rate limiting
  requests_per_second: 10
  burst_limit: 20

  # Connection pooling
  max_connections: 20
  connection_timeout: 10

  # Memory management
  gc_threshold: 100      # Force garbage collection after N tests
  clear_cache_after: 50  # Clear caches after N tests
```
Reporting configuration
Control how results are reported:
Output formats and locations
```yaml
reporting:
  # Output formats
  formats:
    - "json"      # Machine-readable
    - "markdown"  # Human-readable
    - "html"      # Interactive
    - "junit"     # CI integration
    - "csv"       # Spreadsheet analysis

  # Output configuration
  output_dir: "./test-reports"
  create_subdirs: true  # Organize by date/time

  # Report naming
  filename_template: "{suite}_{timestamp}_{status}"
  timestamp_format: "%Y%m%d_%H%M%S"

  # Content options
  include_traces: true
  include_config: true
  include_environment: true
  include_git_info: true
  include_system_info: true

  # Report detail levels
  verbosity:
    console: "summary"  # minimal, summary, detailed, verbose
    file: "detailed"
    html: "verbose"

  # Filtering
  show_passed: true
  show_failed: true
  show_skipped: false
  max_output_length: 10000  # Truncate long outputs

  # Metrics and analytics
  calculate_statistics: true
  generate_charts: true
  trend_analysis: true

  # Notifications
  notifications:
    slack:
      webhook_url: "${SLACK_WEBHOOK}"
      on_failure: true
      on_success: false
    email:
      smtp_server: "smtp.gmail.com"
      from: "tests@example.com"
      to: ["team@example.com"]
      on_failure: true
```
Custom report templates
```yaml
reporting:
  templates:
    markdown: "templates/custom_report.md.jinja"
    html: "templates/custom_report.html.jinja"
  custom_fields:
    project_name: "My MCP Project"
    team: "Platform Team"
    environment: "staging"
```
Judge configuration
Configure LLM judges for quality evaluation:
```yaml
judge:
  # Provider settings (can differ from main provider)
  provider: "anthropic"
  model: "claude-3-5-sonnet-20241022"

  # Scoring configuration
  min_score: 0.8  # Global minimum score
  score_thresholds:
    critical: 0.95
    high: 0.85
    medium: 0.70
    low: 0.50

  # Judge behavior
  max_tokens: 2000
  temperature: 0.3  # Lower for consistency

  # Judge prompts
  system_prompt: |
    You are an expert quality evaluator for AI responses.
    Be thorough, fair, and consistent in your evaluations.
    Provide clear reasoning for your scores.

  # Evaluation settings
  require_reasoning: true
  require_confidence: true
  use_cot: true  # Chain-of-thought

  # Multi-criteria defaults
  multi_criteria:
    aggregate_method: "weighted"  # weighted, min, harmonic_mean
    require_all_pass: false
    min_criteria_score: 0.7

  # Calibration
  calibration:
    enabled: true
    samples: 100
    adjust_thresholds: true
```
Environment-specific configuration
Different settings for different environments:
Development configuration
```yaml
# mcpeval.dev.yaml
$extends: "./mcpeval.yaml"  # Inherit base config

provider: "anthropic"
model: "claude-3-haiku-20240307"  # Cheaper for dev

execution:
  max_concurrency: 1    # Easier debugging
  timeout_seconds: 600  # More time for debugging
  debug: true

development:
  mock_llm_responses: true  # Use mocked responses
  save_llm_calls: true
  profile_performance: true

logging:
  level: "DEBUG"
  show_mcp_messages: true
```
CI/CD configuration
```yaml
# mcpeval.ci.yaml
$extends: "./mcpeval.yaml"

execution:
  max_concurrency: 10   # Maximize parallelism
  timeout_seconds: 180  # Strict timeouts
  retry_failed: false   # Don't hide flaky tests
  stop_on_first_failure: true

reporting:
  formats: ["junit", "json"]  # CI-friendly formats

ci:
  fail_on_quality_gate: true
  min_pass_rate: 0.95
  max_test_duration: 300
```
Production configuration
```yaml
# mcpeval.prod.yaml
$extends: "./mcpeval.yaml"

provider: "anthropic"
model: "claude-3-5-sonnet-20241022"  # Best model for production

execution:
  max_concurrency: 20
  timeout_seconds: 120
  retry_failed: true
  retry_count: 5

monitoring:
  enabled: true
  metrics_endpoint: "https://metrics.example.com"
  alerting:
    enabled: true
    thresholds:
      error_rate: 0.05
      p95_latency: 5000
```
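If you want tests to pick the right file automatically, one option is to select it in code using the load_config and use_config helpers shown in the next section. A minimal sketch; the MCP_EVAL_ENV variable here is purely illustrative, not a built-in setting:

```python
import os

from mcp_eval.config import load_config, use_config

# Choose an environment-specific config file. MCP_EVAL_ENV is a
# hypothetical variable used only for this example; substitute whatever
# mechanism your project uses to pick the environment.
env = os.environ.get("MCP_EVAL_ENV", "dev")
use_config(load_config(f"mcpeval.{env}.yaml"))
```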
Programmatic configuration
Configure mcp-eval from code:
Basic programmatic setup
```python
from mcp_eval.config import set_settings, MCPEvalSettings, use_agent
from mcp_agent.agents.agent import Agent

# Configure via dictionary
set_settings({
    "provider": "anthropic",
    "model": "claude-3-5-sonnet-20241022",
    "reporting": {
        "output_dir": "./my-reports",
        "formats": ["html", "json"]
    },
    "execution": {
        "timeout_seconds": 120,
        "max_concurrency": 3
    }
})

# Or use typed settings
settings = MCPEvalSettings(
    provider="anthropic",
    model="claude-3-haiku-20240307",
    judge={"min_score": 0.85},
    reporting={"output_dir": "./test-output"}
)
set_settings(settings)

# Configure agent
agent = Agent(
    name="my_test_agent",
    instruction="Test thoroughly",
    server_names=["my_server"]
)
use_agent(agent)
```
Advanced programmatic control
```python
from mcp_eval.config import (
    load_config,
    get_settings,
    use_config,
    ProgrammaticDefaults
)

# Load specific config file
config = load_config("configs/staging.yaml")
use_config(config)

# Modify settings at runtime
current = get_settings()
current.execution.timeout_seconds = 600
current.reporting.formats.append("csv")

# Set programmatic defaults
defaults = ProgrammaticDefaults()
defaults.set_agent_factory(lambda: create_custom_agent())
defaults.set_default_servers(["server1", "server2"])

# Context manager for temporary config
from mcp_eval.config import config_context

with config_context({"provider": "openai", "model": "gpt-4"}):
    # Tests here use OpenAI
    run_tests()
# Back to original config
```
Environment variable reference
Complete list of environment variables:
```bash
# Provider settings
ANTHROPIC_API_KEY="sk-ant-..."
OPENAI_API_KEY="sk-..."
MCP_EVAL_PROVIDER="anthropic"
MCP_EVAL_MODEL="claude-3-5-sonnet-20241022"

# Execution settings
MCP_EVAL_TIMEOUT="300"
MCP_EVAL_MAX_CONCURRENCY="5"
MCP_EVAL_RETRY_COUNT="3"
MCP_EVAL_DEBUG="true"

# Reporting
MCP_EVAL_OUTPUT_DIR="./reports"
MCP_EVAL_REPORT_FORMATS="json,html,markdown"

# Judge settings
MCP_EVAL_JUDGE_MODEL="claude-3-5-sonnet-20241022"
MCP_EVAL_JUDGE_MIN_SCORE="0.8"

# Development
MCP_EVAL_MOCK_LLM="false"
MCP_EVAL_SAVE_TRACES="true"
MCP_EVAL_PROFILE="false"

# Logging
MCP_EVAL_LOG_LEVEL="INFO"
MCP_EVAL_LOG_FILE="mcp-eval.log"
```
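In CI these variables are usually set at the job level. A hedged sketch of a GitHub Actions job; the workflow layout and the omitted install steps are assumptions, only the variable names come from the list above:

```yaml
# .github/workflows/mcp-eval.yml (hypothetical workflow; install/setup steps omitted)
name: mcp-eval
on: [push]

jobs:
  evals:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      MCP_EVAL_PROVIDER: "anthropic"
      MCP_EVAL_OUTPUT_DIR: "./reports"
      MCP_EVAL_REPORT_FORMATS: "junit,json"
    steps:
      - uses: actions/checkout@v4
      # Install your project and mcp-eval here, then check the configuration
      - run: mcp-eval validate
```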
Configuration validation
Ensure your configuration is correct:
Using the validate command
```bash
# Validate all configuration
mcp-eval validate

# Validate specific aspects
mcp-eval validate --servers
mcp-eval validate --agents
```
Programmatic validation
```python
import sys

from mcp_eval.config import validate_config

# Validate configuration
errors = validate_config("mcpeval.yaml")
if errors:
    print("Configuration errors:")
    for error in errors:
        print(f"  - {error}")
    sys.exit(1)
```
Schema validation
```yaml
# Add schema reference for IDE support
$schema: "./schema/mcpeval.config.schema.json"

# Your configuration here...
```
Best practices
Follow these guidelines for maintainable configuration:
- Keep secrets separate: never commit API keys. Use .secrets.yaml files and add them to .gitignore (see the example after this list).
- Use environment layers: create dev, staging, and prod configs that extend a base configuration.
- Document settings: add comments explaining non-obvious configuration choices.
- Validate regularly: run mcp-eval validate in CI to catch configuration issues early.
- Version control configs: track configuration changes, except for secrets files.
- Use defaults wisely: set sensible defaults but allow overrides for flexibility.
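For the secrets rule, the corresponding .gitignore entries (file names taken from this guide) look like:

```
# Keep credentials out of version control
mcpeval.secrets.yaml
mcp-agent.secrets.yaml
.mcp-eval/secrets.yaml
```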
Troubleshooting configuration
Common configuration issues and solutions:
| Issue | Solution |
|---|---|
| Config not found | Check file name and location; use the --config flag |
| Invalid YAML | Validate syntax with yamllint or an online validator |
| Server won't start | Check command path, permissions, and dependencies |
| API key errors | Verify the key in the secrets file or environment variable |
| Wrong model used | Check precedence: code > env > config file |
| Timeout too short | Increase execution.timeout_seconds |
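For the "Invalid YAML" row, a quick local check looks like this (assuming yamllint is installed, e.g. via pip):

```bash
pip install yamllint
yamllint mcpeval.yaml mcp-agent.config.yaml
```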
Configuration examples
Minimal testing setup
```yaml
# Quick start configuration
provider: "anthropic"
model: "claude-3-haiku-20240307"

mcp:
  servers:
    my_server:
      command: "python"
      args: ["server.py"]
```
Comprehensive testing suite
See the complete example at the beginning of this guide.
Multi-environment setup
```
# Directory structure
configs/
├── base.yaml      # Shared configuration
├── dev.yaml       # Development overrides
├── staging.yaml   # Staging overrides
├── prod.yaml      # Production settings
└── secrets.yaml   # API keys (gitignored)
```
You’re now a configuration expert! With this knowledge, you can tune mcp-eval to work perfectly for your specific testing needs. Remember: start simple and add complexity as needed! 🎯