Evaluator Configuration

The [evaluator] section controls the model evaluation process, which measures the quality of your fine-tuned model's outputs using various metrics.

Basic Configuration

[evaluator]
# Evaluation metrics to compute
metrics = ["bleu_score", "rouge_score", "semantic_similarity"]

# LLM for advanced evaluation (optional)
llm = "gpt-4o-mini"
embedding = "text-embedding-3-small"

# Run configuration ("null" = auto-generate a timestamped run name)
run_name = "null"
run_name_prefix = "eval-"
run_name_suffix = ""
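
These settings live in your project's TOML configuration. As a quick sanity check you can load the [evaluator] section with Python's standard tomllib; the config.toml filename below is an assumption, so adjust it to wherever your project keeps its configuration.

# Sanity-check the [evaluator] section; "config.toml" is an assumed
# filename and path.
import tomllib  # Python 3.11+

with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

evaluator_cfg = cfg["evaluator"]
print(evaluator_cfg["metrics"])
print(evaluator_cfg.get("llm"), evaluator_cfg.get("embedding"))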

Supported Metrics

Traditional Metrics

[evaluator]
metrics = [
    "bleu_score",        # Translation quality
    "rouge_score",       # Summarization quality
]

BLEU Score:

- Range: 0-100 (higher is better)
- Good for: translation, text generation
- Measures: word overlap with the reference

ROUGE Score:

- Range: 0-1 (higher is better)
- Good for: summarization, paraphrasing
- Measures: n-gram recall
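
If you want to sanity-check these scores outside the pipeline, the sketch below computes BLEU and ROUGE-L with the widely used sacrebleu and rouge-score packages. This is illustrative only; the evaluator's internal implementation may use different libraries or settings.

# Illustrative BLEU/ROUGE computation; not necessarily what the
# evaluator uses internally.
import sacrebleu
from rouge_score import rouge_scorer

prediction = "Machine learning is a subset of AI that learns from data."
reference = "ML is a method of data analysis that automates model building."

# sacrebleu reports corpus-level BLEU on a 0-100 scale
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score:.1f}")

# rouge-score reports ROUGE on a 0-1 scale
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")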

LLM-Based Metrics

[evaluator]
metrics = [
    "factual_correctness",  # Fact verification
    "semantic_similarity",  # Meaning similarity
    "answer_accuracy",      # Answer correctness
    "answer_relevancy",     # Response relevance
    "answer_correctness"    # Overall correctness
]

# Required for LLM-based metrics
llm = "gpt-4o-mini"
embedding = "text-embedding-3-small"

Factual Correctness:

- Uses an LLM to verify factual accuracy
- Good for: QA, factual content
- Expensive but highly accurate

Semantic Similarity:

- Uses embeddings to measure meaning similarity
- Good for: paraphrasing, content similarity
- Fast and cost-effective
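
As a rough mental model, semantic similarity can be computed as the cosine similarity between embeddings of the prediction and the ground truth. The sketch below uses the OpenAI SDK directly; it is an assumption about the approach, not the evaluator's actual code path.

# Cosine similarity between OpenAI embeddings; an illustrative sketch,
# not the evaluator's internal implementation.
import numpy as np
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()

def semantic_similarity(prediction: str, ground_truth: str) -> float:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[prediction, ground_truth],
    )
    a, b = (np.array(d.embedding) for d in resp.data)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(semantic_similarity("ML is a subset of AI.", "Machine learning is part of AI."))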

LLM Configuration

OpenAI Models

[evaluator]
# Recommended models
llm = "gpt-4o-mini"      # Cost-effective, good quality
# llm = "gpt-4o"         # Higher quality, more expensive
# llm = "gpt-3.5-turbo"  # Fastest, lower cost

# Embedding models
embedding = "text-embedding-3-small"   # Recommended
# embedding = "text-embedding-3-large" # Higher quality

Cost Considerations

| Model | Cost per 1M tokens (input / output) | Use case |
|---|---|---|
| gpt-4o-mini | $0.15 / $0.60 | Development, large-scale evaluation |
| gpt-4o | $5.00 / $15.00 | Production, highest quality |
| gpt-3.5-turbo | $0.50 / $1.50 | Budget-conscious evaluation |
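
A quick back-of-the-envelope estimate helps when choosing a judge model. The sample count and token sizes below are assumptions for illustration; plug in your own numbers.

# Rough cost estimate for an LLM-judged evaluation run; sample count and
# token sizes are assumptions.
SAMPLES = 1000
INPUT_TOKENS_PER_SAMPLE = 800    # question + prediction + ground truth
OUTPUT_TOKENS_PER_SAMPLE = 200   # judge model's verdict

PRICES_PER_1M = {                # USD per 1M tokens (input, output)
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (5.00, 15.00),
}

for model, (p_in, p_out) in PRICES_PER_1M.items():
    cost = (SAMPLES * INPUT_TOKENS_PER_SAMPLE / 1e6) * p_in \
         + (SAMPLES * OUTPUT_TOKENS_PER_SAMPLE / 1e6) * p_out
    print(f"{model}: ~${cost:.2f} per run")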

Run Configuration

Naming Convention

[evaluator]
# Custom run name
run_name = "experiment-v1"

# Or auto-generated with prefix/suffix
run_name = "null"
run_name_prefix = "eval-qa-"
run_name_suffix = "-final"
# Results in: eval-qa-20250629-143022-final
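
The example suggests the auto-generated name is prefix + timestamp + suffix. The sketch below reproduces that pattern; the exact format string used by the tool is an assumption.

# Reproduces the naming pattern implied by the example above; the tool's
# exact format string is an assumption.
from datetime import datetime

def build_run_name(prefix: str = "eval-qa-", suffix: str = "-final") -> str:
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return f"{prefix}{timestamp}{suffix}"

print(build_run_name())  # e.g. eval-qa-20250629-143022-final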

Input Data Format

The evaluator expects:

  1. Predictions file: inferencer_output.jsonl
  2. Ground truth data: From your test dataset

Expected Format

{
    "question": "What is machine learning?",
    "predicted_answer": "Machine learning is a subset of AI...",
    "ground_truth": "ML is a method of data analysis...",
    "metadata": {...}
}
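
A minimal reader for the predictions file looks like the sketch below, assuming one JSON object per line with the fields shown above; the file's location depends on your output directory.

# Read inferencer_output.jsonl (one JSON object per line); the path is
# assumed to be relative to the working directory.
import json

records = []
with open("inferencer_output.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

print(records[0]["question"], "->", records[0]["predicted_answer"])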

Output Files

The evaluator generates:

1. Summary Report (evaluator_output_summary.json)

{
    "overall_scores": {
        "bleu_score": 25.4,
        "rouge_score": 0.68,
        "semantic_similarity": 0.82
    },
    "metadata": {
        "total_samples": 100,
        "evaluation_time": "2023-06-29T14:30:22",
        "metrics_used": ["bleu_score", "rouge_score"]
    }
}
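
Since the summary is plain JSON, it is easy to pull scores into scripts or dashboards, as in this small sketch based on the structure above.

# Load the summary report and print the overall scores.
import json

with open("evaluator_output_summary.json", encoding="utf-8") as f:
    summary = json.load(f)

for metric, score in summary["overall_scores"].items():
    print(f"{metric}: {score}")
print("samples evaluated:", summary["metadata"]["total_samples"])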

2. Detailed Report (evaluator_output_detailed.xlsx)

Excel file with:

- Individual scores per sample
- Question-answer pairs
- Metric breakdowns
- Statistical analysis
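
For ad-hoc analysis you can load the workbook with pandas; the sheet and column names are assumptions here, so list the sheets first and adapt as needed.

# Inspect the detailed report; sheet/column names are assumptions.
import pandas as pd

sheets = pd.read_excel("evaluator_output_detailed.xlsx", sheet_name=None)
print(list(sheets))                   # available sheet names
per_sample = next(iter(sheets.values()))
print(per_sample.describe())          # distribution of per-sample scores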

Performance Configuration

Speed Optimization

[evaluator]
# Use only fast metrics
metrics = ["bleu_score", "rouge_score"]

# Skip LLM-based evaluation
llm = "null"

Quality Optimization

[evaluator]
# Use comprehensive metrics
metrics = [
    "bleu_score", 
    "rouge_score", 
    "factual_correctness",
    "semantic_similarity",
    "answer_accuracy"
]

# Use high-quality models
llm = "gpt-4o"
embedding = "text-embedding-3-large"

Usage Examples

Basic Evaluation

# Evaluate with traditional metrics only
uv run app/evaluator.py

Full Evaluation with LLM

# Include LLM-based metrics
uv run app/evaluator.py --openai-key "your_openai_key"

Programmatic Usage

from app.evaluator import Evaluator

# Initialize evaluator
evaluator = Evaluator()

# Run evaluation
results = evaluator.run()
print(f"Overall BLEU score: {results['bleu_score']}")

Metric Interpretation

BLEU Score Guidelines

| Score range | Quality | Interpretation |
|---|---|---|
| 0-10 | Poor | Almost no overlap with the reference |
| 10-20 | Fair | Some overlap, needs improvement |
| 20-40 | Good | Reasonable quality |
| 40+ | Excellent | High-quality generation |

ROUGE Score Guidelines

| Score range | Quality | Interpretation |
|---|---|---|
| 0-0.2 | Poor | Low recall of important content |
| 0.2-0.4 | Fair | Moderate content coverage |
| 0.4-0.6 | Good | Good content recall |
| 0.6+ | Excellent | High content overlap |

Semantic Similarity Guidelines

| Score range | Quality | Interpretation |
|---|---|---|
| 0-0.5 | Poor | Different meaning |
| 0.5-0.7 | Fair | Related but different |
| 0.7-0.85 | Good | Similar meaning |
| 0.85+ | Excellent | Nearly identical meaning |
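
If you want to label results automatically, a small helper can map a score to these quality bands. The thresholds come from the table above; the helper itself is hypothetical and not part of the tool.

# Hypothetical helper mapping a semantic-similarity score to the quality
# bands in the table above.
def similarity_band(score: float) -> str:
    if score >= 0.85:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"

print(similarity_band(0.82))  # Good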

Configuration Examples

Research Evaluation

[evaluator]
# Comprehensive metrics for research
metrics = [
    "bleu_score",
    "rouge_score", 
    "factual_correctness",
    "semantic_similarity",
    "answer_accuracy",
    "answer_relevancy"
]

llm = "gpt-4o"
embedding = "text-embedding-3-large"
run_name_prefix = "research-"

Production Monitoring

[evaluator]
# Fast metrics for regular monitoring
metrics = ["bleu_score", "semantic_similarity"]

llm = "gpt-4o-mini"
embedding = "text-embedding-3-small"
run_name_prefix = "prod-monitor-"

Budget-Conscious Evaluation

[evaluator]
# Traditional metrics only (no API costs)
metrics = ["bleu_score", "rouge_score"]

llm = "null"  # Disable LLM-based metrics
embedding = "null"

Error Handling

API Key Issues

# Error: OpenAI API key not provided
export OPENAI_API_KEY="your_key"
# or
uv run app/evaluator.py --openai-key "your_key"

Rate Limiting

[evaluator]
# Use cheaper model to avoid rate limits
llm = "gpt-4o-mini"

# Or disable LLM metrics
metrics = ["bleu_score", "rouge_score"]

Data Format Issues

Ensure your data matches the expected format:

- Questions and answers are strings
- Ground truth is available for every sample
- No missing or null values in required fields
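
A simple pre-flight check, sketched below, catches most of these issues before an evaluation run; the field names follow the expected format above and the file path is an assumption.

# Pre-flight validation of the predictions file; field names follow the
# expected format, the path is an assumption.
import json

REQUIRED = ("question", "predicted_answer", "ground_truth")

with open("inferencer_output.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        for field in REQUIRED:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                print(f"line {lineno}: missing or empty '{field}'")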