Evaluator Configuration

The [evaluator] section controls the model evaluation process, which measures the quality of your fine-tuned model's outputs using various metrics.

Basic Configuration

[evaluator]
# Evaluation metrics to compute
metrics = ["bleu_score", "rouge_score", "semantic_similarity"]

# LLM for advanced evaluation
llm = "gpt-4o-mini"
embedding = "text-embeddings-3-small"

Supported Metrics

Traditional Metrics

[evaluator]
metrics = [
    "bleu_score",        # Translation quality
    "rouge_score",       # Summarization quality
]

BLEU Score:

  • Range: 0-1 (higher is better)
  • Good for: Translation, text generation
  • Measures: N-gram precision overlap with the reference

ROUGE Score:

  • Range: 0-1 (higher is better)
  • Good for: Summarization, paraphrasing
  • Measures: N-gram recall
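
As a rough illustration of what these two metrics measure, the sketch below scores a single prediction with the sacrebleu and rouge_score packages. This is not the evaluator's own implementation, only a sketch of the underlying calculations; note that sacrebleu reports BLEU on a 0-100 scale, so divide by 100 to compare against the 0-1 ranges above.

# Illustrative only; the evaluator's internal implementation may differ.
import sacrebleu
from rouge_score import rouge_scorer

prediction = "Machine learning is a subset of AI that learns from data."
reference = "ML is a method of data analysis that automates model building."

# BLEU: n-gram precision overlap with the reference (sacrebleu uses a 0-100 scale)
bleu = sacrebleu.corpus_bleu([prediction], [[reference]])
print(f"BLEU: {bleu.score / 100:.3f}")

# ROUGE-L: longest-common-subsequence recall/F1 against the reference
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")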

LLM-Based Metrics

[evaluator]
metrics = [
    "factual_correctness",  # compares and evaluates the factual accuracy of the generated response with the reference
    "semantic_similarity",  # assessment of the semantic resemblance between the generated answer and the ground truth using a bi-encoder
    "answer_accuracy",      # measures the agreement between a model’s response and a reference ground truth for a given question.
    "answer_relevancy",     # measures how relevant a response is to the user input.
    "answer_correctness"    # gauging the accuracy of the generated answer when compared to the ground truth.
]

# Required for LLM-based metrics
llm = "gpt-4o-mini"
embedding = "text-embeddings-3-small"

Factual Correctness:

  • Uses LLM to verify factual accuracy
  • Good for: QA, factual content
  • Expensive but highly accurate

Semantic Similarity:

  • Uses embeddings to measure meaning similarity
  • Good for: Paraphrasing, content similarity
  • Fast and cost-effective

More details on the LLM-based metrics can be found in the RAGAS documentation.
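
For intuition, the sketch below shows how an embedding-based semantic similarity score can be computed with the OpenAI embeddings API and cosine similarity. The evaluator's LLM-based metrics follow the RAGAS definitions, so treat this only as an illustration of the underlying idea, not the exact computation.

# Illustration of embedding-based semantic similarity; not the evaluator's exact implementation.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def semantic_similarity(answer: str, ground_truth: str, model: str = "text-embedding-3-small") -> float:
    resp = client.embeddings.create(model=model, input=[answer, ground_truth])
    a = np.array(resp.data[0].embedding)
    b = np.array(resp.data[1].embedding)
    # Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(semantic_similarity(
    "Machine learning is a subset of AI...",
    "ML is a method of data analysis...",
))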

LLM Configuration

OpenAI Models

[evaluator]
# Recommended models
llm = "gpt-4o-mini"      # Cost-effective, good quality
# llm = "gpt-4o"         # Higher quality, more expensive
# llm = "gpt-4.1"        # Latest model, best performance

# Embedding models
embedding = "text-embeddings-3-small"  # Recommended
# embedding = "text-embeddings-3-large" # Higher quality

Cost Considerations

Model           Cost per 1M tokens (input/output)   Use Case
gpt-4o-mini     $0.15 / $0.60                       Development, large-scale evaluation
gpt-4o          $5.00 / $15.00                      Production, highest quality
gpt-3.5-turbo   $0.50 / $1.50                       Budget-conscious evaluation
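
A quick back-of-the-envelope estimate can help when choosing a judge model. The sketch below multiplies the per-token prices above by assumed token counts; the sample count and token counts per sample are hypothetical placeholders, not measured values.

# Rough cost estimate for an LLM-judged evaluation run (token counts are assumptions).
PRICE_PER_1M = {"gpt-4o-mini": (0.15, 0.60), "gpt-4o": (5.00, 15.00)}  # (input, output) USD

def estimate_cost(model: str, samples: int, in_tokens: int = 800, out_tokens: int = 200) -> float:
    in_price, out_price = PRICE_PER_1M[model]
    return samples * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

print(f"gpt-4o-mini: ${estimate_cost('gpt-4o-mini', samples=1000):.2f}")
print(f"gpt-4o:      ${estimate_cost('gpt-4o', samples=1000):.2f}")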

Run Configuration

Input Data Format

The evaluator expects:

  1. Predictions file: inferencer_output.jsonl
  2. Ground truth data: From your test dataset

Expected Format

{
    "question": "What is machine learning?",
    "predicted_answer": "Machine learning is a subset of AI...",
    "ground_truth": "ML is a method of data analysis..."
}
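
Assuming inferencer_output.jsonl is standard JSON Lines (as the .jsonl extension suggests), each record sits on its own line. A minimal round-trip sketch, with placeholder field values:

# Minimal JSONL round trip for the expected prediction format.
import json

record = {
    "question": "What is machine learning?",
    "predicted_answer": "Machine learning is a subset of AI...",
    "ground_truth": "ML is a method of data analysis...",
}

with open("inferencer_output.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

with open("inferencer_output.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]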

Output Files

The evaluator generates:

1. Summary Report (evaluator_output_summary.json)

{
    "bleu_score": 25.4,
    "rouge_score": 0.68,
    "semantic_similarity": 0.82,
    "factual_correctness": 0.90
}

2. Detailed Report (evaluator_output_detailed.xlsx)

Excel file with:

  • Individual scores per sample
  • Question-answer pairs
  • Metric breakdowns
  • Statistical analysis
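
For example, a CI step or notebook can load the summary report and flag regressions against target scores. The thresholds below are hypothetical; pick values that match the interpretation guidelines later on this page.

# Load the summary report and compare against (hypothetical) minimum scores.
import json

THRESHOLDS = {"rouge_score": 0.4, "semantic_similarity": 0.7, "factual_correctness": 0.8}

with open("evaluator_output_summary.json", encoding="utf-8") as f:
    summary = json.load(f)

for metric, minimum in THRESHOLDS.items():
    score = summary.get(metric)
    status = "OK" if score is not None and score >= minimum else "CHECK"
    print(f"{metric}: {score} (target >= {minimum}) -> {status}")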

Performance Configuration

Speed Optimization

[evaluator]
# Use only fast metrics
metrics = ["bleu_score", "rouge_score"]

# Skip LLM-based evaluation
llm = "null"

Quality Optimization

[evaluator]
# Use comprehensive metrics
metrics = [
    "bleu_score", 
    "rouge_score", 
    "factual_correctness",
    "semantic_similarity",
    "answer_accuracy"
]

# Use high-quality models
llm = "gpt-4o"
embedding = "text-embeddings-3-large"

Metric Interpretation

BLEU Score Guidelines

Score Range   Quality     Interpretation
0-0.10        Poor        Almost no overlap with reference
0.10-0.20     Fair        Some overlap, needs improvement
0.20-0.40     Good        Reasonable quality
0.40+         Excellent   High-quality generation

ROUGE Score Guidelines

Score Range   Quality     Interpretation
0-0.2         Poor        Low recall of important content
0.2-0.4       Fair        Moderate content coverage
0.4-0.6       Good        Good content recall
0.6+          Excellent   High content overlap

Semantic Similarity Guidelines

Score Range   Quality     Interpretation
0-0.5         Poor        Different meaning
0.5-0.7       Fair        Related but different
0.7-0.85      Good        Similar meaning
0.85+         Excellent   Nearly identical meaning
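
The guideline tables above translate directly into a small lookup helper, shown below as a convenience sketch; the band boundaries are taken verbatim from the tables.

# Map a metric score to the quality bands from the tables above.
BANDS = {
    "bleu_score":          [(0.40, "Excellent"), (0.20, "Good"), (0.10, "Fair"), (0.0, "Poor")],
    "rouge_score":         [(0.60, "Excellent"), (0.40, "Good"), (0.20, "Fair"), (0.0, "Poor")],
    "semantic_similarity": [(0.85, "Excellent"), (0.70, "Good"), (0.50, "Fair"), (0.0, "Poor")],
}

def interpret(metric: str, score: float) -> str:
    for lower_bound, label in BANDS[metric]:
        if score >= lower_bound:
            return label
    return "Poor"

print(interpret("rouge_score", 0.68))          # Excellent
print(interpret("semantic_similarity", 0.82))  # Good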

Configuration Examples

Research Evaluation

[evaluator]
# Comprehensive metrics for research
metrics = [
    "bleu_score",
    "rouge_score", 
    "factual_correctness",
    "semantic_similarity",
    "answer_accuracy",
    "answer_relevancy"
]

llm = "gpt-4o"
embedding = "text-embeddings-3-large"
run_name_prefix = "research-"

Production Monitoring

[evaluator]
# Fast metrics for regular monitoring
metrics = ["bleu_score", "semantic_similarity"]

llm = "gpt-4o-mini"
embedding = "text-embeddings-3-small"
run_name_prefix = "prod-monitor-"

Budget-Conscious Evaluation

[evaluator]
# Traditional metrics only (no API costs)
metrics = ["bleu_score", "rouge_score"]

llm = "null"  # Disable LLM-based metrics
embedding = "null"

Error Handling

API Key Issues

Error: OpenAI API key not provided

Provide the API key via GitHub secrets.
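
Assuming the evaluator reads the key from the standard OPENAI_API_KEY environment variable (an assumption; check your pipeline's setup), a quick pre-flight check avoids a run failing partway through:

# Fail fast if the OpenAI key is missing (assumes the OPENAI_API_KEY env var is used).
import os
import sys

if not os.environ.get("OPENAI_API_KEY"):
    sys.exit("OpenAI API key not provided: set OPENAI_API_KEY (e.g. from a GitHub secret).")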

Rate Limiting

[evaluator]
# Use cheaper model to avoid rate limits
llm = "gpt-4o-mini"

# Or disable LLM metrics
metrics = ["bleu_score", "rouge_score"]

Data Format Issues

Ensure your data matches the expected format:

  • Questions and answers are strings
  • Ground truth is available
  • No missing or null values in required fields
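
A small pre-check along these lines can catch format problems before an evaluation run; the field names follow the expected format shown earlier, and the file path is the default predictions file.

# Validate prediction records against the requirements above.
import json

REQUIRED = ("question", "predicted_answer", "ground_truth")

def validate(path: str = "inferencer_output.jsonl") -> list[str]:
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            for field in REQUIRED:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {i}: missing or non-string '{field}'")
    return problems

for problem in validate():
    print(problem)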