Fine-Tuner Configuration¶
The [fine_tuner] section controls all aspects of the model fine-tuning process. This page provides comprehensive documentation of all available configuration options.
Model Configuration¶
Base Model Settings¶
[fine_tuner]
# Required: Hugging Face model ID or local path
base_model_id = "unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit"
# Maximum sequence length for training
max_sequence_length = 4096
# Data type for model weights ("null" for auto-detection)
dtype = "null" # Options: "float16", "bfloat16", "null"
# Quantization settings (choose one)
load_in_4bit = true
load_in_8bit = false
# Whether to use full fine-tuning instead of LoRA
full_finetuning = false
Model Resource Recommendations¶
Recommended Model | Memory Requirement |
---|---|
unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit | 2GB |
unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit | 4GB |
unsloth/Qwen2.5-3B-Instruct-bnb-4bit | 8GB |
unsloth/Qwen2.5-7B-Instruct-bnb-4bit | 16GB |
LoRA Configuration¶
Low-Rank Adaptation (LoRA) settings for parameter-efficient fine-tuning:
[fine_tuner]
# LoRA rank - higher values = more trainable parameters
rank = 16 # Typical values: 8, 16, 32, 64, ...
# LoRA alpha - scaling factor for LoRA updates
lora_alpha = 16 # Usually equal to rank
# Dropout rate for LoRA layers
lora_dropout = 0.1 # Typical range: 0.0 - 0.3
# Target modules for LoRA adaptation
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention layers
"gate_proj", "up_proj", "down_proj" # Feed-forward layers
]
# Bias handling
bias = "none" # Options: "none", "all", "lora_only"
# Advanced LoRA options
use_rslora = false # Rank-Stabilized LoRA
loftq_config = "null" # LoFTQ configuration
LoRA Performance Guide¶
Rank | Parameters | Speed | Quality | Use Case |
---|---|---|---|---|
8 | ~0.5M | Fast | Good | Quick prototyping |
16 | ~1M | Medium | Better | General purpose |
32 | ~2M | Slower | High | Quality-focused |
64 | ~4M | Slowest | Highest | Research/Production |
Dataset Configuration¶
Data Sources¶
[fine_tuner]
# Required: Training dataset
training_data_id = "your-huggingface-username/training-dataset"
# Optional: Validation dataset
validation_data_id = "your-huggingface-username/validation-dataset" # or "null"
# Number of processes for dataset loading
dataset_num_proc = 4
Column Mapping¶
Map your dataset columns to the expected format:
[fine_tuner]
# Required columns
question_column = "question" # Input/instruction column
ground_truth_column = "answer" # Target/response column
# Optional system prompt
system_prompt_column = "system" # System prompt column (or "null")
# Override system prompt for all examples
system_prompt_override_text = "null" # Custom system prompt (or "null")
Dataset Format Examples¶
A typical record for domain adaptation tasks contains a question, a ground-truth answer, and optionally a system prompt.
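For example, with the default column mapping described below, a record might look like this (the values are illustrative only):
question | answer | system |
---|---|---|
What does the load_in_4bit option do? | It loads the base model weights in 4-bit quantized form to reduce GPU memory usage. | You are a helpful assistant for this project's documentation. |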
Training Parameters¶
Basic Training Settings¶
[fine_tuner]
# Number of training epochs
epochs = 30
# Learning rate
learning_rate = 0.0002 # Typical range: 1e-5 to 5e-4
# Batch sizes
device_train_batch_size = 4 # Per-device batch size
device_validation_batch_size = 4 # Validation batch size
grad_accumulation = 4 # Gradient accumulation steps
# Warmup and scheduling
warmup_steps = 5 # Learning rate warmup
lr_scheduler_type = "linear" # Options: "linear", "cosine", "constant"
# Optimization
optimizer = "paged_adamw_8bit" # Memory-efficient optimizer
weight_decay = 0.01 # L2 regularization
# Random seed for reproducibility
seed = 42
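With the settings above, the effective batch size per optimizer step is device_train_batch_size × grad_accumulation = 4 × 4 = 16 per device. If you lower the per-device batch size to save memory, increase grad_accumulation to keep this product roughly constant.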
Advanced Training Options¶
[fine_tuner]
# Memory optimization
use_gradient_checkpointing = "unsloth" # Options: true, false, "unsloth"
use_flash_attention = true # Flash attention for efficiency
packing = false # Pack multiple short sequences into one training sequence
# Training on responses only
train_on_responses_only = true
question_part = "<|im_start|>user\n" # Question template
answer_part = "<|im_start|>assistant\n" # Answer template
Training on Responses Only
By default, the loss is computed over the entire formatted example, including the system prompt and the question. With train_on_responses_only enabled, the loss is restricted to the assistant's response tokens, which typically results in higher accuracy on the target task.
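For reference, the question_part and answer_part markers above assume a ChatML-style chat template (as used by Qwen models). A formatted training example then looks roughly like the sketch below (illustrative content); with train_on_responses_only = true, only the text after the assistant marker contributes to the loss:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What does the load_in_4bit option do?<|im_end|>
<|im_start|>assistant
It loads the base model weights in 4-bit quantized form to reduce GPU memory usage.<|im_end|>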
Logging and Monitoring¶
MLflow Integration¶
[fine_tuner]
# Logging configuration
log_steps = 10 # Log metrics every N steps
log_first_step = true # Log the first step
report_to = "mlflow" # Reporting backend: "wandb", "tensorboard", "mlflow", "none"
[mlflow]
# MLflow tracking URI and experiment settings
tracking_uri = "https://your-mlflow-tracking-uri"
experiment_name = "your-experiment-name"
run_name = "your-run-name" # Custom run name or "null" for auto. Recommended to use a versioning scheme like "0.0.1"
Model Saving¶
[fine_tuner]
# Checkpoint saving
save_steps = 20 # Save checkpoint every N steps
save_total_limit = 3 # Maximum number of checkpoints to keep (older ones will be deleted)
# Hugging Face Hub integration
push_to_hub = true # Push final model to Hub
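Note that push_to_hub requires Hugging Face authentication with write access to the target repository.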
Run Naming¶
Control how your training runs are named:
[mlflow]
# Run name configuration
run_name = "0.0.1" # Custom run name (or "null" for auto)
run_name_prefix = "" # Prefix for auto-generated names
run_name_suffix = "" # Suffix for auto-generated names
Recommendation
Use a versioning scheme like 0.0.1 for run_name to easily track changes across runs. You can also use prefixes and suffixes to add context, e.g., the prefix exp- for experiments or a suffix like -alpha for further versioning (e.g., exp-0.0.1-alpha).
Run Name Examples¶
Configuration | Generated Name |
---|---|
run_name = "my-model" |
my-model |
run_name_prefix = "exp-" |
exp-20250629-143022 |
run_name_suffix = "-v1" |
20250629-143022-v1 |
Memory Optimization Guide¶
For 4GB GPU (e.g., GTX 1650)¶
[fine_tuner]
base_model_id = "unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit"
max_sequence_length = 1024
load_in_4bit = true
device_train_batch_size = 1
grad_accumulation = 16
use_gradient_checkpointing = "unsloth"
rank = 8
For 8GB GPU (RTX 3070)¶
[fine_tuner]
base_model_id = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"
max_sequence_length = 2048
load_in_4bit = true
device_train_batch_size = 2
grad_accumulation = 8
use_gradient_checkpointing = "unsloth"
rank = 16
For 12GB+ GPU (RTX 3080 Ti/4070 Ti)¶
[fine_tuner]
base_model_id = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
max_sequence_length = 4096
load_in_4bit = true
device_train_batch_size = 4
grad_accumulation = 4
use_gradient_checkpointing = "unsloth"
rank = 32
Performance Tuning¶
Speed Optimization¶
[fine_tuner]
# Enable packing for 5x speed improvement on short sequences
packing = true
# Use flash attention
use_flash_attention = true
# Optimize data loading
dataset_num_proc = 8 # Match your CPU cores
# Efficient precision
dtype = "null" # Auto-select best precision
Quality Optimization¶
[fine_tuner]
# Higher LoRA rank for better quality
rank = 64
lora_alpha = 32
# More training epochs
epochs = 50
# Lower learning rate for stability
learning_rate = 0.0001
# Add validation dataset
validation_data_id = "your-username/validation-dataset"
Common Configuration Patterns¶
Research/Experimentation¶
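A quick-iteration setup keeps runs short and skips publishing intermediate models. The values below are an illustrative sketch, not project defaults:
[fine_tuner]
epochs = 3
device_train_batch_size = 2
grad_accumulation = 4
learning_rate = 0.0002
save_steps = 50
save_total_limit = 1
push_to_hub = false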
Production Training¶
[fine_tuner]
epochs = 30
device_train_batch_size = 8
push_to_hub = true
save_steps = 100
save_total_limit = 5
Memory-Constrained¶
[fine_tuner]
load_in_4bit = true
device_train_batch_size = 1
grad_accumulation = 32
use_gradient_checkpointing = "unsloth"
max_sequence_length = 1024
Troubleshooting¶
Out of Memory Errors¶
- Reduce device_train_batch_size
- Increase grad_accumulation to maintain the effective batch size
- Reduce max_sequence_length
- Enable use_gradient_checkpointing
- Use a smaller model or more aggressive quantization
Slow Training¶
- Enable packing = true
- Enable use_flash_attention = true
- Increase dataset_num_proc
- Use larger device_train_batch_size if memory allows
- Consider using a smaller model for prototyping
Poor Quality Results¶
- Increase rank and lora_alpha
- Add validation dataset
- Increase epochs
- Lower learning_rate
- Check data quality and format