API Reference¶
This page provides detailed API documentation for the Fine-Tune Pipeline components.
Core Classes¶
FineTune¶
The main class for fine-tuning language models.
app.finetuner.FineTune¶
apply_chat_template_to_conversations(data_rows)¶
Format the conversations for training by applying the chat template.
Args:
data_rows (dict): A batch of data_rows from the dataset.
Returns:
dict: A dictionary containing the formatted text for training.
Example:
{
    "text": [
        "<|im_start|>system\nSystem prompt here<|im_end|>\n<|im_start|>user\nUser question here<|im_end|>\n<|im_start|>assistant\n",
        "<|im_start|>system\nSystem prompt here<|im_end|>\n<|im_start|>user\nUser question here<|im_end|>\n<|im_start|>assistant\n",
        ...
    ]
}
convert_a_data_row_to_conversation_format(data_row)¶
Convert a single data_row to a conversation format.
Args:
data_row (dict): A single data_row from the dataset.
Returns:
dict: A dictionary containing the conversation format (the system prompt is optional).
Example:
{
    "conversations": [
        {"role": "system", "content": "System prompt here"},
        {"role": "user", "content": "User question here"},
        {"role": "assistant", "content": "Assistant answer here"}
    ]
}
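A minimal sketch of how these two helpers can be chained over a dataset; the tuner and dataset objects here are illustrative, and only the methods documented above are used:
# Illustrative only: assumes `tuner` is an initialized FineTune instance and
# `dataset` is a loaded Hugging Face Dataset with the configured columns.
conversations = dataset.map(tuner.convert_a_data_row_to_conversation_format)
texts = conversations.map(
    tuner.apply_chat_template_to_conversations,
    batched=True,  # this method expects a batch of data_rows
)
print(texts[0]["text"])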
get_columns_to_remove(dataset, dataset_id)¶
Get the columns to remove from the dataset based on the configuration.
Args:
dataset_columns (list[str]): The list of columns in the dataset.
Returns:
list[str]: The list of columns to remove.
get_peft_model()¶
Convert the loaded model to a PEFT (Parameter-Efficient Fine-Tuning) model.
Returns:
FastLanguageModel: The PEFT model.
handle_wandb_setup()¶
Handle the setup for Weights & Biases (wandb) logging.
Returns:
str: The run name used for the Weights & Biases run.
load_base_model_and_tokenizer()¶
Load the base model and tokenizer from the specified model name.
Returns:
FastLanguageModel: The loaded model.
Tokenizer: The tokenizer associated with the model.
run()¶
Run the fine-tuning process.
Returns:
TrainerStats: The statistics from the training process.
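run() drives the whole fine-tuning process; the lower-level methods above can also be called step by step. The ordering below is an illustrative sketch, not a documented workflow:
# Illustrative sketch: each call corresponds to a method documented above.
tuner = FineTune()
model, tokenizer = tuner.load_base_model_and_tokenizer()  # base model + tokenizer
peft_model = tuner.get_peft_model()                       # wrap with LoRA adapters
run_name = tuner.handle_wandb_setup()                     # configure wandb logging
stats = tuner.run()                                       # train and return TrainerStats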
ConfigManager¶
Centralized configuration management.
app.config_manager.ConfigManager¶
Centralized configuration manager for the pipeline.
get_section(section)¶
Get a specific configuration section.
get_value(section, key, default=None)¶
Get a specific configuration value.
load_config()¶
Load configuration from TOML file.
validate_dataclass_config(section, dataclass_type)¶
Validate that a section contains all fields required by a dataclass.
validate_section(section, required_keys)¶
Validate that a section contains all required keys.
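A short example of reading and validating configuration through these methods; the section and key names follow the schema shown later on this page, and the default value is illustrative:
from app.config_manager import ConfigManager

config_manager = ConfigManager("config.toml")

# Read a whole section, or a single value with a fallback default.
fine_tuner_section = config_manager.get_section("fine_tuner")
epochs = config_manager.get_value("fine_tuner", "epochs", default=1)

# Fail early if required keys are missing.
config_manager.validate_section("fine_tuner", ["base_model_id", "training_data_id"])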
Configuration Classes¶
FineTunerConfig¶
Configuration dataclass for fine-tuning parameters.
app.config_manager.FineTunerConfig (dataclass)¶
Utility Functions¶
load_huggingface_dataset¶
Load datasets from Hugging Face Hub.
app.utils.load_huggingface_dataset(dataset_id)¶
Load a dataset from Hugging Face. Whether the data is JSONL, CSV, Parquet, or any other format, it will be loaded as a Hugging Face Dataset.
Args:
dataset_id (str): The ID of the dataset on Hugging Face.
Returns:
datasets.arrow_dataset.Dataset: The loaded dataset.
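For example (the dataset ID is a placeholder):
from app.utils import load_huggingface_dataset

# Returns a datasets.arrow_dataset.Dataset regardless of the source file format.
dataset = load_huggingface_dataset("your-username/your-dataset")
print(dataset.column_names)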
login_huggingface¶
Authenticate with Hugging Face Hub.
app.utils.login_huggingface()¶
Log in to Hugging Face using the token from environment variables.
Raises:
ValueError: If the Hugging Face token is not set in the environment variables.
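Typical usage, assuming the HF_TOKEN environment variable (see the Environment Variables table below) is already set:
from app.utils import login_huggingface

# Raises ValueError if the Hugging Face token is missing from the environment.
login_huggingface()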
setup_run_name¶
Generate unique run names for experiments.
app.utils.setup_run_name(*, name, prefix='', suffix='')¶
Set up the run name for the training process. The run name is constructed from the base model ID, project name, and optional prefixes/suffixes.
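An illustrative call with keyword-only arguments; the exact format of the returned name depends on the implementation:
from app.utils import setup_run_name

run_name = setup_run_name(
    name="qwen2.5-0.5b-finetune",  # placeholder base name
    prefix="exp1-",
    suffix="-v2",
)
print(run_name)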
Usage Examples¶
Basic Fine-Tuning¶
from app.finetuner import FineTune
from app.config_manager import get_config_manager
# Initialize with default configuration
tuner = FineTune()
# Run fine-tuning
stats = tuner.run()
print(f"Training completed with stats: {stats}")
Custom Configuration¶
from app.config_manager import ConfigManager, FineTunerConfig
from app.finetuner import FineTune
# Load custom configuration
config_manager = ConfigManager("custom_config.toml")
config = FineTunerConfig.from_config(config_manager)
# Initialize tuner with custom config
tuner = FineTune(config_manager=config_manager)
# Access configuration
print(f"Base model: {config.base_model_id}")
print(f"Epochs: {config.epochs}")
print(f"Learning rate: {config.learning_rate}")
# Run training
stats = tuner.run()
Programmatic Configuration¶
from app.config_manager import FineTunerConfig
from app.finetuner import FineTune
# Create configuration programmatically
config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    epochs=3,
    learning_rate=0.0002,
    device_train_batch_size=4,
    rank=16,
    lora_alpha=16
)
# Use configuration
tuner = FineTune(config=config)
stats = tuner.run()
Error Handling¶
The API includes comprehensive error handling:
from app.finetuner import FineTune
from app.config_manager import ConfigManager
try:
    config_manager = ConfigManager("config.toml")
    tuner = FineTune(config_manager=config_manager)
    stats = tuner.run()
except FileNotFoundError as e:
    print(f"Configuration file not found: {e}")
except ValueError as e:
    print(f"Invalid configuration: {e}")
except RuntimeError as e:
    print(f"Training error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Type Hints¶
The API uses comprehensive type hints for better IDE support:
from typing import Dict, Any, Optional, Tuple
from transformers import PreTrainedTokenizerBase
from unsloth import FastLanguageModel
def load_model_and_tokenizer(
    model_id: str,
    max_seq_length: int = 4096,
    dtype: Optional[str] = None,
) -> Tuple[FastLanguageModel, PreTrainedTokenizerBase]:
    """Load model and tokenizer with type safety."""
    pass


def process_dataset(
    dataset_id: str,
    config: Dict[str, Any],
) -> Dict[str, Any]:
    """Process dataset with configuration."""
    pass
Constants and Enums¶
Model Types¶
Optimizers¶
Schedulers¶
SUPPORTED_SCHEDULERS = [
    "linear",
    "cosine",
    "cosine_with_restarts",
    "polynomial",
    "constant",
    "constant_with_warmup",
]
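The scheduler is chosen via the lr_scheduler_type field of the fine-tuner configuration; the other values below are illustrative and mirror the programmatic example above:
from app.config_manager import FineTunerConfig

# Any value from SUPPORTED_SCHEDULERS can be used for lr_scheduler_type.
config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    lr_scheduler_type="cosine",
    warmup_steps=100,
)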
Configuration Schema¶
Complete Configuration Example¶
{
    "fine_tuner": {
        # Model configuration
        "base_model_id": str,
        "max_sequence_length": int,
        "dtype": Optional[str],
        "load_in_4bit": bool,
        "load_in_8bit": bool,
        "full_finetuning": bool,

        # LoRA configuration
        "rank": int,
        "lora_alpha": int,
        "lora_dropout": float,
        "target_modules": List[str],
        "bias": str,
        "use_rslora": bool,
        "loftq_config": Optional[str],

        # Dataset configuration
        "training_data_id": str,
        "validation_data_id": Optional[str],
        "dataset_num_proc": int,
        "question_column": str,
        "ground_truth_column": str,
        "system_prompt_column": Optional[str],
        "system_prompt_override_text": Optional[str],

        # Training configuration
        "epochs": int,
        "learning_rate": float,
        "device_train_batch_size": int,
        "device_validation_batch_size": int,
        "grad_accumulation": int,
        "warmup_steps": int,
        "optimizer": str,
        "weight_decay": float,
        "lr_scheduler_type": str,
        "seed": int,

        # Logging configuration
        "log_steps": int,
        "log_first_step": bool,
        "save_steps": int,
        "save_total_limit": int,
        "push_to_hub": bool,
        "report_to": str,
        "wandb_project_name": str,

        # Advanced configuration
        "packing": bool,
        "use_gradient_checkpointing": Union[bool, str],
        "use_flash_attention": bool,
        "train_on_responses_only": bool,
        "question_part": str,
        "answer_part": str,

        # Run naming
        "run_name": Optional[str],
        "run_name_prefix": str,
        "run_name_suffix": str
    }
}
Environment Variables¶
The API recognizes these environment variables:
Variable | Purpose | Required
---|---|---
HF_TOKEN | Hugging Face authentication | Yes
WANDB_TOKEN | Weights & Biases API key | Yes
OPENAI_API_KEY | OpenAI API key for evaluation | Optional
TRANSFORMERS_CACHE | HuggingFace cache directory | No
CUDA_VISIBLE_DEVICES | GPU device selection | No
Performance Tips¶
Memory Optimization¶
# For low-memory systems
config = {
    "load_in_4bit": True,
    "device_train_batch_size": 1,
    "grad_accumulation": 16,
    "use_gradient_checkpointing": "unsloth",
    "max_sequence_length": 1024
}
Speed Optimization¶
# For faster training
config = {
    "packing": True,
    "use_flash_attention": True,
    "dataset_num_proc": 8,
    "dtype": None  # Auto-select best precision
}