API Reference

This page provides detailed API documentation for the Fine-Tune Pipeline components.

Core Classes

FineTune

The main class for fine-tuning language models.

app.finetuner.FineTune

apply_chat_template_to_conversations(data_rows)

    Format the conversations for training by applying the chat template.
    Args:
        data_rows (dict): A batch of data_rows from the dataset.
    Returns:
        dict: A dictionary containing the formatted text for training.
    Example:
        {
            "text": [
                "<|im_start|>system\nSystem prompt here<|im_end|>\n<|im_start|>user\nUser question here<|im_end|>\n<|im_start|>assistant\n",
                ...
            ]
        }
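
A minimal usage sketch (hypothetical variable names; assumes a FineTune instance and a Hugging Face dataset whose rows have first been mapped through convert_a_data_row_to_conversation_format, documented below):

from datasets import load_dataset
from app.finetuner import FineTune

tuner = FineTune()
dataset = load_dataset("your-username/your-dataset", split="train")  # placeholder dataset ID

# Map each row into the conversation format, then apply the chat template in batches
dataset = dataset.map(tuner.convert_a_data_row_to_conversation_format)
formatted = dataset.map(tuner.apply_chat_template_to_conversations, batched=True)
print(formatted[0]["text"])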

convert_a_data_row_to_conversation_format(data_row)

    Convert a single data_row to a conversation format.
    Args:
        data_row (dict): A single data_row from the dataset.
    Returns:
        dict: A dictionary containing the conversation format (the system prompt is optional).
    Example:
        {
            "conversations": [
                {"role": "system", "content": "System prompt here"},
                {"role": "user", "content": "User question here"},
                {"role": "assistant", "content": "Assistant answer here"}
            ]
        }

get_columns_to_remove(dataset, dataset_id)

    Get the columns to remove from the dataset based on the configuration.
    Args:
        dataset_columns (list[str]): The list of columns in the dataset.
    Returns:
        list[str]: The list of columns to remove.

get_peft_model()

    Convert the loaded model to a PEFT (Parameter-Efficient Fine-Tuning) model.
    Returns:
        FastLanguageModel: The PEFT model.

handle_wandb_setup()

    Handle the setup for Weights & Biases (wandb) logging.
    Returns:
        str: The run name used for the Weights & Biases run.

load_base_model_and_tokenizer()

    Load the base model and tokenizer from the specified model name.
    Returns:
        FastLanguageModel: The loaded model.
        Tokenizer: The tokenizer associated with the model.

run()

    Run the fine-tuning process.
    Returns:
        TrainerStats: The statistics from the training process.
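
These steps can also be invoked individually. A hypothetical sketch of one possible ordering (run() normally drives the whole process, and its internal sequence may differ; intermediate variable names are illustrative):

from app.finetuner import FineTune

tuner = FineTune()
model, tokenizer = tuner.load_base_model_and_tokenizer()  # load base weights and tokenizer
peft_model = tuner.get_peft_model()                       # wrap the base model with LoRA adapters
run_name = tuner.handle_wandb_setup()                     # requires WANDB_TOKEN in the environment
stats = tuner.run()                                       # execute the fine-tuning loop
print(run_name, stats)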

ConfigManager

Centralized configuration management.

app.config_manager.ConfigManager

Centralized configuration manager for the pipeline.

get_section(section)

Get a specific configuration section.

get_value(section, key, default=None)

Get a specific configuration value.

load_config()

Load configuration from TOML file.

validate_dataclass_config(section, dataclass_type)

Validate that a section contains all fields required by a dataclass.

validate_section(section, required_keys)

Validate that a section contains all required keys.
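
A brief usage sketch (the file name, keys, and required_keys values are placeholders; see the Configuration Schema section below for the expected fine_tuner fields):

from app.config_manager import ConfigManager

config_manager = ConfigManager("config.toml")
config_manager.load_config()  # explicit load; the constructor may already do this
fine_tuner_section = config_manager.get_section("fine_tuner")
epochs = config_manager.get_value("fine_tuner", "epochs", default=3)
config_manager.validate_section("fine_tuner", required_keys=["base_model_id", "training_data_id"])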

Configuration Classes

FineTunerConfig

Configuration dataclass for fine-tuning parameters.

app.config_manager.FineTunerConfig dataclass
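
The dataclass is typically built from a ConfigManager, as shown in the Custom Configuration example below:

from app.config_manager import ConfigManager, FineTunerConfig

config = FineTunerConfig.from_config(ConfigManager("config.toml"))
print(config.base_model_id, config.epochs)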

Utility Functions

load_huggingface_dataset

Load datasets from Hugging Face Hub.

app.utils.load_huggingface_dataset(dataset_id)

    Load a dataset from Hugging Face. Whether the data is JSONL, CSV, Parquet, or any other format, it is loaded as a Hugging Face Dataset.
    Args:
        dataset_id (str): The ID of the dataset on Hugging Face.
    Returns:
        datasets.arrow_dataset.Dataset: The loaded dataset.
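
A short usage sketch (the dataset ID is a placeholder):

from app.utils import load_huggingface_dataset

dataset = load_huggingface_dataset("your-username/your-dataset")
print(dataset.column_names, len(dataset))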

login_huggingface

Authenticate with Hugging Face Hub.

app.utils.login_huggingface()

    Log in to Hugging Face using the token from environment variables.
    Raises:
        ValueError: If the Hugging Face token is not set in the environment variables.
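
A short usage sketch; the token is normally supplied through the environment (see Environment Variables below) rather than hard-coded:

import os
from app.utils import login_huggingface

if not os.environ.get("HF_TOKEN"):
    raise ValueError("Set HF_TOKEN before logging in")
login_huggingface()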

setup_run_name

Generate unique run names for experiments.

app.utils.setup_run_name(*, name, prefix='', suffix='')

Set up the run name for the training process. The run name is constructed from the base model ID, project name, and optional prefixes/suffixes.
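
A short usage sketch (argument values are illustrative; note the keyword-only signature):

from app.utils import setup_run_name

run_name = setup_run_name(name="Qwen2.5-0.5B-Instruct", prefix="finetune-", suffix="-v1")
print(run_name)  # exact output format depends on the implementation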

Usage Examples

Basic Fine-Tuning

from app.finetuner import FineTune

# Initialize with default configuration
tuner = FineTune()

# Run fine-tuning
stats = tuner.run()
print(f"Training completed with stats: {stats}")

Custom Configuration

from app.config_manager import ConfigManager, FineTunerConfig
from app.finetuner import FineTune

# Load custom configuration
config_manager = ConfigManager("custom_config.toml")
config = FineTunerConfig.from_config(config_manager)

# Initialize tuner with custom config
tuner = FineTune(config_manager=config_manager)

# Access configuration
print(f"Base model: {config.base_model_id}")
print(f"Epochs: {config.epochs}")
print(f"Learning rate: {config.learning_rate}")

# Run training
stats = tuner.run()

Programmatic Configuration

from app.config_manager import FineTunerConfig
from app.finetuner import FineTune

# Create configuration programmatically
config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    epochs=3,
    learning_rate=0.0002,
    device_train_batch_size=4,
    rank=16,
    lora_alpha=16
)

# Use configuration
tuner = FineTune(config=config)
stats = tuner.run()

Error Handling

The API includes comprehensive error handling:

from app.finetuner import FineTune
from app.config_manager import ConfigManager

try:
    config_manager = ConfigManager("config.toml")
    tuner = FineTune(config_manager=config_manager)
    stats = tuner.run()

except FileNotFoundError as e:
    print(f"Configuration file not found: {e}")

except ValueError as e:
    print(f"Invalid configuration: {e}")

except RuntimeError as e:
    print(f"Training error: {e}")

except Exception as e:
    print(f"Unexpected error: {e}")

Type Hints

The API uses comprehensive type hints for better IDE support:

from typing import Dict, Any, Optional, Tuple
from transformers import PreTrainedTokenizerBase
from unsloth import FastLanguageModel

def load_model_and_tokenizer(
    model_id: str,
    max_seq_length: int = 4096,
    dtype: Optional[str] = None
) -> Tuple[FastLanguageModel, PreTrainedTokenizerBase]:
    """Load model and tokenizer with type safety."""
    pass

def process_dataset(
    dataset_id: str,
    config: Dict[str, Any]
) -> Dict[str, Any]:
    """Process dataset with configuration."""
    pass

Constants and Enums

Model Types

SUPPORTED_MODEL_TYPES = [
    "qwen",
    "llama", 
    "mistral",
    "gemma",
    "phi"
]

Optimizers

SUPPORTED_OPTIMIZERS = [
    "adamw_torch",
    "adamw_hf",
    "paged_adamw_8bit",
    "paged_adamw_32bit"
]

Schedulers

SUPPORTED_SCHEDULERS = [
    "linear",
    "cosine",
    "cosine_with_restarts",
    "polynomial",
    "constant",
    "constant_with_warmup"
]
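
A hypothetical validation sketch built on the lists above (the helper function is illustrative, not part of the package API):

def validate_training_options(optimizer: str, scheduler: str) -> None:
    """Reject optimizer/scheduler names that are not in the supported lists."""
    if optimizer not in SUPPORTED_OPTIMIZERS:
        raise ValueError(f"Unsupported optimizer: {optimizer}")
    if scheduler not in SUPPORTED_SCHEDULERS:
        raise ValueError(f"Unsupported scheduler: {scheduler}")

validate_training_options("paged_adamw_8bit", "cosine")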

Configuration Schema

Complete Configuration Example

{
    "fine_tuner": {
        # Model configuration
        "base_model_id": str,
        "max_sequence_length": int,
        "dtype": Optional[str],
        "load_in_4bit": bool,
        "load_in_8bit": bool,
        "full_finetuning": bool,

        # LoRA configuration
        "rank": int,
        "lora_alpha": int,
        "lora_dropout": float,
        "target_modules": List[str],
        "bias": str,
        "use_rslora": bool,
        "loftq_config": Optional[str],

        # Dataset configuration
        "training_data_id": str,
        "validation_data_id": Optional[str],
        "dataset_num_proc": int,
        "question_column": str,
        "ground_truth_column": str,
        "system_prompt_column": Optional[str],
        "system_prompt_override_text": Optional[str],

        # Training configuration
        "epochs": int,
        "learning_rate": float,
        "device_train_batch_size": int,
        "device_validation_batch_size": int,
        "grad_accumulation": int,
        "warmup_steps": int,
        "optimizer": str,
        "weight_decay": float,
        "lr_scheduler_type": str,
        "seed": int,

        # Logging configuration
        "log_steps": int,
        "log_first_step": bool,
        "save_steps": int,
        "save_total_limit": int,
        "push_to_hub": bool,
        "report_to": str,
        "wandb_project_name": str,

        # Advanced configuration
        "packing": bool,
        "use_gradient_checkpointing": Union[bool, str],
        "use_flash_attention": bool,
        "train_on_responses_only": bool,
        "question_part": str,
        "answer_part": str,

        # Run naming
        "run_name": Optional[str],
        "run_name_prefix": str,
        "run_name_suffix": str
    }
}

Environment Variables

The API recognizes these environment variables:

Variable               Purpose                          Required
HF_TOKEN               Hugging Face authentication      Yes
WANDB_TOKEN            Weights & Biases API key         Yes
OPENAI_API_KEY         OpenAI API key for evaluation    No
TRANSFORMERS_CACHE     Hugging Face cache directory     No
CUDA_VISIBLE_DEVICES   GPU device selection             No
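
A small sketch for checking the required variables up front (variable names follow the table above):

import os

required = ["HF_TOKEN", "WANDB_TOKEN"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise ValueError(f"Missing required environment variables: {', '.join(missing)}")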

Performance Tips

Memory Optimization

# For low-memory systems
config = {
    "load_in_4bit": True,
    "device_train_batch_size": 1,
    "grad_accumulation": 16,
    "use_gradient_checkpointing": "unsloth",
    "max_sequence_length": 1024
}
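
One hypothetical way to apply these overrides, assuming FineTunerConfig accepts the same field names as keyword arguments (see the Programmatic Configuration example above):

from app.config_manager import FineTunerConfig

low_memory_config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    **config,  # the low-memory overrides defined above
)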

Speed Optimization

# For faster training
config = {
    "packing": True,
    "use_flash_attention": True,
    "dataset_num_proc": 8,
    "dtype": None  # Auto-select best precision
}

Quality Optimization

# For better results
config = {
    "rank": 64,
    "lora_alpha": 32,
    "epochs": 5,
    "learning_rate": 0.0001,
    "validation_data_id": "your-validation-set"
}