API Reference¶
This page provides detailed API documentation for the Fine-Tune Pipeline components.
Core Classes¶
FineTune¶
The main class for fine-tuning language models.
app.finetuner.FineTune¶
apply_chat_template_to_conversations(data_rows)¶
Format the conversations for training by applying the chat template.
Args:
data_rows (dict): A batch of data_rows from the dataset.
Returns:
dict: A dictionary containing the formatted text for training.
Example:
{
    "text": [
        "<|im_start|>system\nSystem prompt here<|im_end|>\n<|im_start|>user\nUser question here<|im_end|>\n<|im_start|>assistant\n",
        "<|im_start|>system\nSystem prompt here<|im_end|>\n<|im_start|>user\nUser question here<|im_end|>\n<|im_start|>assistant\n",
        ...
    ]
}
convert_a_data_row_to_conversation_format(data_row)¶
Convert a single data_row to a conversation format.
Args:
data_row (dict): A single data_row from the dataset.
Returns:
dict: A dictionary containing the conversation format (the system prompt is optional).
Example:
{
    "conversations": [
        {"role": "system", "content": "System prompt here"},
        {"role": "user", "content": "User question here"},
        {"role": "assistant", "content": "Assistant answer here"}
    ]
}
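A minimal sketch of how these two helpers can be chained over a dataset; the tuner and dataset objects here are illustrative, and only the methods documented above are used:
# Illustrative only: assumes `tuner` is an initialized FineTune instance and
# `dataset` is a loaded Hugging Face Dataset with the configured columns.
conversations = dataset.map(tuner.convert_a_data_row_to_conversation_format)
texts = conversations.map(
    tuner.apply_chat_template_to_conversations,
    batched=True,  # this method expects a batch of data_rows
)
print(texts[0]["text"])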
get_columns_to_remove(dataset, dataset_id)¶
Get the columns to remove from the dataset based on the configuration.
Args:
dataset_columns (list[str]): The list of columns in the dataset.
Returns:
list[str]: The list of columns to remove.
get_peft_model()¶
Convert the loaded model to a PEFT (Parameter-Efficient Fine-Tuning) model.
Returns:
FastLanguageModel: The PEFT model.
handle_wandb_setup()¶
Handle the setup for Weights & Biases (wandb) logging.
Returns:
str: The run name used for the Weights & Biases run.
load_base_model_and_tokenizer()¶
Load the base model and tokenizer from the specified model name.
Returns:
FastLanguageModel: The loaded model.
Tokenizer: The tokenizer associated with the model.
run()¶
Run the fine-tuning process.
Returns:
TrainerStats: The statistics from the training process.
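run() drives the whole fine-tuning process; the lower-level methods above can also be called step by step. The ordering below is an illustrative sketch, not a documented workflow:
# Illustrative sketch: each call corresponds to a method documented above.
tuner = FineTune()
model, tokenizer = tuner.load_base_model_and_tokenizer()  # base model + tokenizer
peft_model = tuner.get_peft_model()                       # wrap with LoRA adapters
run_name = tuner.handle_wandb_setup()                     # configure wandb logging
stats = tuner.run()                                       # train and return TrainerStats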
ConfigManager¶
Centralized configuration management.
app.config_manager.ConfigManager¶
Centralized configuration manager for the pipeline.
get_section(section)¶
Get a specific configuration section.
get_value(section, key, default=None)¶
Get a specific configuration value.
load_config()¶
Load configuration from TOML file.
validate_dataclass_config(section, dataclass_type)¶
Validate that a section contains all fields required by a dataclass.
validate_section(section, required_keys)¶
Validate that a section contains all required keys.
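A short example of reading and validating configuration through these methods; the section and key names follow the schema shown later on this page, and the default value is illustrative:
from app.config_manager import ConfigManager

config_manager = ConfigManager("config.toml")

# Read a whole section, or a single value with a fallback default.
fine_tuner_section = config_manager.get_section("fine_tuner")
epochs = config_manager.get_value("fine_tuner", "epochs", default=1)

# Fail early if required keys are missing.
config_manager.validate_section("fine_tuner", ["base_model_id", "training_data_id"])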
Configuration Classes¶
FineTunerConfig¶
Configuration dataclass for fine-tuning parameters.
app.config_manager.FineTunerConfig (dataclass)¶
Utility Functions¶
load_huggingface_dataset¶
Load datasets from Hugging Face Hub.
app.utils.load_huggingface_dataset(dataset_id)¶
Load a dataset from Hugging Face. Whether the data is JSONL, CSV, Parquet, or any other format, it will be loaded as a Hugging Face Dataset.
Args:
dataset_id (str): The ID of the dataset on Hugging Face.
Returns:
datasets.arrow_dataset.Dataset: The loaded dataset.
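For example (the dataset ID is a placeholder):
from app.utils import load_huggingface_dataset

# Returns a datasets.arrow_dataset.Dataset regardless of the source file format.
dataset = load_huggingface_dataset("your-username/your-dataset")
print(dataset.column_names)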
login_huggingface¶
Authenticate with Hugging Face Hub.
app.utils.login_huggingface()¶
Log in to Hugging Face using the token from environment variables.
Raises:
ValueError: If the Hugging Face token is not set in the environment variables.
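Typical usage, assuming the HF_TOKEN environment variable (see the Environment Variables table below) is already set:
from app.utils import login_huggingface

# Raises ValueError if the Hugging Face token is missing from the environment.
login_huggingface()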
setup_run_name¶
Generate unique run names for experiments.
app.utils.setup_run_name(*, name, prefix='', suffix='')¶
Set up the run name for the training process. The run name is constructed from the base model ID, project name, and optional prefixes/suffixes.
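An illustrative call with keyword-only arguments; the exact format of the returned name depends on the implementation:
from app.utils import setup_run_name

run_name = setup_run_name(
    name="qwen2.5-0.5b-finetune",  # placeholder base name
    prefix="exp1-",
    suffix="-v2",
)
print(run_name)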
Usage Examples¶
Basic Fine-Tuning¶
from app.finetuner import FineTune
from app.config_manager import get_config_manager
# Initialize with default configuration
tuner = FineTune()
# Run fine-tuning
stats = tuner.run()
print(f"Training completed with stats: {stats}")
Custom Configuration¶
from app.config_manager import ConfigManager, FineTunerConfig
from app.finetuner import FineTune
# Load custom configuration
config_manager = ConfigManager("custom_config.toml")
config = FineTunerConfig.from_config(config_manager)
# Initialize tuner with custom config
tuner = FineTune(config_manager=config_manager)
# Access configuration
print(f"Base model: {config.base_model_id}")
print(f"Epochs: {config.epochs}")
print(f"Learning rate: {config.learning_rate}")
# Run training
stats = tuner.run()
Programmatic Configuration¶
from app.config_manager import FineTunerConfig
from app.finetuner import FineTune
# Create configuration programmatically
config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    epochs=3,
    learning_rate=0.0002,
    device_train_batch_size=4,
    rank=16,
    lora_alpha=16
)
# Use configuration
tuner = FineTune(config=config)
stats = tuner.run()
Error Handling¶
The API includes comprehensive error handling:
from app.finetuner import FineTune
from app.config_manager import ConfigManager
try:
    config_manager = ConfigManager("config.toml")
    tuner = FineTune(config_manager=config_manager)
    stats = tuner.run()
except FileNotFoundError as e:
    print(f"Configuration file not found: {e}")
except ValueError as e:
    print(f"Invalid configuration: {e}")
except RuntimeError as e:
    print(f"Training error: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
Type Hints¶
The API uses comprehensive type hints for better IDE support:
from typing import Dict, Any, Optional, Tuple
from transformers import PreTrainedTokenizerBase
from unsloth import FastLanguageModel
def load_model_and_tokenizer(
    model_id: str,
    max_seq_length: int = 4096,
    dtype: Optional[str] = None,
) -> Tuple[FastLanguageModel, PreTrainedTokenizerBase]:
    """Load model and tokenizer with type safety."""
    pass


def process_dataset(
    dataset_id: str,
    config: Dict[str, Any],
) -> Dict[str, Any]:
    """Process dataset with configuration."""
    pass
Constants and Enums¶
Model Types¶
Optimizers¶
Schedulers¶
SUPPORTED_SCHEDULERS = [
    "linear",
    "cosine",
    "cosine_with_restarts",
    "polynomial",
    "constant",
    "constant_with_warmup",
]
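The scheduler is chosen via the lr_scheduler_type field of the fine-tuner configuration; the other values below are illustrative and mirror the programmatic example above:
from app.config_manager import FineTunerConfig

# Any value from SUPPORTED_SCHEDULERS can be used for lr_scheduler_type.
config = FineTunerConfig(
    base_model_id="unsloth/Qwen2.5-0.5B-Instruct-bnb-4bit",
    training_data_id="your-username/your-dataset",
    lr_scheduler_type="cosine",
    warmup_steps=100,
)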
Configuration Schema¶
Complete Configuration Example¶
{
    "fine_tuner": {
        # Model configuration
        "base_model_id": str,
        "max_sequence_length": int,
        "dtype": Optional[str],
        "load_in_4bit": bool,
        "load_in_8bit": bool,
        "full_finetuning": bool,

        # LoRA configuration
        "rank": int,
        "lora_alpha": int,
        "lora_dropout": float,
        "target_modules": List[str],
        "bias": str,
        "use_rslora": bool,
        "loftq_config": Optional[str],

        # Dataset configuration
        "training_data_id": str,
        "validation_data_id": Optional[str],
        "dataset_num_proc": int,
        "question_column": str,
        "ground_truth_column": str,
        "system_prompt_column": Optional[str],
        "system_prompt_override_text": Optional[str],

        # Training configuration
        "epochs": int,
        "learning_rate": float,
        "device_train_batch_size": int,
        "device_validation_batch_size": int,
        "grad_accumulation": int,
        "warmup_steps": int,
        "optimizer": str,
        "weight_decay": float,
        "lr_scheduler_type": str,
        "seed": int,

        # Logging configuration
        "log_steps": int,
        "log_first_step": bool,
        "save_steps": int,
        "save_total_limit": int,
        "push_to_hub": bool,
        "report_to": str,
        "wandb_project_name": str,

        # Advanced configuration
        "packing": bool,
        "use_gradient_checkpointing": Union[bool, str],
        "use_flash_attention": bool,
        "train_on_responses_only": bool,
        "question_part": str,
        "answer_part": str,

        # Run naming
        "run_name": Optional[str],
        "run_name_prefix": str,
        "run_name_suffix": str
    }
}
Environment Variables¶
The API recognizes these environment variables:
Variable | Purpose | Required
---|---|---
HF_TOKEN | Hugging Face authentication | Yes
WANDB_TOKEN | Weights & Biases API key | Yes
OPENAI_API_KEY | OpenAI API key for evaluation | Optional
TRANSFORMERS_CACHE | HuggingFace cache directory | No
CUDA_VISIBLE_DEVICES | GPU device selection | No
Performance Tips¶
Memory Optimization¶
# For low-memory systems
config = {
    "load_in_4bit": True,
    "device_train_batch_size": 1,
    "grad_accumulation": 16,
    "use_gradient_checkpointing": "unsloth",
    "max_sequence_length": 1024
}
Speed Optimization¶
# For faster training
config = {
    "packing": True,
    "use_flash_attention": True,
    "dataset_num_proc": 8,
    "dtype": None  # Auto-select best precision
}