LoRA & QLoRA

Parameter-efficient fine-tuning with Low-Rank Adaptation and Quantized LoRA.

What is LoRA?

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that freezes pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.

LoRA significantly reduces the number of trainable parameters for downstream tasks. For a pre-trained model with weight matrix W₀, LoRA represents the weight update ΔW as the product of two low-rank matrices A and B: ΔW = BA.
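
To see how large the savings are, consider a single weight matrix: the trainable parameter count drops from d_out × d_in to r × (d_in + d_out). The short sketch below works through the arithmetic (the 4096×4096 layer size and the rank values are illustrative, not tied to any particular model).

# Back-of-the-envelope trainable-parameter count for one weight matrix
d_out, d_in = 4096, 4096                 # illustrative layer size
full = d_out * d_in                      # 16,777,216 params touched by full fine-tuning
for r in (4, 8, 64):
    lora = r * d_in + d_out * r          # params in A (r x d_in) plus B (d_out x r)
    print(f"r={r}: {lora:,} trainable params ({lora / full:.2%} of the full matrix)")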

During training, W₀ remains frozen and only A and B are updated. The modified forward pass becomes: h = W₀x + ΔWx = W₀x + BAx.

# Basic LoRA implementation concept
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1):
        super().__init__()
        self.rank = rank
        self.alpha = alpha

        # Frozen pre-trained weights
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad = False

        # LoRA matrices: A is (rank, in_features), B is (out_features, rank)
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        # Initialize A with Kaiming-uniform and B with zeros so that ΔW = BA starts at zero
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # Original forward pass + scaled LoRA adaptation
        result = F.linear(x, self.weight)
        lora_result = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return result + (self.alpha / self.rank) * lora_result
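
A quick usage sketch for the class above (the layer sizes are arbitrary): run a forward pass and list which parameters receive gradients to confirm that only the LoRA matrices are trainable.

# Sanity check: only lora_A and lora_B should be trainable
layer = LoRALinear(in_features=128, out_features=64, rank=8, alpha=16)
x = torch.randn(2, 128)
y = layer(x)                             # shape: (2, 64)

trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(trainable)                         # ['lora_A', 'lora_B']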

QLoRA: Quantized LoRA

QLoRA (Quantized LoRA) extends LoRA by quantizing the pre-trained model to 4-bit precision while keeping LoRA adapters in 16-bit. This enables fine-tuning of massive models on consumer hardware.

Key innovations in QLoRA:

  • 4-bit NormalFloat (NF4): An information-theoretically optimal quantization data type for normally distributed weights
  • Double Quantization: Quantizes the quantization constants themselves to save additional memory
  • Paged Optimizers: Pages optimizer states to CPU memory to absorb memory spikes during gradient checkpointing

QLoRA makes it possible to fine-tune a 65B parameter model on a single 48GB GPU, democratizing access to large language model customization.
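
As a rough sanity check on that figure (a back-of-the-envelope estimate only; real memory use also depends on sequence length, batch size, activations, and optimizer state):

# Approximate weight-memory footprint of a 65B-parameter base model
params = 65e9
gb_4bit = params * 0.5 / 1e9             # 4-bit quantized weights: ~0.5 bytes per parameter
gb_16bit = params * 2.0 / 1e9            # fp16/bf16 weights, for comparison
print(f"4-bit base weights:  ~{gb_4bit:.1f} GB")     # ~32.5 GB, fits on a 48 GB GPU with headroom
print(f"16-bit base weights: ~{gb_16bit:.1f} GB")    # ~130 GB, far beyond a single GPU

The 16-bit LoRA adapters and their optimizer states typically add only a few extra gigabytes on top of this, since they cover a small fraction of the parameters.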

# QLoRA fine-tuning with LangTrain
import torch
from langtrain import QLoRATrainer
from langtrain.models import AutoModelForCausalLM
from langtrain.datasets import load_dataset
from transformers import AutoTokenizer

# Load model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Configure QLoRA parameters
qlora_config = {
    "r": 64,                  # Rank
    "lora_alpha": 16,         # Scaling parameter
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.1,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

# Load and prepare dataset
dataset = load_dataset("your_dataset.jsonl")
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, padding=True))

# Initialize trainer
trainer = QLoRATrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    qlora_config=qlora_config,
    output_dir="./qlora_results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    save_steps=500,
    logging_steps=10,
)

# Start training
trainer.train()

Advanced Configuration

Fine-tune your LoRA/QLoRA setup with advanced parameters for optimal performance. The choice of rank (r), alpha, and target modules significantly impacts model quality and training efficiency.

Rank Selection: Higher ranks capture more information but increase the number of trainable parameters. Start with r=8-16 for most tasks; use r=64+ for complex domains.

Alpha Scaling: Controls the magnitude of LoRA updates. Use alpha=2*r as a starting point, then adjust based on validation performance.

Target Modules: Apply LoRA to attention layers (q_proj, v_proj) for most tasks. Include MLP layers for domain-specific knowledge.

# Advanced LoRA configuration
import torch

advanced_config = {
    # Core LoRA parameters
    "r": 32,                   # Rank - balance between efficiency and capacity
    "lora_alpha": 64,          # Scaling factor (typically 2*r)
    "lora_dropout": 0.05,      # Regularization

    # Target modules - customize based on model architecture
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP (for Llama-like models)
    ],

    # Advanced options
    "bias": "lora_only",                             # Train bias in LoRA layers
    "modules_to_save": ["embed_tokens", "lm_head"],  # Additional modules
    "init_lora_weights": True,                       # Proper initialization

    # QLoRA specific
    "load_in_4bit": True,
    "bnb_4bit_compute_dtype": torch.bfloat16,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# Training hyperparameters
training_args = {
    "output_dir": "./advanced_lora_results",
    "num_train_epochs": 5,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "save_strategy": "steps",
    "save_steps": 250,
    "eval_strategy": "steps",
    "eval_steps": 250,
    "logging_steps": 10,
    "fp16": False,
    "bf16": True,                     # Better numerical stability
    "dataloader_pin_memory": False,   # Memory optimization
    "remove_unused_columns": False,
}

Merging and Deployment

After training, you can merge LoRA adapters back into the base model for simplified deployment, or keep them separate for flexibility. Merged models have no inference overhead, while separate adapters allow easy swapping between different fine-tuned versions.

Merging Benefits: Single model file, no additional inference code, optimal for production.

Separate Adapters: Multiple task-specific adapters, easy A/B testing, smaller storage requirements.

# Method 1: Merge LoRA adapters into base model
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model, tokenizer, and adapter
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
model = PeftModel.from_pretrained(base_model, "./lora_results")

# Merge adapters
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")

# Method 2: Deploy with separate adapters
from langtrain import LoRAInference

# Initialize inference engine
inference = LoRAInference(
    base_model="meta-llama/Llama-3.1-8B",
    adapter_path="./lora_results",
    device="cuda",
    torch_dtype=torch.float16,
)

# Switch between different adapters dynamically
inference.load_adapter("task_1", "./task1_lora")
inference.load_adapter("task_2", "./task2_lora")

# Generate with specific adapter
response = inference.generate(
    "Hello, how are you?",
    adapter_name="task_1",
    max_length=100,
    temperature=0.7,
)

print(response)

Performance Optimization

Optimize your LoRA/QLoRA training for maximum performance and efficiency. Key strategies include gradient checkpointing, mixed precision training, and optimal batch sizing.

Memory Optimization: Use gradient checkpointing to trade compute for memory. Enable gradient_checkpointing=True for 40-50% memory reduction.

Speed Optimization: Use bf16 instead of fp16 for numerical stability. Increase batch size with gradient accumulation for better GPU utilization.

# Production-optimized training configuration
from langtrain import OptimizedQLoRATrainer
import torch

# Memory-efficient configuration
optimizer_config = {
    # Optimizer settings
    "optimizer": "adamw_torch_fused",   # Faster fused optimizer
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,

    # Memory optimizations
    "gradient_checkpointing": True,
    "dataloader_pin_memory": False,
    "dataloader_num_workers": 4,
    "remove_unused_columns": False,

    # Performance optimizations
    "bf16": True,                          # Better than fp16 for stability
    "tf32": True,                          # Enable TensorFloat-32 on A100
    "ddp_find_unused_parameters": False,

    # Batch size optimization
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,     # Effective batch size = 16
    "max_grad_norm": 1.0,
}

# Initialize optimized trainer
trainer = OptimizedQLoRATrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    **optimizer_config,
)

# Train (mixed precision is handled by the bf16 setting above)
trainer.train()

# Profile memory usage
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
