Evaluation

Understand model evaluation metrics and best practices for measuring performance.

Evaluation Metrics

LangTrain provides comprehensive evaluation tools including accuracy, F1-score, BLEU, ROUGE, and custom metrics for different tasks. Choose the right metrics for your specific use case.

Classification Metrics:
● Accuracy - Overall correctness
● Precision - Positive predictive value
● Recall - Sensitivity (true positive rate)
● F1-Score - Harmonic mean of precision and recall (see the sketch after this list)
● ROC-AUC - Area under the ROC curve
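
For reference, the sketch below shows how these classification metrics follow from confusion-matrix counts. It is plain Python, independent of LangTrain, and the function name and return format are illustrative only.

# Minimal sketch (plain Python, independent of LangTrain): how the
# classification metrics above follow from confusion-matrix counts.
# ROC-AUC is omitted because it needs predicted scores, not hard labels.
def binary_classification_metrics(predictions, labels):
    """Accuracy, precision, recall, and F1 for binary 0/1 predictions."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)

    accuracy = (tp + tn) / len(labels)                    # overall correctness
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # sensitivity / true positive rate
    f1 = (2 * precision * recall / (precision + recall)   # harmonic mean
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1_score': f1}

print(binary_classification_metrics([1, 0, 1, 1], [1, 0, 0, 1]))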


Text Generation Metrics:
● BLEU - Bilingual Evaluation Understudy
● ROUGE - Recall-Oriented Understudy for Gisting Evaluation
● METEOR - Metric for Evaluation of Translation with Explicit Ordering
● BERTScore - Semantic similarity using BERT embeddings

from langtrain import Evaluator

# Initialize evaluator
evaluator = Evaluator(task_type='text_classification')

# Built-in metrics
results = evaluator.evaluate(
    model=model,
    test_data=test_dataset,
    metrics=['accuracy', 'f1_score', 'precision', 'recall']
)

print(f"Accuracy: {results['accuracy']:.4f}")
print(f"F1-Score: {results['f1_score']:.4f}")

# For text generation tasks
gen_evaluator = Evaluator(task_type='text_generation')
gen_results = gen_evaluator.evaluate(
    model=model,
    test_data=test_dataset,
    metrics=['bleu', 'rouge', 'bert_score']
)
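
If you want to sanity-check generation scores outside LangTrain, the standalone sacrebleu and rouge-score packages compute BLEU and ROUGE directly. The snippet below is a minimal sketch that assumes both packages are installed; it is not part of the LangTrain API.

# Sketch: cross-checking BLEU and ROUGE with standalone libraries
# (assumes `pip install sacrebleu rouge-score`; not part of LangTrain).
import sacrebleu
from rouge_score import rouge_scorer

predictions = ["the cat sat on the mat"]
references = ["the cat is sitting on the mat"]

# Corpus-level BLEU; sacrebleu expects a list of reference lists.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")

# ROUGE-1 and ROUGE-L for a single prediction/reference pair.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(references[0], predictions[0])
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")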

Custom Metrics

Define custom evaluation metrics tailored to your specific requirements and domain.

# Define custom evaluation metric
def custom_domain_accuracy(predictions, labels, domain_weights):
    """Custom metric that weights accuracy by domain importance"""
    correct = 0
    total_weight = 0

    for pred, label, weight in zip(predictions, labels, domain_weights):
        if pred == label:
            correct += weight
        total_weight += weight

    return correct / total_weight if total_weight > 0 else 0

# Register custom metric
evaluator.register_metric('domain_accuracy', custom_domain_accuracy)

# Use in evaluation
results = evaluator.evaluate(
    model=model,
    test_data=test_dataset,
    metrics=['accuracy', 'domain_accuracy'],
    metric_params={'domain_accuracy': {'domain_weights': weights}}
)
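
The same registration pattern extends to generation-style metrics; the sketch below adds a simple exact-match metric to the gen_evaluator from earlier. The metric name and the case/whitespace normalization are illustrative choices, not part of the built-in metric set.

# Sketch: a second custom metric registered the same way (illustrative).
def exact_match(predictions, labels):
    """Fraction of predictions that match the reference exactly,
    ignoring case and surrounding whitespace."""
    matches = sum(1 for p, y in zip(predictions, labels)
                  if p.strip().lower() == y.strip().lower())
    return matches / len(labels) if labels else 0.0

gen_evaluator.register_metric('exact_match', exact_match)

gen_results = gen_evaluator.evaluate(
    model=model,
    test_data=test_dataset,
    metrics=['bleu', 'exact_match']
)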

Evaluation Strategies

Implement robust evaluation strategies including cross-validation, holdout testing, and temporal splits.

# Cross-validation evaluation
from langtrain.evaluation import CrossValidator

cv = CrossValidator(
    folds=5,
    stratified=True,
    random_state=42
)

cv_results = cv.evaluate(
    model=model,
    data=dataset,
    metrics=['accuracy', 'f1_score']
)

print(f"CV Accuracy: {cv_results['accuracy'].mean():.4f} ± {cv_results['accuracy'].std():.4f}")

# Temporal split for time-series data
from langtrain.evaluation import TemporalSplit

temporal_split = TemporalSplit(
    train_size=0.7,
    val_size=0.15,
    test_size=0.15,
    time_column='timestamp'
)

train, val, test = temporal_split.split(dataset)

# Evaluate on temporal test set
temporal_results = evaluator.evaluate(
    model=model,
    test_data=test,
    metrics=['accuracy', 'f1_score']
)
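
This page does not show a dedicated holdout helper, so as a stand-in the sketch below uses scikit-learn's train_test_split to carve out a stratified 20% holdout set before running the evaluator; the split size and the labels variable are illustrative assumptions.

# Sketch: stratified holdout split using scikit-learn as a stand-in
# (a LangTrain holdout helper is not shown on this page).
from sklearn.model_selection import train_test_split

train_data, holdout_data = train_test_split(
    dataset,
    test_size=0.2,      # hold out 20% for final evaluation (illustrative)
    stratify=labels,    # assumes class labels are available as `labels`
    random_state=42
)

holdout_results = evaluator.evaluate(
    model=model,
    test_data=holdout_data,
    metrics=['accuracy', 'f1_score']
)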

Model Comparison

Compare multiple models systematically with statistical significance testing.

# Compare multiple models
from langtrain.evaluation import ModelComparator

comparator = ModelComparator(
    models=[model1, model2, model3],
    model_names=['BERT', 'RoBERTa', 'DistilBERT']
)

comparison_results = comparator.compare(
    test_data=test_dataset,
    metrics=['accuracy', 'f1_score', 'inference_time'],
    statistical_test='mcnemar'  # McNemar's test for significance
)

# Generate comparison report
comparator.generate_report(
    results=comparison_results,
    output_path='model_comparison_report.html',
    include_plots=True
)

print(comparison_results.summary())
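
For reference, McNemar's test compares two classifiers on the same test set by looking at the examples where they disagree. The sketch below runs the test directly with statsmodels, independently of ModelComparator; preds_1, preds_2, and y_true are assumed aligned prediction and label sequences.

# Sketch: McNemar's test on paired predictions, using statsmodels directly
# (independent of ModelComparator; preds_1, preds_2, y_true are assumptions).
from statsmodels.stats.contingency_tables import mcnemar

both_right   = sum(1 for p1, p2, y in zip(preds_1, preds_2, y_true) if p1 == y and p2 == y)
only_1_right = sum(1 for p1, p2, y in zip(preds_1, preds_2, y_true) if p1 == y and p2 != y)
only_2_right = sum(1 for p1, p2, y in zip(preds_1, preds_2, y_true) if p1 != y and p2 == y)
both_wrong   = sum(1 for p1, p2, y in zip(preds_1, preds_2, y_true) if p1 != y and p2 != y)

# The test statistic depends only on the off-diagonal (disagreement) counts.
result = mcnemar([[both_right, only_1_right],
                  [only_2_right, both_wrong]],
                 exact=False, correction=True)
print(f"McNemar statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")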

Continuous Evaluation

Set up continuous evaluation pipelines to monitor model performance over time.

# Continuous evaluation setup
from langtrain.evaluation import ContinuousEvaluator

continuous_eval = ContinuousEvaluator(
    model=model,
    evaluation_schedule='daily',
    alert_thresholds={
        'accuracy': 0.85,  # Alert if accuracy drops below 85%
        'f1_score': 0.80
    }
)

# Monitor data drift
continuous_eval.enable_drift_detection(
    reference_data=training_data,
    drift_threshold=0.1
)

# Set up alerts
continuous_eval.configure_alerts(
    email=['team@company.com'],
    slack_webhook='https://hooks.slack.com/...'
)

# Start monitoring
continuous_eval.start()
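
As a rough illustration of what a feature-level drift check does, the sketch below compares a production feature distribution against the training reference with a two-sample Kolmogorov-Smirnov test from SciPy. The column name, the production_data variable, and the 0.1 threshold (mirroring drift_threshold above) are illustrative; this is not LangTrain's internal drift detector.

# Sketch: a feature-level drift check with a two-sample KS test (SciPy).
# Illustrative only; not LangTrain's internal drift detector.
from scipy.stats import ks_2samp

reference = training_data['input_length']     # hypothetical numeric feature
production = production_data['input_length']  # recent production traffic (assumed)

statistic, p_value = ks_2samp(reference, production)
if statistic > 0.1:                           # mirrors drift_threshold above
    print(f"Possible drift: KS statistic {statistic:.3f} (p={p_value:.4f})")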
