Machine Learning Evaluation Metrics

Complete Guide with Interactive Examples

1. Confusion Matrix Fundamentals

The confusion matrix is the foundation for understanding classification metrics. It shows how well your model distinguishes between different classes by comparing predicted vs actual labels.

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
Understanding the Components:

True Positives (TP): Correctly predicted positive cases

True Negatives (TN): Correctly predicted negative cases

False Positives (FP): Incorrectly predicted as positive (Type I error)

False Negatives (FN): Incorrectly predicted as negative (Type II error)

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
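
Note that scikit-learn orders the matrix by label value (row = actual class, column = predicted class), so for binary labels the layout is [[TN, FP], [FN, TP]], with the negative class listed first, unlike the table above. A minimal sketch of pulling the four counts out of the cm computed above with ravel():

# Unpack the binary confusion matrix: sklearn's row/column order is
# [[TN, FP], [FN, TP]] because labels are sorted (0 before 1).
TN, FP, FN, TP = cm.ravel()
print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")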

2. Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Measures overall correctness of predictions
When to Use Accuracy:

✅ Good for: Balanced datasets where all classes are equally important

❌ Avoid when: Dataset is imbalanced (e.g., 95% negative, 5% positive); accuracy can look high while the minority class is ignored (see the baseline sketch after the code below)

Example: In a balanced email classification (50% spam, 50% not spam), 85% accuracy means the model correctly classifies 85 out of 100 emails.

from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Manual calculation
TP = cm[1,1] # True Positives
TN = cm[0,0] # True Negatives
FP = cm[0,1] # False Positives
FN = cm[1,0] # False Negatives

manual_accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Manual Accuracy: {manual_accuracy:.4f}")

3. Precision

Precision = TP / (TP + FP)
Of all positive predictions, how many were actually positive?
When Precision Matters:

High precision needed when: False positives are costly

Examples:

• Spam detection: Don't want important emails marked as spam

• Medical diagnosis: Avoid unnecessary treatments

• Fraud detection: Don't block legitimate transactions

Trade-off: Higher precision often means lower recall

from sklearn.metrics import precision_score

# Calculate precision
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

# Manual calculation
manual_precision = TP / (TP + FP) if (TP + FP) > 0 else 0
print(f"Manual Precision: {manual_precision:.4f}")

4. Recall (Sensitivity)

Recall = TP / (TP + FN)
Of all actual positives, how many did we correctly identify?
When Recall Matters:

High recall needed when: False negatives are costly

Examples:

• Cancer screening: Don't miss any cancer cases

• Security systems: Detect all threats

• Search engines: Find all relevant results

Trade-off: Higher recall often means lower precision

Also called: Sensitivity, True Positive Rate

from sklearn.metrics import recall_score

# Calculate recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")

# Manual calculation
manual_recall = TP / (TP + FN) if (TP + FN) > 0 else 0
print(f"Manual Recall: {manual_recall:.4f}")

5. F1-Score

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall
Why F1-Score is Important:

Balances precision and recall: Single metric that considers both

Harmonic mean: Penalizes a large gap between precision and recall more heavily than the arithmetic mean would (see the comparison sketch after the code below)

Best for: Imbalanced datasets where you need both precision and recall

Range: 0 to 1 (higher is better)

When F1 = 1: Perfect precision and recall

When F1 = 0: Either precision or recall is 0

from sklearn.metrics import f1_score

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")

# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"Manual F1-Score: {manual_f1:.4f}")

6. R² Score (Coefficient of Determination)

R² = 1 - (SS_res / SS_tot)
For regression: proportion of variance explained by the model
Understanding R² Score:

Range: -∞ to 1 (higher is better)

R² = 1: Perfect model (explains 100% of variance)

R² = 0: Model performs as well as predicting the mean

R² < 0: Model performs worse than predicting the mean

SS_res: Sum of squares of residuals (prediction errors)

SS_tot: Total sum of squares (variance in data)

Interpretation: R² = 0.85 means model explains 85% of data variance

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import make_regression

# Generate regression data
X_reg, y_reg = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Train regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)

# Calculate R² score
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")

# Manual calculation
ss_res = np.sum((y_test_reg - y_pred_reg) ** 2)
ss_tot = np.sum((y_test_reg - np.mean(y_test_reg)) ** 2)
manual_r2 = 1 - (ss_res / ss_tot)
print(f"Manual R² Score: {manual_r2:.4f}")

7. Complete Classification Report

The classification report provides a comprehensive view of your model's performance across all metrics and classes.

Report Components:

Per-class metrics: Precision, recall, F1-score for each class

Support: Number of actual instances of each class

Macro avg: Unweighted mean (treats all classes equally)

Weighted avg: Weighted by support (accounts for class imbalance)

Overall accuracy: Total correct predictions / total predictions

# Generate comprehensive report
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print("Classification Report:")
print(report)

# Summary of all metrics
print("\n=== METRIC SUMMARY ===")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"True Positives: {TP}")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")

8. Interactive Metric Calculator

Practice calculating metrics with different confusion matrix values. Work through the scenarios below to see how the metrics change; a worked sketch follows the list.

Try These Scenarios:

High Precision, Low Recall: TP=40, FP=5, TN=45, FN=20

High Recall, Low Precision: TP=55, FP=25, TN=15, FN=5

Balanced Performance: TP=42, FP=8, TN=42, FN=8

Poor Performance: TP=20, FP=30, TN=20, FN=30

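The interactive form is not reproduced here, so the sketch below does the same job in plain Python: a small helper function (not part of scikit-learn) that derives all four classification metrics from raw counts, run over each scenario above:

def metrics_from_counts(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1

# The four scenarios listed above, as (TP, FP, TN, FN) tuples.
scenarios = {
    "High Precision, Low Recall": (40, 5, 45, 20),
    "High Recall, Low Precision": (55, 25, 15, 5),
    "Balanced Performance": (42, 8, 42, 8),
    "Poor Performance": (20, 30, 20, 30),
}

for name, (tp, fp, tn, fn) in scenarios.items():
    acc, prec, rec, f1 = metrics_from_counts(tp, fp, tn, fn)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")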