Machine Learning Evaluation Metrics

Complete Guide with Interactive Examples

1. Confusion Matrix Fundamentals

The confusion matrix is the foundation for understanding classification metrics. It shows how well your model distinguishes between different classes by comparing predicted vs actual labels.

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
Understanding the Components:

True Positives (TP): Correctly predicted positive cases

True Negatives (TN): Correctly predicted negative cases

False Positives (FP): Incorrectly predicted as positive (Type I error)

False Negatives (FN): Incorrectly predicted as negative (Type II error)

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
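
Note that scikit-learn orders the matrix by label value (row = actual class, column = predicted class), so for binary labels the layout is [[TN, FP], [FN, TP]], with the negative class listed first, unlike the table above. A minimal sketch of pulling the four counts out of the cm computed above with ravel():

# Unpack the binary confusion matrix: sklearn's row/column order is
# [[TN, FP], [FN, TP]] because labels are sorted (0 before 1).
TN, FP, FN, TP = cm.ravel()
print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")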

2. Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Measures overall correctness of predictions
When to Use Accuracy:

✅ Good for: Balanced datasets where all classes are equally important

❌ Avoid when: Dataset is imbalanced (e.g., 95% negative, 5% positive); accuracy can look high while the minority class is ignored (see the baseline sketch after the code below)

Example: In a balanced email classification (50% spam, 50% not spam), 85% accuracy means the model correctly classifies 85 out of 100 emails.

from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Manual calculation
TP = cm[1,1] # True Positives
TN = cm[0,0] # True Negatives
FP = cm[0,1] # False Positives
FN = cm[1,0] # False Negatives

manual_accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Manual Accuracy: {manual_accuracy:.4f}")

3. Precision

Precision = TP / (TP + FP)
Of all positive predictions, how many were actually positive?
When Precision Matters:

High precision needed when: False positives are costly

Examples:

• Spam detection: Don't want important emails marked as spam

• Medical diagnosis: Avoid unnecessary treatments

• Fraud detection: Don't block legitimate transactions

Trade-off: Higher precision often means lower recall

from sklearn.metrics import precision_score

# Calculate precision
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")

# Manual calculation
manual_precision = TP / (TP + FP) if (TP + FP) > 0 else 0
print(f"Manual Precision: {manual_precision:.4f}")

4. Recall (Sensitivity)

Recall = TP / (TP + FN)
Of all actual positives, how many did we correctly identify?
When Recall Matters:

High recall needed when: False negatives are costly

Examples:

• Cancer screening: Don't miss any cancer cases

• Security systems: Detect all threats

• Search engines: Find all relevant results

Trade-off: Higher recall often means lower precision

Also called: Sensitivity, True Positive Rate

from sklearn.metrics import recall_score

# Calculate recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")

# Manual calculation
manual_recall = TP / (TP + FN) if (TP + FN) > 0 else 0
print(f"Manual Recall: {manual_recall:.4f}")

5. F1-Score

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall
Why F1-Score is Important:

Balances precision and recall: Single metric that considers both

Harmonic mean: Penalizes a large gap between precision and recall more heavily than the arithmetic mean would (see the comparison sketch after the code below)

Best for: Imbalanced datasets where you need both precision and recall

Range: 0 to 1 (higher is better)

When F1 = 1: Perfect precision and recall

When F1 = 0: Either precision or recall is 0

from sklearn.metrics import f1_score

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")

# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"Manual F1-Score: {manual_f1:.4f}")

6. R² Score (Coefficient of Determination)

R² = 1 - (SS_res / SS_tot)
For regression: proportion of variance explained by the model
Understanding R² Score:

Range: -∞ to 1 (higher is better)

R² = 1: Perfect model (explains 100% of variance)

R² = 0: Model performs as well as predicting the mean

R² < 0: Model performs worse than predicting the mean

SS_res: Sum of squares of residuals (prediction errors)

SS_tot: Total sum of squares (variance in data)

Interpretation: R² = 0.85 means model explains 85% of data variance

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import make_regression

# Generate regression data
X_reg, y_reg = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)

# Train regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)

# Calculate R² score
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")

# Manual calculation
ss_res = np.sum((y_test_reg - y_pred_reg) ** 2)
ss_tot = np.sum((y_test_reg - np.mean(y_test_reg)) ** 2)
manual_r2 = 1 - (ss_res / ss_tot)
print(f"Manual R² Score: {manual_r2:.4f}")

7. Complete Classification Report

The classification report provides a comprehensive view of your model's performance across all metrics and classes.

Report Components:

Per-class metrics: Precision, recall, F1-score for each class

Support: Number of actual instances of each class

Macro avg: Unweighted mean (treats all classes equally)

Weighted avg: Weighted by support (accounts for class imbalance)

Overall accuracy: Total correct predictions / total predictions

# Generate comprehensive report
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print("Classification Report:")
print(report)

# Summary of all metrics
print("\n=== METRIC SUMMARY ===")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"True Positives: {TP}")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")

8. Interactive Metric Calculator

Practice calculating metrics with different confusion matrix values. Work through the scenarios below to see how the metrics change; a worked sketch follows the list.

Try These Scenarios:

High Precision, Low Recall: TP=40, FP=5, TN=45, FN=20

High Recall, Low Precision: TP=55, FP=25, TN=15, FN=5

Balanced Performance: TP=42, FP=8, TN=42, FN=8

Poor Performance: TP=20, FP=30, TN=20, FN=30

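The interactive form is not reproduced here, so the sketch below does the same job in plain Python: a small helper function (not part of scikit-learn) that derives all four classification metrics from raw counts, run over each scenario above:

def metrics_from_counts(tp, fp, tn, fn):
    """Derive accuracy, precision, recall, and F1 from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1

# The four scenarios listed above, as (TP, FP, TN, FN) tuples.
scenarios = {
    "High Precision, Low Recall": (40, 5, 45, 20),
    "High Recall, Low Precision": (55, 25, 15, 5),
    "Balanced Performance": (42, 8, 42, 8),
    "Poor Performance": (20, 30, 20, 30),
}

for name, (tp, fp, tn, fn) in scenarios.items():
    acc, prec, rec, f1 = metrics_from_counts(tp, fp, tn, fn)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")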