1. Confusion Matrix Fundamentals
The confusion matrix is the foundation for understanding classification metrics. It shows how well your model distinguishes between different classes by comparing predicted vs actual labels.
Understanding the Components:
True Positives (TP): Correctly predicted positive cases
True Negatives (TN): Correctly predicted negative cases
False Positives (FP): Incorrectly predicted as positive (Type I error)
False Negatives (FN): Incorrectly predicted as negative (Type II error)
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
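In scikit-learn's convention, the rows of the matrix are the true labels and the columns are the predicted labels, with class 0 first. A minimal sketch of mapping the four cells back to the components above, reusing the cm computed in the previous snippet:
# Rows are true labels, columns are predicted labels (class 0 first),
# so for a binary problem cm.ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Total test samples: {tn + fp + fn + tp}")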
2. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Measures overall correctness of predictions
When to Use Accuracy:
✅ Good for: Balanced datasets where all classes are equally important
❌ Avoid when: Dataset is imbalanced (e.g., 95% negative, 5% positive)
Example: In a balanced email classification (50% spam, 50% not spam), 85% accuracy means the model correctly classifies 85 out of 100 emails.
from sklearn.metrics import accuracy_score
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Manual calculation
TP = cm[1,1] # True Positives
TN = cm[0,0] # True Negatives
FP = cm[0,1] # False Positives
FN = cm[1,0] # False Negatives
manual_accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Manual Accuracy: {manual_accuracy:.4f}")
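To see why accuracy misleads on imbalanced data, consider a hypothetical 95%/5% split: a model that always predicts the negative class reaches 95% accuracy while identifying none of the positives. A minimal sketch (the labels here are synthetic, for illustration only):
# Hypothetical imbalanced labels: 950 negatives, 50 positives
y_imbalanced = np.array([0] * 950 + [1] * 50)
# A "model" that always predicts the majority (negative) class
y_always_negative = np.zeros(1000, dtype=int)
naive_accuracy = accuracy_score(y_imbalanced, y_always_negative)
print(f"Accuracy of the always-negative model: {naive_accuracy:.4f}")  # 0.9500
print(f"Positives detected: {np.sum((y_imbalanced == 1) & (y_always_negative == 1))}")  # 0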
3. Precision
Precision = TP / (TP + FP)
Of all positive predictions, how many were actually positive?
When Precision Matters:
High precision needed when: False positives are costly
Examples:
• Spam detection: Don't want important emails marked as spam
• Medical diagnosis: Avoid unnecessary treatments
• Fraud detection: Don't block legitimate transactions
Trade-off: Higher precision often means lower recall
from sklearn.metrics import precision_score
# Calculate precision
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision:.4f}")
# Manual calculation
manual_precision = TP / (TP + FP) if (TP + FP) > 0 else 0
print(f"Manual Precision: {manual_precision:.4f}")
4. Recall (Sensitivity)
Recall = TP / (TP + FN)
Of all actual positives, how many did we correctly identify?
When Recall Matters:
High recall needed when: False negatives are costly
Examples:
• Cancer screening: Don't miss any cancer cases
• Security systems: Detect all threats
• Search engines: Find all relevant results
Trade-off: Higher recall often means lower precision
Also called: Sensitivity, True Positive Rate
from sklearn.metrics import recall_score
# Calculate recall
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall:.4f}")
# Manual calculation
manual_recall = TP / (TP + FN) if (TP + FN) > 0 else 0
print(f"Manual Recall: {manual_recall:.4f}")
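The precision/recall trade-off noted in the last two sections can be made concrete by moving the decision threshold on predicted probabilities: raising the threshold generally increases precision and lowers recall. A minimal sketch using the classifier trained earlier (the threshold values are arbitrary, chosen only for illustration):
# Predicted probability of the positive class
y_proba = model.predict_proba(X_test)[:, 1]
# Higher thresholds make positive predictions rarer: precision tends up, recall tends down
for threshold in [0.3, 0.5, 0.7]:
    y_pred_t = (y_proba >= threshold).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    print(f"Threshold {threshold:.1f}: precision={p:.4f}, recall={r:.4f}")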
5. F1-Score
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall
Why F1-Score is Important:
Balances precision and recall: Single metric that considers both
Harmonic mean: Dominated by the smaller of the two values, so it penalizes a large gap between precision and recall more than the arithmetic mean would
Best for: Imbalanced datasets where you need both precision and recall
Range: 0 to 1 (higher is better)
When F1 = 1: Perfect precision and recall
When F1 = 0: Either precision or recall is 0
from sklearn.metrics import f1_score
# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.4f}")
# Manual calculation
manual_f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"Manual F1-Score: {manual_f1:.4f}")
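To see how the harmonic mean punishes a lopsided pair, compare it to the arithmetic mean for hypothetical values precision = 0.9 and recall = 0.1: the arithmetic mean still reads 0.5, while the F1-score collapses toward the weaker metric.
# Hypothetical lopsided precision/recall pair
p_demo, r_demo = 0.9, 0.1
arithmetic_mean = (p_demo + r_demo) / 2
f1_demo = 2 * p_demo * r_demo / (p_demo + r_demo)
print(f"Arithmetic mean: {arithmetic_mean:.4f}")  # 0.5000
print(f"F1 (harmonic mean): {f1_demo:.4f}")       # 0.1800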
6. R² Score (Coefficient of Determination)
R² = 1 - (SS_res / SS_tot)
For regression: proportion of variance explained by the model
Understanding R² Score:
Range: -∞ to 1 (higher is better)
R² = 1: Perfect model (explains 100% of variance)
R² = 0: Model performs as well as predicting the mean
R² < 0: Model performs worse than predicting the mean
SS_res: Sum of squares of residuals (prediction errors)
SS_tot: Total sum of squares (variance in data)
Interpretation: R² = 0.85 means the model explains 85% of the variance in the data
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import make_regression
# Generate regression data
X_reg, y_reg = make_regression(n_samples=1000, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
# Train regression model
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)
# Calculate R² score
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"R² Score: {r2:.4f}")
# Manual calculation
ss_res = np.sum((y_test_reg - y_pred_reg) ** 2)
ss_tot = np.sum((y_test_reg - np.mean(y_test_reg)) ** 2)
manual_r2 = 1 - (ss_res / ss_tot)
print(f"Manual R² Score: {manual_r2:.4f}")
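The R² = 0 and R² < 0 cases can be checked directly: predicting the mean of the test targets scores zero, and a constant prediction far from the data scores below zero. A small sketch reusing the regression data above:
# Predicting the mean of the test targets gives R² = 0
mean_prediction = np.full_like(y_test_reg, np.mean(y_test_reg))
print(f"R² of mean prediction: {r2_score(y_test_reg, mean_prediction):.4f}")
# A constant prediction far from the data gives a negative R²
bad_prediction = np.full_like(y_test_reg, np.mean(y_test_reg) + 100)
print(f"R² of a poor constant prediction: {r2_score(y_test_reg, bad_prediction):.4f}")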
7. Complete Classification Report
The classification report provides a comprehensive view of your model's performance across all metrics and classes.
Report Components:
Per-class metrics: Precision, recall, F1-score for each class
Support: Number of actual instances of each class
Macro avg: Unweighted mean (treats all classes equally)
Weighted avg: Weighted by support (accounts for class imbalance)
Overall accuracy: Total correct predictions / total predictions
# Generate comprehensive report
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print("Classification Report:")
print(report)
# Summary of all metrics
print("\n=== METRIC SUMMARY ===")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"True Positives: {TP}")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")
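The macro and weighted averages can also be read programmatically: classification_report accepts output_dict=True and returns the same numbers as a nested dictionary. A small sketch (on this roughly balanced dataset the two averages are close; they diverge as class imbalance grows):
# Same report as a nested dictionary for programmatic access
report_dict = classification_report(y_test, y_pred, output_dict=True)
print(f"Macro-averaged F1:    {report_dict['macro avg']['f1-score']:.4f}")
print(f"Weighted-averaged F1: {report_dict['weighted avg']['f1-score']:.4f}")
print(f"Support for class 1:  {report_dict['1']['support']}")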
8. Interactive Metric Calculator
Practice calculating metrics with different confusion matrix values. Try different scenarios to see how metrics change:
Try These Scenarios:
High Precision, Low Recall: TP=40, FP=5, TN=45, FN=20
High Recall, Low Precision: TP=55, FP=25, TN=15, FN=5
Balanced Performance: TP=42, FP=8, TN=42, FN=8
Poor Performance: TP=20, FP=30, TN=20, FN=30
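The scenarios above can be worked through with a small helper that computes all four classification metrics from raw confusion-matrix counts; a minimal sketch:
def metrics_from_counts(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1

# Scenarios listed above, as (TP, FP, TN, FN)
scenarios = {
    "High Precision, Low Recall": (40, 5, 45, 20),
    "High Recall, Low Precision": (55, 25, 15, 5),
    "Balanced Performance": (42, 8, 42, 8),
    "Poor Performance": (20, 30, 20, 30),
}
for name, (tp, fp, tn, fn) in scenarios.items():
    acc, prec, rec, f1 = metrics_from_counts(tp, fp, tn, fn)
    print(f"{name}: accuracy={acc:.3f}, precision={prec:.3f}, recall={rec:.3f}, F1={f1:.3f}")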