Random Forest Algorithm
Ensemble Learning with Decision Trees
What is Random Forest?
Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. It uses bagging (bootstrap aggregating) and random feature selection to reduce overfitting and improve generalization.
🌳 Think of it like asking multiple experts:
Doctor A's opinion + Doctor B's opinion + Doctor C's opinion → Final diagnosis (majority vote)
Just like consulting multiple doctors gives a more reliable diagnosis, Random Forest combines multiple decision trees for better predictions!
Key Concepts:
- Ensemble Method: Combines predictions from multiple models
- Bootstrap Sampling: Each tree trained on random subset of data
- Random Feature Selection: Each split considers random subset of features
- Voting/Averaging: Final prediction based on majority vote or average
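The short NumPy sketch below illustrates the two mechanical ideas, bootstrap sampling and majority voting, on hypothetical toy values (the numbers are made up for illustration):
import numpy as np

rng = np.random.default_rng(42)

# Bootstrap sampling: draw n row indices *with replacement* from an n-row dataset
n_rows = 10
bootstrap_indices = rng.integers(0, n_rows, size=n_rows)
print("rows used by this tree:", bootstrap_indices)

# Voting: three hypothetical trees predict a class label for the same sample
tree_predictions = np.array([1, 0, 1])
print("majority vote:", np.bincount(tree_predictions).argmax())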
How Random Forest Works
📊 Visual Workflow:
Original Dataset → Bootstrap Sampling → Train Decision Trees → Combine Predictions
Algorithm Steps:
- Bootstrap Sampling: Create N bootstrap samples from training data
- Tree Training: Train decision tree on each bootstrap sample
- Random Features: At each split, consider only √p features (p = total features)
- Grow Trees: Grow trees to maximum depth (no pruning)
- Prediction: Combine predictions from all trees
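As a rough from-scratch sketch (not the scikit-learn implementation), the steps above can be written out directly; the iris data and the name n_trees are used here only for illustration:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_trees = 25
trees = []

for i in range(n_trees):
    # Step 1: bootstrap sample (same size as the original data, drawn with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Steps 2-4: fully grown tree that looks at sqrt(p) random features per split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Step 5: majority vote over the predictions of all trees
all_preds = np.array([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("ensemble accuracy on the training data:", (majority == y).mean())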
Random Forest vs Decision Trees
🔍 Side-by-side comparison of a single decision tree and a random forest:
| Aspect | Decision Tree | Random Forest |
|---|---|---|
| Overfitting | High tendency | Reduced overfitting |
| Accuracy | Good on training data | Better generalization |
| Interpretability | High (visual tree) | Lower (black box) |
| Training Time | Fast | Slower (multiple trees) |
| Variance | High | Low (averaging effect) |
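A quick way to check the accuracy and variance rows of this table is to cross-validate both models on the same data; the breast-cancer dataset below is used purely as an example:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
forest_scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=5)

# The forest usually shows a higher mean and a lower standard deviation (lower variance)
print(f"Decision tree: {tree_scores.mean():.3f} +/- {tree_scores.std():.3f}")
print(f"Random forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")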
Feature Importance
Random Forest provides feature importance scores by measuring how much each feature decreases impurity across all trees.
📊 Feature importance is usually visualized as a bar chart: higher bars indicate features that contribute more to the prediction.
Calculation Method:
- For each tree, sum the impurity decrease produced by every split that uses a given feature
- Average these values across all trees in the forest
- Normalize so the scores sum to 1, giving relative importance scores
Python Example:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

# Example data: any feature DataFrame and label Series work the same way
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance (mean decrease in impurity, one score per feature)
importance = rf.feature_importances_
feature_names = X_train.columns

# Create importance DataFrame, sorted from most to least important
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importance
}).sort_values('importance', ascending=False)
print(importance_df)
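To reproduce the bar chart described above, a minimal matplotlib follow-up (reusing importance_df from the snippet above) is:
import matplotlib.pyplot as plt

# Horizontal bar chart: longer bars = more important features
importance_df.plot.barh(x='feature', y='importance', legend=False)
plt.xlabel('Mean decrease in impurity')
plt.tight_layout()
plt.show()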
Classification vs Regression
Random Forest Classifier
- Prediction: Majority voting from all trees
- Output: Class labels or probabilities
- Use Cases: Email spam detection, image classification, medical diagnosis
Random Forest Regressor
- Prediction: Average of all tree predictions
- Output: Continuous numerical values
- Use Cases: House price prediction, stock prices, weather forecasting
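A minimal regressor sketch, using the bundled diabetes dataset purely as an example, shows the averaging behaviour:
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree predicts a number; the forest returns the average of those predictions
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train)
print("R^2 on the test set:", round(reg.score(X_test, y_test), 3))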
Scikit-Learn Implementation
🚀 Interactive Python Examples:
Click the button below to open the complete Jupyter notebook with all examples:
Includes: Classification, Regression, Feature Importance, Hyperparameter Tuning
Preview - Classification Example:
# Quick Preview - Full code in notebook
from sklearn.ensemble import RandomForestClassifier

# Assumes X_train, X_test, y_train come from an earlier train/test split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
# Click notebook button above for complete implementation!
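A typical next step (assuming y_test comes from the same train/test split) is to score those predictions:
from sklearn.metrics import accuracy_score, classification_report

print("accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
# Class probabilities are also available via rf.predict_proba(X_test)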
Benefits of Random Forest
🎯 High Accuracy
Ensemble approach typically outperforms single decision trees
🛡️ Overfitting Resistance
Averaging multiple trees reduces overfitting significantly
📊 Feature Importance
Provides built-in feature importance ranking
🔧 Handles Missing Values
Tolerates mixed feature types well; many implementations also handle missing values (scikit-learn may still require imputation or encoding)
⚡ Parallel Training
Trees can be trained independently in parallel
🎛️ Few Hyperparameters
Works well with default parameters, minimal tuning needed
Key Parameters
n_estimators
Number of trees in the forest. More trees = better performance but slower training.
Default: 100
max_depth
Maximum depth of trees. Controls overfitting.
Default: None (unlimited)
min_samples_split
Minimum samples required to split an internal node.
Default: 2
max_features
Number of features to consider for the best split.
Default: √(total features) for classification
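As an illustration only (the values below are arbitrary, not recommendations), these parameters are typically combined like this:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=200,        # more trees: usually better, always slower
    max_depth=10,            # cap tree depth to limit overfitting
    min_samples_split=5,     # require at least 5 samples before splitting a node
    max_features="sqrt",     # consider √p features at each split
    n_jobs=-1,               # train trees in parallel on all CPU cores
    random_state=42,
)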
🚀 Hands-on Practice
📊 Classification Example
Iris dataset classification with feature importance
📈 Regression Example
House price prediction with Random Forest
🔧 Hyperparameter Tuning
Optimize Random Forest performance (a minimal grid-search sketch follows this list)
⚖️ Model Comparison
Compare Decision Tree vs Random Forest
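For the tuning exercise above, a minimal grid-search sketch (the parameter grid is chosen only for illustration) could look like this:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))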
👨🏫 For Teachers:
All examples are ready to run in the accompanying Jupyter notebook. Students can modify parameters and see immediate results. Perfect for interactive teaching!