Random Forest Algorithm

Ensemble Learning with Decision Trees

What is Random Forest?

Random Forest is an ensemble learning method that combines multiple decision trees to create a more robust and accurate model. It uses bagging (bootstrap aggregating) and random feature selection to reduce overfitting and improve generalization.

🌳 Think of it like asking multiple experts: 👨‍⚕️ Doctor A's opinion + 👩‍⚕️ Doctor B's opinion + 👨‍⚕️ Doctor C's opinion → 🎯 Final diagnosis (majority vote)

Just like consulting multiple doctors gives a more reliable diagnosis, Random Forest combines multiple decision trees for better predictions!

Key Concepts:

  • Ensemble Method: Combines predictions from multiple models
  • Bootstrap Sampling: Each tree is trained on a random subset of the data
  • Random Feature Selection: Each split considers a random subset of features
  • Voting/Averaging: The final prediction is a majority vote (classification) or an average (regression)
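
Here is a minimal numpy sketch of the two sources of randomness named above, bootstrap sampling and random feature selection. The array sizes and variable names are purely illustrative:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))   # toy data: 100 samples, 8 features
n_samples, n_features = X.shape

# Bootstrap sampling: draw n_samples indices WITH replacement,
# so every tree sees a slightly different version of the data
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
X_bootstrap = X[bootstrap_idx]

# Random feature selection: at each split only sqrt(p) features
# are considered (p = 8 here, so 2 features per split)
k = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=k, replace=False)
print("Features considered at this split:", split_features)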

How Random Forest Works

📊 Visual Workflow: 📋 Original dataset → 🎲 Bootstrap sampling (Sample 1, Sample 2, Sample 3, ...) → 🌳 Train decision trees (Tree 1, Tree 2, Tree 3, ..., Tree 100) → 🗳️ Combine predictions (e.g., Tree 1: Class A, Tree 2: Class B, Tree 3: Class A → Final: Class A by majority vote)

Algorithm Steps:

  1. Bootstrap Sampling: Create N bootstrap samples from training data
  2. Tree Training: Train decision tree on each bootstrap sample
  3. Random Features: At each split, consider only a random subset of features (typically √p for classification, where p = total features)
  4. Grow Trees: Grow trees to maximum depth (no pruning)
  5. Prediction: Combine predictions from all trees
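
To make these steps concrete, here is a simplified from-scratch sketch built on scikit-learn's DecisionTreeClassifier. It assumes X and y are NumPy arrays and skips the refinements of the production implementation:

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_simple_forest(X, y, n_trees=100, random_state=42):
    """Steps 1-4: bootstrap sample, then grow one unpruned tree per sample."""
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # Step 1: bootstrap sample
        tree = DecisionTreeClassifier(
            max_features="sqrt",   # Step 3: random sqrt(p) features per split
            max_depth=None,        # Step 4: grow to full depth, no pruning
        )
        tree.fit(X[idx], y[idx])   # Step 2: train on the bootstrap sample
        trees.append(tree)
    return trees

def predict_simple_forest(trees, X):
    """Step 5: majority vote across all trees."""
    votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])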

Random Forest vs Decision Trees

🔍 Visual Comparison: A single decision tree partitions the data with one set of splits and carries a high overfitting risk (⚠️). A Random Forest instead collects votes from many trees, e.g. 🔴🔵🔴🔴🔵 → final 🔴 (3 of 5 votes), which gives a lower overfitting risk (✅).
Aspect           | Decision Tree          | Random Forest
Overfitting      | High tendency          | Reduced overfitting
Accuracy         | Good on training data  | Better generalization
Interpretability | High (visual tree)     | Lower (black box)
Training time    | Fast                   | Slower (multiple trees)
Variance         | High                   | Low (averaging effect)
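
The table can be checked empirically. Below is a quick sketch on a synthetic dataset from make_classification (the data and split are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# A single tree usually fits the training set perfectly but drops on the test set;
# the forest trades a little training accuracy for better generalization
print("Tree   train/test:", tree.score(X_train, y_train), tree.score(X_test, y_test))
print("Forest train/test:", forest.score(X_train, y_train), forest.score(X_test, y_test))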

Feature Importance

Random Forest provides feature importance scores by measuring how much each feature decreases impurity across all trees.

📊 Feature Importance Example: Age (0.35), Income (0.28), Education (0.20), Location (0.12), Gender (0.05)

📈 Higher scores = more important features for prediction

Calculation Method:

  • For each tree, calculate impurity decrease for each feature
  • Average across all trees in the forest
  • Normalize to get relative importance scores

Python Example:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get feature importance
importance = rf.feature_importances_
feature_names = X_train.columns

# Create importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importance
}).sort_values('importance', ascending=False)

print(importance_df)
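
Continuing with the rf fitted above, the forest-level score is essentially the per-tree impurity-based importance averaged over all trees and re-normalized, which you can check directly:

import numpy as np

# Average the impurity-based importances of the individual trees,
# then normalize so the scores sum to 1
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, rf.feature_importances_))  # True (up to floating point)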

Classification vs Regression

Random Forest Classifier

  • Prediction: Majority voting from all trees
  • Output: Class labels or probabilities
  • Use Cases: Email spam detection, image classification, medical diagnosis

Random Forest Regressor

  • Prediction: Average of all tree predictions
  • Output: Continuous numerical values
  • Use Cases: House price prediction, stock prices, weather forecasting
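
A minimal side-by-side sketch on synthetic data (the datasets are illustrative): the classifier aggregates by majority vote and can return class probabilities, while the regressor averages the individual tree predictions.

import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: majority vote, with class probabilities available
Xc, yc = make_classification(n_samples=500, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xc, yc)
print(clf.predict(Xc[:3]), clf.predict_proba(Xc[:3]))

# Regression: the forest prediction is the average of all tree predictions
Xr, yr = make_regression(n_samples=500, random_state=42)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(Xr, yr)
tree_preds = np.array([t.predict(Xr[:3]) for t in reg.estimators_])
print(np.allclose(tree_preds.mean(axis=0), reg.predict(Xr[:3])))  # True: averaging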

Scikit-Learn Implementation

🚀 Interactive Python Examples:

The accompanying Jupyter notebook contains the complete, runnable examples.

Includes: Classification, Regression, Feature Importance, Hyperparameter Tuning

Preview - Classification Example:

# Quick Preview - Full code in notebook
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

# See the accompanying notebook for the complete implementation

Benefits of Random Forest

🎯 High Accuracy

Ensemble approach typically outperforms single decision trees

🛡️ Overfitting Resistance

Averaging multiple trees reduces overfitting significantly

📊 Feature Importance

Provides built-in feature importance ranking

🔧 Handles Missing Values

Many implementations can handle missing values and mixed data types

⚡ Parallel Training

Trees can be trained independently in parallel (see the n_jobs sketch below)

🎛️ Few Hyperparameters

Works well with default parameters, minimal tuning needed
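
In scikit-learn, the parallel-training benefit noted above maps to the n_jobs parameter; a small sketch on a synthetic dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=42)

# n_jobs=-1 trains the trees on all available CPU cores;
# because the trees are independent, training parallelizes almost perfectly
rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
rf.fit(X, y)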

Key Parameters

n_estimators

Number of trees in the forest. More trees generally give more stable, slightly better results (with diminishing returns) but increase training time.

Default: 100

max_depth

Maximum depth of trees. Controls overfitting.

Default: None (unlimited)

min_samples_split

Minimum samples required to split internal node.

Default: 2

max_features

Number of features to consider for best split.

Default: 'sqrt' (√ of total features) for classification; all features for regression (scikit-learn defaults)
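
A sketch of tuning these parameters with GridSearchCV; the grid values below are illustrative starting points rather than recommended settings:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation
    n_jobs=-1,   # evaluate parameter combinations in parallel
)
search.fit(X, y)
print(search.best_params_, search.best_score_)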

🚀 Hands-on Practice

📊 Classification Example

Iris dataset classification with feature importance

📈 Regression Example

House price prediction with Random Forest

🔧 Hyperparameter Tuning

Optimize Random Forest performance

⚖️ Model Comparison

Compare Decision Tree vs Random Forest

👨‍🏫 For Teachers:

All examples are ready to run in the accompanying Jupyter notebook. Students can modify parameters and see immediate results. Perfect for interactive teaching!