🤔 What is a Decision Tree?
A flowchart-like structure that makes decisions by asking yes/no questions about data features.
📚 Detailed Explanation:
Decision Trees are supervised learning algorithms that work like a series of if-else statements. They split data based on feature values to create a tree-like model of decisions.
How it thinks: "If age is greater than 25 AND income is above 50k, then approve the loan. Otherwise, reject it."
Real-world analogy: Like a doctor diagnosing a patient - they ask questions (symptoms) and follow a decision path to reach a diagnosis.
Example: Loan Approval System
Age > 25?
├── Yes → Income > 50k?
│ ├── Yes → Approve Loan ✅
│ └── No → Reject Loan ❌
└── No → Reject Loan ❌
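Reading the tree as code makes the "series of if-else statements" idea concrete. Here is a minimal illustrative sketch of the same logic in plain Python (hand-written rules, not a trained model):
def approve_loan(age, income):
    if age > 25:              # root node
        if income > 50000:    # internal node
            return "Approve"  # leaf node
        return "Reject"       # leaf node
    return "Reject"           # leaf node

print(approve_loan(30, 55000))  # Approve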
Key Components:
- Root Node: The top decision (Age > 25?)
- Internal Nodes: Decision points (Income > 50k?)
- Leaf Nodes: Final outcomes (Approve/Reject)
- Branches: Connections showing decision paths
💻 Quick Implementation
📖 Step-by-Step Code Explanation:
This example shows how to create a simple loan approval system using decision trees.
from sklearn.tree import DecisionTreeClassifier
import numpy as np
# Sample data: [Age, Income] → Loan Approval
# Each row represents one loan applicant
X = np.array([[25, 50000], [35, 60000], [22, 30000],
              [45, 80000], [28, 45000]])
y = np.array([0, 1, 0, 1, 0]) # 0=Reject, 1=Approve
# Create and train model
# max_depth=3 prevents overfitting by limiting tree depth
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)
# Predict for new applicant: 30 years, $55k
prediction = tree.predict([[30, 55000]])
print(f"Loan decision: {'Approved' if prediction[0] else 'Rejected'}")
🔍 Code Breakdown:
- Imports: Load DecisionTreeClassifier and NumPy
- Training data: X holds one [age, income] row per applicant
- Labels: y marks each applicant as rejected (0) or approved (1)
- Model: Create the classifier (max_depth=3 limits complexity) and fit it to X and y
- Prediction: Call predict on a new applicant and print the decision
Expected Output: With only five training examples, the tree can separate the two classes with a single income threshold somewhere between the highest rejected income ($50k) and the lowest approved one ($60k). The 30-year-old with $55k sits right at that boundary and would most likely be rejected. With data this small, changing even one example can flip the learned rule and the decision.
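To see the rule the tree actually learned, you can print it with export_text (continuing from the code above):
from sklearn.tree import export_text

# Display the learned decision rules as nested if/else text
print(export_text(tree, feature_names=['Age', 'Income']))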
🎯 Real Example: Email Spam Detection
📧 Practical Application Explained:
This example demonstrates how decision trees can classify emails as spam or normal based on email characteristics.
# Email features: [word_count, has_links, exclamation_marks]
# Each email is represented by these three numerical features
emails = np.array([
    [50, 0, 1],    # Normal email: 50 words, 0 links, 1 exclamation
    [200, 5, 8],   # Spam email: 200 words, 5 links, 8 exclamations
    [30, 1, 0],    # Normal email: 30 words, 1 link, 0 exclamations
    [150, 3, 12],  # Spam email: 150 words, 3 links, 12 exclamations
    [80, 1, 2]     # Normal email: 80 words, 1 link, 2 exclamations
])
labels = np.array([0, 1, 0, 1, 0]) # 0=Normal, 1=Spam
# Train spam detector
spam_tree = DecisionTreeClassifier()
spam_tree.fit(emails, labels)
# Check new email: 100 words, 2 links, 5 exclamations
new_email = [[100, 2, 5]]
result = spam_tree.predict(new_email)
print(f"Email classification: {'SPAM' if result[0] else 'NORMAL'}")
🧠 How the Algorithm Learns:
The decision tree analyzes patterns in the training data:
- Pattern 1: Emails with many exclamation marks (>5) tend to be spam
- Pattern 2: Emails with multiple links (>2) are often spam
- Pattern 3: Very long emails (150+ words) are suspicious
Decision Process: For the new email [100, 2, 5], every feature falls on the "normal" side of these thresholds: 5 exclamation marks is not more than 5, 2 links is not more than 2, and 100 words is well below the spam examples' counts. The tree would therefore most likely classify it as NORMAL.
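You can check which thresholds the tree actually picked by printing its learned rules (continuing from the code above):
from sklearn.tree import export_text

# Show the spam tree's learned splits and thresholds
print(export_text(spam_tree,
                  feature_names=['word_count', 'has_links', 'exclamation_marks']))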
⚡ Key Concepts
🔑 Important Terms:
- Root Node: Starting point - the first question asked
- Leaf Node: Final decision - no more questions
- Split: Decision point where data is divided
- Depth: Number of levels in the tree
- Pruning: Removing branches to prevent overfitting
- Gini Impurity: Measure of how mixed the classes are
📊 How it Works:
- Step 1: Finds best feature to split data
- Step 2: Creates two branches, one for each side of the split
- Step 3: Repeats the process until pure groups are formed
- Step 4: Makes predictions by following decision path
- Step 5: Uses majority vote in leaf nodes
- Step 6: Stops when stopping criteria met
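To make these steps concrete, here is a minimal, illustrative sketch of the recursive splitting loop. This is a toy teaching version, not scikit-learn's actual implementation:
import numpy as np

def gini(y):
    # Step 1 helper: impurity = 1 - sum of squared class frequencies
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    # Step 1: try every feature and threshold, keep the split with
    # the lowest weighted child impurity
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    # Steps 3-6: stop when the group is pure, depth runs out, or no
    # valid split exists; otherwise split and recurse on each side
    split = best_split(X, y)
    if len(np.unique(y)) == 1 or depth == max_depth or split is None:
        return {'leaf': np.bincount(y).argmax()}  # Step 5: majority vote
    _, f, t = split
    mask = X[:, f] <= t
    return {'feature': f, 'threshold': t,
            'left': build_tree(X[mask], y[mask], depth + 1, max_depth),
            'right': build_tree(X[~mask], y[~mask], depth + 1, max_depth)}

# Example: build on the loan arrays X and y from the first code example
print(build_tree(X, y))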
🎯 Splitting Criteria Explained:
Information Gain: Measures how much uncertainty is reduced by a split: IG = H(parent) − Σₖ (nₖ/n) · H(childₖ)
Gini Impurity: Measures the probability of misclassifying a randomly drawn sample: Gini = 1 − Σᵢ pᵢ²
Entropy: Measures randomness or disorder in the data: H = −Σᵢ pᵢ log₂ pᵢ
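Here is a quick sketch computing all three for a toy split of the loan labels (three rejects, two approves); the split shown is illustrative:
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum(p * log2(p)) over class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 1, 1])  # 3 rejects, 2 approves
left = np.array([0, 0, 0])          # one side after a split: pure
right = np.array([1, 1])            # other side: also pure

# Information gain = parent entropy - weighted average child entropy
ig = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                     - (len(right) / len(parent)) * entropy(right)
print(f"Gini(parent)     = {gini(parent):.3f}")     # 0.480
print(f"Entropy(parent)  = {entropy(parent):.3f}")  # 0.971
print(f"Information gain = {ig:.3f}")               # 0.971, a perfect split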
Example Decision Process:
1. Start with all data at root
2. Find best split (e.g., Age > 30?)
3. Split data into two groups
4. Repeat for each group until stopping condition
5. Assign class label to each leaf
🚀 Complete Working Example
🌤️ Weather Prediction System - Detailed Walkthrough:
This comprehensive example shows how to build, train, and evaluate a decision tree for predicting whether to play tennis based on weather conditions.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
# Weather data: [Temperature, Humidity, Wind] → Play Tennis?
# Temperature in Fahrenheit, Humidity as percentage, Wind (0=No, 1=Yes)
weather_data = np.array([
    [85, 85, 0],  # Hot, High humidity, No wind → Don't play
    [80, 90, 1],  # Hot, High humidity, Windy → Don't play
    [83, 78, 0],  # Hot, Normal humidity, No wind → Play
    [70, 96, 0],  # Mild, High humidity, No wind → Play
    [68, 80, 0],  # Cool, Normal humidity, No wind → Play
    [65, 70, 1],  # Cool, Normal humidity, Windy → Don't play
    [64, 65, 1],  # Cool, Normal humidity, Windy → Play
    [72, 95, 0],  # Mild, High humidity, No wind → Don't play
    [69, 70, 0],  # Cool, Normal humidity, No wind → Play
    [75, 80, 0],  # Mild, Normal humidity, No wind → Play
    [75, 70, 1],  # Mild, Normal humidity, Windy → Play
    [72, 90, 1]   # Mild, High humidity, Windy → Play
])
play_tennis = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1])
# Split data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    weather_data, play_tennis, test_size=0.3, random_state=42)
# Create decision tree with limited depth to prevent overfitting
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
# Make predictions on test set
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
# Show tree structure in text format
tree_rules = export_text(model, feature_names=['Temp', 'Humidity', 'Wind'])
print("Decision Rules:")
print(tree_rules[:200] + "...")
# Test with new weather conditions
new_weather = [[75, 75, 0]] # 75°F, 75% humidity, no wind
prediction = model.predict(new_weather)
print(f"Should play tennis? {'Yes' if prediction[0] else 'No'}")
📊 Understanding the Results:
- Accuracy Score: Percentage of correct predictions on test data
- Tree Rules: Shows the actual decision logic learned by the algorithm
- Feature Importance: Which weather factors matter most for the decision (see the sketch below)
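The trained model exposes this directly. A quick sketch, continuing from the code above:
# Feature importances sum to 1.0; higher means more influence on splits
for name, importance in zip(['Temp', 'Humidity', 'Wind'],
                            model.feature_importances_):
    print(f"{name}: {importance:.2f}")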
🔍 What the Tree Learns:
From data like this, the algorithm might discover patterns such as:
- "If humidity > 85%, usually don't play tennis"
- "If temperature is between 65-75°F and humidity < 80%, usually play"
- "Wind matters less than temperature and humidity"
✅ Pros & Cons
✅ Advantages:
- Easy to understand and interpret: Visual tree structure makes sense to humans
- Minimal data preprocessing: No feature scaling or normalization required
- Handles both numerical and categorical data: The algorithm works with mixed feature types (note that scikit-learn's implementation expects numeric inputs, so categorical features must be encoded first)
- Fast training and prediction: Efficient algorithms for building and using trees
- Feature selection automatic: Ignores irrelevant features naturally
- Non-parametric: No assumptions about data distribution
⚠️ Disadvantages:
- Can overfit easily: Creates overly complex trees that memorize training data
- Unstable: Small data changes can create completely different trees
- Biased toward features with more levels: Favors features with many possible values
- Not great for linear relationships: Struggles with simple linear patterns
- Can create biased trees: If some classes dominate the dataset
- Awkward with smooth trends: Axis-aligned threshold splits approximate continuous relationships with step functions, which can require many splits
🛠️ When to Use Decision Trees:
Best for:
- Problems where you need to explain the decision process
- Mixed data types (numerical and categorical)
- Quick prototyping and baseline models
- Rule extraction and business logic
Avoid when:
- You have very small datasets (prone to overfitting)
- Linear relationships are dominant
- You need the highest possible accuracy (use ensemble methods instead)
🎓 Summary & Next Steps
📝 Key Takeaways:
Decision Trees are like flowcharts that ask questions about your data to make predictions. They're intuitive, interpretable, and perfect for beginners!
🔑 Remember These Points:
- Interpretability: You can always explain why a decision was made
- Simplicity: Easy to implement and understand
- Versatility: Works with different types of data
- Foundation: Building block for more advanced algorithms
🚀 Next Steps in Your ML Journey:
🌲 Ensemble Methods:
- Random Forest: Multiple trees voting together (see the sketch just below)
- Gradient Boosting: Trees learning from mistakes
- XGBoost: Optimized gradient boosting
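As a preview, Random Forest is nearly a drop-in replacement in scikit-learn. A minimal sketch reusing the weather data split from the example above:
from sklearn.ensemble import RandomForestClassifier

# 100 trees trained on random subsets, predictions decided by vote
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Forest accuracy: {forest.score(X_test, y_test):.2f}")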
🔧 Advanced Techniques:
- Hyperparameter Tuning: Optimize tree parameters (sketched below with cross-validation)
- Feature Engineering: Create better input features
- Cross-Validation: Better model evaluation
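A small sketch combining the first and third of these, tuning tree parameters with cross-validated grid search on the weather data from earlier (the parameter grid is just an example):
from sklearn.model_selection import GridSearchCV

# Try each depth/min-split combination, scored by 3-fold cross-validation
params = {'max_depth': [2, 3, 4], 'min_samples_split': [2, 4]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(weather_data, play_tennis)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")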
💡 Practice Suggestions:
Try building decision trees for these problems:
- 🏠 House price prediction (regression tree)
- 🎬 Movie recommendation system
- 🏥 Medical diagnosis assistant
- 💰 Credit card fraud detection
- 📈 Stock market trend prediction
⚠️ Important Reminders:
Always remember to:
- Split your data into train/validation/test sets
- Use cross-validation for model evaluation (one-line sketch at the end of this section)
- Prevent overfitting with max_depth and min_samples_split
- Compare with other algorithms before final decision
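For example, cross-validation takes a single call in scikit-learn. A sketch using the weather data from the example above:
from sklearn.model_selection import cross_val_score

# Average accuracy over 3 folds is more reliable than a single split
scores = cross_val_score(DecisionTreeClassifier(max_depth=3),
                         weather_data, play_tennis, cv=3)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")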