Feature Selection Methods

Filter Methods, Wrapper Methods, Train-Test Split & Performance Evaluation

1. Introduction to Feature Selection

What is Feature Selection?

Feature selection is the process of selecting a subset of relevant features for model construction. It's crucial for:

  • Reducing overfitting: Fewer features = less complexity
  • Improving performance: Remove noise and irrelevant features
  • Reducing training time: Less data to process
  • Better interpretability: Focus on important features
  • Storage efficiency: Less memory requirements

Types of Feature Selection Methods

Filter Methods

Select features based on statistical measures independent of any machine learning algorithm

  • Fast and scalable
  • Model-independent
  • Good for preprocessing

Wrapper Methods

Use machine learning algorithms to evaluate feature subsets

  • Model-specific
  • More accurate
  • Computationally expensive

Embedded Methods

Feature selection occurs as part of the model's own training process (see the sketch after this list)

  • Built into algorithms
  • Efficient
  • Algorithm-specific
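
Embedded methods do not get a dedicated section below, so as a point of comparison here is a minimal sketch using L1-regularized logistic regression, whose training drives some coefficients to exactly zero; X_train and y_train are assumed to be defined as in the later examples.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# L1 penalty drives some coefficients to exactly zero during training,
# so zero-weight features are effectively dropped by the model itself
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

# Keep only the features with non-zero coefficients
selector_embedded = SelectFromModel(l1_model)
X_selected = selector_embedded.fit_transform(X_train, y_train)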

2. Filter Methods

Univariate Statistical Tests

1. F-test (ANOVA F-statistic)

Measures the linear dependency between each feature and the target; for classification (f_classif), the ANOVA F-statistic tests whether the feature's mean differs across classes.

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features using F-test
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
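
After fitting, the selector also exposes the per-feature scores and p-values, which can be useful for reporting; a short follow-up using the selector fitted above:

# Per-feature F-scores and p-values from the fitted selector
print(selector.scores_)
print(selector.pvalues_)

# Boolean mask marking the k selected columns
mask = selector.get_support()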

2. Chi-Square Test

Tests independence between categorical features and target (for non-negative features).

from sklearn.feature_selection import chi2

# Chi-square test for categorical features
selector_chi2 = SelectKBest(score_func=chi2, k=10)
X_selected = selector_chi2.fit_transform(X_train, y_train)

3. Mutual Information

Measures mutual dependence between features and target (captures non-linear relationships).

from sklearn.feature_selection import mutual_info_classif

# Mutual information for classification
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector_mi.fit_transform(X_train, y_train)

Filter Methods - Pros & Cons

Advantages:

  • Fast computation
  • Scalable to large datasets
  • Model-independent
  • Good for preprocessing

Disadvantages:

  • Ignores feature interactions
  • May select redundant features
  • No consideration of model performance

3. Wrapper Methods

Recursive Feature Elimination (RFE)

Repeatedly fits the estimator, removes the least important feature(s), and refits on the remaining ones until the desired number of features is reached.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# RFE with Random Forest
estimator = RandomForestClassifier(n_estimators=100)
selector = RFE(estimator, n_features_to_select=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get the names of the selected features
# (assumes feature_names is an array of the original column names)
selected_features = feature_names[selector.get_support()]

RFE with Cross-Validation (RFECV)

Automatically finds the optimal number of features using cross-validation.

from sklearn.feature_selection import RFECV

# RFECV automatically selects optimal number of features
selector_cv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
X_selected = selector_cv.fit_transform(X_train, y_train)

print(f"Optimal number of features: {selector_cv.n_features_}")

SelectFromModel

Selects features whose importance weights exceed a threshold; works with any estimator that exposes feature_importances_ or coef_, such as tree ensembles or linear models.

from sklearn.feature_selection import SelectFromModel

# Select features based on importance
selector_model = SelectFromModel(estimator, threshold='median')
X_selected = selector_model.fit_transform(X_train, y_train)
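
To see what the threshold actually kept, the fitted selector exposes both the boolean mask and the resolved threshold value; a short follow-up using the objects defined above:

# Boolean mask of the features whose importance is at or above the median
mask = selector_model.get_support()
print(f"Selected {mask.sum()} of {mask.size} features")

# The numeric threshold derived from 'median'
print(selector_model.threshold_)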

Wrapper Methods - Pros & Cons

Advantages:

  • Considers feature interactions
  • Model-specific optimization
  • Better performance
  • Accounts for model behavior

Disadvantages:

  • Computationally expensive
  • Risk of overfitting
  • Model-dependent
  • Not scalable for large datasets

4. Train-Test Split Strategy

Proper Data Splitting

Critical for unbiased evaluation of feature selection methods.

from sklearn.model_selection import train_test_split

# Stratified split to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,      # 70% train, 30% test
    random_state=42,    # Reproducibility
    stratify=y          # Maintain class balance
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Feature Selection Workflow

  1. Split data into train and test sets
  2. Apply feature selection on training data only
  3. Transform both train and test sets using fitted selector
  4. Train model on selected training features
  5. Evaluate on selected test features

# Correct workflow
selector = SelectKBest(f_classif, k=10)

# Fit selector on training data only
X_train_selected = selector.fit_transform(X_train, y_train)

# Transform test data using fitted selector
X_test_selected = selector.transform(X_test)

# Train and evaluate model
model.fit(X_train_selected, y_train)
predictions = model.predict(X_test_selected)

5. Performance Evaluation

Training vs Test Performance

Essential to evaluate both training and test performance to detect overfitting.

from sklearn.metrics import accuracy_score, classification_report

def evaluate_performance(model, X_train, X_test, y_train, y_test):
    # Train model
    model.fit(X_train, y_train)

    # Predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate accuracies
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)

    # Overfitting indicator
    overfitting_gap = train_acc - test_acc

    return {
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'overfitting_gap': overfitting_gap
    }
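
A hypothetical call, assuming the selected train and test matrices from the workflow in Section 4 and a RandomForestClassifier chosen purely as an example:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
metrics = evaluate_performance(model, X_train_selected, X_test_selected, y_train, y_test)
print(f"Train: {metrics['train_accuracy']:.3f}, "
      f"Test: {metrics['test_accuracy']:.3f}, "
      f"Gap: {metrics['overfitting_gap']:.3f}")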

Key Evaluation Metrics

  • Training Accuracy: performance on the training data. Good range: high, but not 100%
  • Test Accuracy: performance on unseen data. Good range: close to the training accuracy
  • Overfitting Gap: training accuracy minus test accuracy. Good range: < 0.05 (5%)
  • Number of Features: count of selected features. Good range: as few as possible while maintaining performance

6. Complete Implementation Example

Step-by-Step Implementation

# Complete feature selection pipeline
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load and prepare data
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Filter method - F-test
selector_filter = SelectKBest(f_classif, k=10)
X_train_filter = selector_filter.fit_transform(X_train_scaled, y_train)
X_test_filter = selector_filter.transform(X_test_scaled)

# 5. Wrapper method - RFE
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector_wrapper = RFE(estimator, n_features_to_select=10)
X_train_wrapper = selector_wrapper.fit_transform(X_train_scaled, y_train)
X_test_wrapper = selector_wrapper.transform(X_test_scaled)

# 6. Evaluate both methods
def compare_methods():
    results = {}

    # Original features
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train_scaled, y_train)
    results['Original'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_scaled)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_scaled)),
        'n_features': X_train_scaled.shape[1]
    }

    # Filter method
    model.fit(X_train_filter, y_train)
    results['Filter'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_filter)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_filter)),
        'n_features': X_train_filter.shape[1]
    }

    # Wrapper method
    model.fit(X_train_wrapper, y_train)
    results['Wrapper'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_wrapper)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_wrapper)),
        'n_features': X_train_wrapper.shape[1]
    }

    return results

# Run comparison
results = compare_methods()
for method, metrics in results.items():
    print(f"{method}: Train={metrics['train_acc']:.3f}, "
          f"Test={metrics['test_acc']:.3f}, "
          f"Features={metrics['n_features']}")

7. Method Comparison & Best Practices

When to Use Each Method

  • Filter Methods: best for large datasets and preprocessing. Dataset size: large (>10K features). Computational cost: low
  • Wrapper Methods: best for small to medium datasets where accuracy matters most. Dataset size: small to medium (<1K features). Computational cost: high
  • Embedded Methods: best when the algorithm provides built-in feature selection. Dataset size: any. Computational cost: medium

Best Practices

  • Always split data first: Apply feature selection on training data only
  • Use cross-validation: For robust feature selection (RFECV); see the pipeline sketch after this list
  • Consider domain knowledge: Don't rely solely on statistical measures
  • Monitor overfitting: Check training vs test performance
  • Try multiple methods: Compare filter and wrapper approaches
  • Scale features: Especially important for distance-based methods
  • Validate results: Use different train-test splits
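
One convenient way to honor the first two practices is to wrap the selector and the model in a scikit-learn Pipeline, so that feature selection is refitted inside every cross-validation fold and never sees the held-out fold. A minimal sketch, assuming X_train and y_train from the split in Section 4:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Selection and classification are refit together in each CV fold,
# so no information from the validation fold leaks into the selector
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', RandomForestClassifier(random_state=42))
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")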

Common Pitfalls to Avoid

  • ❌ Applying feature selection on entire dataset before splitting
  • ❌ Using test data for feature selection
  • ❌ Ignoring feature scaling
  • ❌ Selecting too few features (underfitting)
  • ❌ Not validating feature selection stability
  • ❌ Focusing only on accuracy without considering interpretability