Feature Selection Methods

Filter Methods, Wrapper Methods, Train-Test Split & Performance Evaluation

1. Introduction to Feature Selection

What is Feature Selection?

Feature selection is the process of selecting a subset of relevant features for model construction. It's crucial for:

  • Reducing overfitting: Fewer features = less complexity
  • Improving performance: Remove noise and irrelevant features
  • Reducing training time: Less data to process
  • Better interpretability: Focus on important features
  • Storage efficiency: Less memory requirements

Types of Feature Selection Methods

Filter Methods

Select features based on statistical measures independent of any machine learning algorithm

  • Fast and scalable
  • Model-independent
  • Good for preprocessing

Wrapper Methods

Use machine learning algorithms to evaluate feature subsets

  • Model-specific
  • More accurate
  • Computationally expensive

Embedded Methods

Feature selection occurs as part of the model's own training process (see the sketch after this list)

  • Built into algorithms
  • Efficient
  • Algorithm-specific
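
Embedded methods do not get a dedicated section below, so as a point of comparison here is a minimal sketch using L1-regularized logistic regression, whose training drives some coefficients to exactly zero; X_train and y_train are assumed to be defined as in the later examples.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# L1 penalty drives some coefficients to exactly zero during training,
# so zero-weight features are effectively dropped by the model itself
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

# Keep only the features with non-zero coefficients
selector_embedded = SelectFromModel(l1_model)
X_selected = selector_embedded.fit_transform(X_train, y_train)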

2. Filter Methods

Univariate Statistical Tests

1. F-test (ANOVA F-statistic)

Measures the linear dependency between each feature and the target; for classification (f_classif), the ANOVA F-statistic tests whether the feature's mean differs across classes.

from sklearn.feature_selection import SelectKBest, f_classif

# Select top 10 features using F-test
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
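
After fitting, the selector also exposes the per-feature scores and p-values, which can be useful for reporting; a short follow-up using the selector fitted above:

# Per-feature F-scores and p-values from the fitted selector
print(selector.scores_)
print(selector.pvalues_)

# Boolean mask marking the k selected columns
mask = selector.get_support()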

2. Chi-Square Test

Tests independence between categorical features and target (for non-negative features).

from sklearn.feature_selection import chi2

# Chi-square test for categorical features
selector_chi2 = SelectKBest(score_func=chi2, k=10)
X_selected = selector_chi2.fit_transform(X_train, y_train)

3. Mutual Information

Measures mutual dependence between features and target (captures non-linear relationships).

from sklearn.feature_selection import mutual_info_classif

# Mutual information for classification
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector_mi.fit_transform(X_train, y_train)

Filter Methods - Pros & Cons

Advantages:

  • Fast computation
  • Scalable to large datasets
  • Model-independent
  • Good for preprocessing

Disadvantages:

  • Ignores feature interactions
  • May select redundant features
  • No consideration of model performance

3. Wrapper Methods

Recursive Feature Elimination (RFE)

Repeatedly fits the estimator, removes the least important feature(s), and refits on the remaining ones until the desired number of features is reached.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# RFE with Random Forest
estimator = RandomForestClassifier(n_estimators=100)
selector = RFE(estimator, n_features_to_select=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get the names of the selected features
# (assumes feature_names is an array of the original column names)
selected_features = feature_names[selector.get_support()]

RFE with Cross-Validation (RFECV)

Automatically finds the optimal number of features using cross-validation.

from sklearn.feature_selection import RFECV

# RFECV automatically selects optimal number of features
selector_cv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
X_selected = selector_cv.fit_transform(X_train, y_train)

print(f"Optimal number of features: {selector_cv.n_features_}")

SelectFromModel

Selects features whose importance weights exceed a threshold; works with any estimator that exposes feature_importances_ or coef_, such as tree ensembles or linear models.

from sklearn.feature_selection import SelectFromModel

# Select features based on importance
selector_model = SelectFromModel(estimator, threshold='median')
X_selected = selector_model.fit_transform(X_train, y_train)
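
To see what the threshold actually kept, the fitted selector exposes both the boolean mask and the resolved threshold value; a short follow-up using the objects defined above:

# Boolean mask of the features whose importance is at or above the median
mask = selector_model.get_support()
print(f"Selected {mask.sum()} of {mask.size} features")

# The numeric threshold derived from 'median'
print(selector_model.threshold_)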

Wrapper Methods - Pros & Cons

Advantages:

  • Considers feature interactions
  • Model-specific optimization
  • Better performance
  • Accounts for model behavior

Disadvantages:

  • Computationally expensive
  • Risk of overfitting
  • Model-dependent
  • Not scalable for large datasets

4. Train-Test Split Strategy

Proper Data Splitting

Critical for unbiased evaluation of feature selection methods.

from sklearn.model_selection import train_test_split

# Stratified split to maintain class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,      # 70% train, 30% test
    random_state=42,    # Reproducibility
    stratify=y          # Maintain class balance
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Feature Selection Workflow

  1. Split data into train and test sets
  2. Apply feature selection on training data only
  3. Transform both train and test sets using fitted selector
  4. Train model on selected training features
  5. Evaluate on selected test features

# Correct workflow
selector = SelectKBest(f_classif, k=10)

# Fit selector on training data only
X_train_selected = selector.fit_transform(X_train, y_train)

# Transform test data using fitted selector
X_test_selected = selector.transform(X_test)

# Train and evaluate model
model.fit(X_train_selected, y_train)
predictions = model.predict(X_test_selected)

5. Performance Evaluation

Training vs Test Performance

Essential to evaluate both training and test performance to detect overfitting.

from sklearn.metrics import accuracy_score, classification_report

def evaluate_performance(model, X_train, X_test, y_train, y_test):
    # Train model
    model.fit(X_train, y_train)

    # Predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate accuracies
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)

    # Overfitting indicator
    overfitting_gap = train_acc - test_acc

    return {
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'overfitting_gap': overfitting_gap
    }
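
A hypothetical call, assuming the selected train and test matrices from the workflow in Section 4 and a RandomForestClassifier chosen purely as an example:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
metrics = evaluate_performance(model, X_train_selected, X_test_selected, y_train, y_test)
print(f"Train: {metrics['train_accuracy']:.3f}, "
      f"Test: {metrics['test_accuracy']:.3f}, "
      f"Gap: {metrics['overfitting_gap']:.3f}")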

Key Evaluation Metrics

  • Training Accuracy: performance on the training data. Good range: high, but not 100%
  • Test Accuracy: performance on unseen data. Good range: close to the training accuracy
  • Overfitting Gap: training accuracy minus test accuracy. Good range: < 0.05 (5%)
  • Number of Features: count of selected features. Good range: as few as possible while maintaining performance

6. Complete Implementation Example

Step-by-Step Implementation

# Complete feature selection pipeline
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 1. Load and prepare data
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Filter method - F-test
selector_filter = SelectKBest(f_classif, k=10)
X_train_filter = selector_filter.fit_transform(X_train_scaled, y_train)
X_test_filter = selector_filter.transform(X_test_scaled)

# 5. Wrapper method - RFE
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
selector_wrapper = RFE(estimator, n_features_to_select=10)
X_train_wrapper = selector_wrapper.fit_transform(X_train_scaled, y_train)
X_test_wrapper = selector_wrapper.transform(X_test_scaled)

# 6. Evaluate both methods
def compare_methods():
    results = {}

    # Original features
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train_scaled, y_train)
    results['Original'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_scaled)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_scaled)),
        'n_features': X_train_scaled.shape[1]
    }

    # Filter method
    model.fit(X_train_filter, y_train)
    results['Filter'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_filter)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_filter)),
        'n_features': X_train_filter.shape[1]
    }

    # Wrapper method
    model.fit(X_train_wrapper, y_train)
    results['Wrapper'] = {
        'train_acc': accuracy_score(y_train, model.predict(X_train_wrapper)),
        'test_acc': accuracy_score(y_test, model.predict(X_test_wrapper)),
        'n_features': X_train_wrapper.shape[1]
    }

    return results

# Run comparison
results = compare_methods()
for method, metrics in results.items():
    print(f"{method}: Train={metrics['train_acc']:.3f}, "
          f"Test={metrics['test_acc']:.3f}, "
          f"Features={metrics['n_features']}")

7. Method Comparison & Best Practices

When to Use Each Method

  • Filter Methods: best for large datasets and preprocessing. Dataset size: large (>10K features). Computational cost: low
  • Wrapper Methods: best for small to medium datasets where accuracy matters most. Dataset size: small to medium (<1K features). Computational cost: high
  • Embedded Methods: best when the algorithm provides built-in feature selection. Dataset size: any. Computational cost: medium

Best Practices

  • Always split data first: Apply feature selection on training data only
  • Use cross-validation: For robust feature selection (RFECV); see the pipeline sketch after this list
  • Consider domain knowledge: Don't rely solely on statistical measures
  • Monitor overfitting: Check training vs test performance
  • Try multiple methods: Compare filter and wrapper approaches
  • Scale features: Especially important for distance-based methods
  • Validate results: Use different train-test splits
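
One convenient way to honor the first two practices is to wrap the selector and the model in a scikit-learn Pipeline, so that feature selection is refitted inside every cross-validation fold and never sees the held-out fold. A minimal sketch, assuming X_train and y_train from the split in Section 4:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Selection and classification are refit together in each CV fold,
# so no information from the validation fold leaks into the selector
pipe = Pipeline([
    ('select', SelectKBest(f_classif, k=10)),
    ('clf', RandomForestClassifier(random_state=42))
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")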

Common Pitfalls to Avoid

  • ❌ Applying feature selection on entire dataset before splitting
  • ❌ Using test data for feature selection
  • ❌ Ignoring feature scaling
  • ❌ Selecting too few features (underfitting)
  • ❌ Not validating feature selection stability
  • ❌ Focusing only on accuracy without considering interpretability