1. Introduction to Feature Selection
What is Feature Selection?
Feature selection is the process of selecting a subset of relevant features for model construction. It's crucial for:
- Reducing overfitting: Fewer features = less complexity
- Improving performance: Remove noise and irrelevant features
- Reducing training time: Less data to process
- Better interpretability: Focus on important features
- Storage efficiency: Lower memory requirements
Types of Feature Selection Methods
Filter Methods
Select features based on statistical measures independent of any machine learning algorithm
- Fast and scalable
- Model-independent
- Good for preprocessing
Wrapper Methods
Use machine learning algorithms to evaluate feature subsets
- Model-specific
- More accurate
- Computationally expensive
Embedded Methods
Feature selection occurs during model training
- Built into algorithms
- Efficient
- Algorithm-specific
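Embedded methods are not revisited later in these notes, so here is a minimal sketch of one: an L1 (lasso) penalty drives weak coefficients to exactly zero while the model trains, and SelectFromModel keeps only the surviving features. X_train and y_train are assumed from the split in Section 4, and the C value is illustrative.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1 penalty zeroes out weak coefficients during training itself
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector_l1 = SelectFromModel(l1_model)
X_selected = selector_l1.fit_transform(X_train, y_train)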
2. Filter Methods
Univariate Statistical Tests
1. F-test (ANOVA F-statistic)
Measures the linear dependency between each feature and the target (for classification, the ANOVA F-value between each feature and the class labels).
# Select top 10 features using the F-test
from sklearn.feature_selection import SelectKBest, f_classif, chi2, mutual_info_classif

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_train, y_train)
2. Chi-Square Test
Tests whether each feature and the target are independent; scikit-learn's chi2 requires non-negative feature values (e.g., counts or frequencies).
# Chi-square test for categorical features
selector_chi2 = SelectKBest(score_func=chi2, k=10)
X_selected = selector_chi2.fit_transform(X_train, y_train)
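Because chi2 raises an error on negative values, one common workaround for numeric features is to rescale them to [0, 1] first. A sketch under that assumption; keep in mind chi2 is really intended for counts and frequencies, so this is a pragmatic fix, not a statistically rigorous one:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# Rescale to [0, 1] so every value is non-negative before chi2
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
selector_chi2 = SelectKBest(score_func=chi2, k=10)
X_selected = selector_chi2.fit_transform(X_train_scaled, y_train)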
3. Mutual Information
Measures mutual dependence between features and target (captures non-linear relationships).
# Mutual information for classification
selector_mi = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector_mi.fit_transform(X_train, y_train)
Filter Methods - Pros & Cons
Advantages:
- Fast computation
- Scalable to large datasets
- Model-independent
- Good for preprocessing
Disadvantages:
- Ignores feature interactions
- May select redundant features
- No consideration of model performance
3. Wrapper Methods
Recursive Feature Elimination (RFE)
Repeatedly trains the model and removes the least important feature(s) until the desired number of features remains.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# RFE with Random Forest
estimator = RandomForestClassifier(n_estimators=100)
selector = RFE(estimator, n_features_to_select=10)
X_selected = selector.fit_transform(X_train, y_train)

# Get selected features (feature_names: array of column names for X_train)
selected_features = feature_names[selector.get_support()]
RFE with Cross-Validation (RFECV)
Automatically finds the optimal number of features using cross-validation.
# RFECV automatically selects the optimal number of features
from sklearn.feature_selection import RFECV

selector_cv = RFECV(estimator, step=1, cv=5, scoring='accuracy')
X_selected = selector_cv.fit_transform(X_train, y_train)
print(f"Optimal number of features: {selector_cv.n_features_}")
SelectFromModel
Selects features whose importance weights exceed a threshold; works with any fitted estimator exposing feature_importances_ (tree ensembles) or coef_ (linear models).
# Select features whose importance is above the median
from sklearn.feature_selection import SelectFromModel

selector_model = SelectFromModel(estimator, threshold='median')
X_selected = selector_model.fit_transform(X_train, y_train)
Wrapper Methods - Pros & Cons
Advantages:
- Considers feature interactions
- Model-specific optimization
- Often better predictive performance
- Accounts for model behavior
Disadvantages:
- Computationally expensive
- Risk of overfitting
- Model-dependent
- Not scalable for large datasets
4. Train-Test Split Strategy
Proper Data Splitting
Critical for unbiased evaluation of feature selection methods.
# Stratified split to maintain class distribution
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70% train, 30% test
    random_state=42,   # Reproducibility
    stratify=y         # Maintain class balance
)
print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
Feature Selection Workflow
1. Split data into train and test sets
2. Apply feature selection on training data only
3. Transform both train and test sets using the fitted selector
4. Train model on selected training features
5. Evaluate on selected test features
# Fit selector on training data only
selector = SelectKBest(f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)

# Transform test data using the already-fitted selector
X_test_selected = selector.transform(X_test)

# Train and evaluate model (any classifier works; logistic regression shown)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000)
model.fit(X_train_selected, y_train)
predictions = model.predict(X_test_selected)
5. Performance Evaluation
Training vs Test Performance
Essential to evaluate both training and test performance to detect overfitting.
from sklearn.metrics import accuracy_score

def evaluate_performance(model, X_train, X_test, y_train, y_test):
    # Train model
    model.fit(X_train, y_train)

    # Predictions
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Calculate accuracies
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)

    # Overfitting indicator: a large gap signals overfitting
    overfitting_gap = train_acc - test_acc

    return {
        'train_accuracy': train_acc,
        'test_accuracy': test_acc,
        'overfitting_gap': overfitting_gap,
    }
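A quick usage sketch, reusing the model and the selected feature matrices from the workflow in Section 4:
results = evaluate_performance(model, X_train_selected, X_test_selected, y_train, y_test)
print(f"Train accuracy:  {results['train_accuracy']:.3f}")
print(f"Test accuracy:   {results['test_accuracy']:.3f}")
print(f"Overfitting gap: {results['overfitting_gap']:.3f}")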
Key Evaluation Metrics
| Metric | Description | Good Range |
|---|---|---|
| Training Accuracy | Performance on training data | High (but not 100%) |
| Test Accuracy | Performance on unseen data | Close to training accuracy |
| Overfitting Gap | Train accuracy - Test accuracy | < 0.05 (5%) |
| Number of Features | Selected feature count | Minimal while maintaining performance |
6. Complete Implementation Example
Step-by-Step Implementation
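A minimal end-to-end sketch tying the previous steps together. The breast-cancer dataset bundled with scikit-learn is used as a stand-in here; any tabular classification dataset works the same way.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Load data (breast-cancer set as a stand-in) and split FIRST
data = load_breast_cancer()
X, y, feature_names = data.data, data.target, data.feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 2. Fit the selector on the training data only
selector = SelectKBest(score_func=f_classif, k=10)
X_train_selected = selector.fit_transform(X_train, y_train)

# 3. Transform the test set with the already-fitted selector
X_test_selected = selector.transform(X_test)
print("Selected features:", feature_names[selector.get_support()])

# 4. Scale, then train a model on the selected training features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_selected)
X_test_scaled = scaler.transform(X_test_selected)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# 5. Evaluate train vs. test to check the overfitting gap
train_acc = accuracy_score(y_train, model.predict(X_train_scaled))
test_acc = accuracy_score(y_test, model.predict(X_test_scaled))
print(f"Train accuracy:  {train_acc:.3f}")
print(f"Test accuracy:   {test_acc:.3f}")
print(f"Overfitting gap: {train_acc - test_acc:.3f}")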
7. Method Comparison & Best Practices
When to Use Each Method
| Method | Best For | Dataset Size | Computational Cost |
|---|---|---|---|
| Filter Methods | Large datasets, preprocessing | Large (>10K features) | Low |
| Wrapper Methods | Small-medium datasets, high accuracy | Small-Medium (<1K features) | High |
| Embedded Methods | Built-in feature selection | Any | Medium |
Best Practices
- Always split data first: Apply feature selection on training data only
- Use cross-validation: For robust feature selection (RFECV)
- Consider domain knowledge: Don't rely solely on statistical measures
- Monitor overfitting: Check training vs test performance
- Try multiple methods: Compare filter and wrapper approaches
- Scale features: Especially important for distance-based methods
- Validate results: Check selection stability across different train-test splits (see the sketch after this list)
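A minimal sketch of such a stability check: rerun the selector over several random splits and count how often each feature is kept. It assumes the X and y arrays from the example in Section 6; n_runs and k are illustrative values.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif

# Count how often each feature survives selection across random splits
n_runs, k = 10, 10
counts = np.zeros(X.shape[1])
for seed in range(n_runs):
    X_tr, _, y_tr, _ = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    selector = SelectKBest(score_func=f_classif, k=k)
    selector.fit(X_tr, y_tr)
    counts += selector.get_support()

# Features selected in (almost) every run are stable choices
stability = counts / n_runs
print("Per-feature selection frequency:", stability)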
Common Pitfalls to Avoid
- ❌ Applying feature selection on the entire dataset before splitting
- ❌ Using test data for feature selection
- ❌ Ignoring feature scaling
- ❌ Selecting too few features (underfitting)
- ❌ Not validating feature selection stability
- ❌ Focusing only on accuracy without considering interpretability