Data Preprocessing in Machine Learning
Essential steps to prepare your data for machine learning algorithms
Introduction to Data Preprocessing
Data preprocessing is the process of transforming raw data into a format that can be effectively used by machine learning algorithms. It's often considered the most crucial step in the machine learning pipeline, as the quality of your data directly impacts the performance of your model.
What is Data Preprocessing?
Data preprocessing involves several techniques to clean, transform, and prepare data:
- Cleaning data (handling missing values, outliers)
- Transforming data (encoding, scaling)
- Reducing data (feature selection, dimensionality reduction)
- Splitting data (train/test/validation sets)
Why is Data Preprocessing Needed?
1. Real-world Data is Messy
- Missing Values: Incomplete records due to data collection issues
- Inconsistent Formats: Different date formats, case sensitivity
- Outliers: Extreme values that can skew results
- Noise: Random errors in data collection
2. Algorithm Requirements
- Numerical Input: Most ML algorithms require numerical data
- Scale Sensitivity: Algorithms such as KNN and SVM are sensitive to feature scales
- Distribution Assumptions: Some algorithms assume normal distribution
3. Performance Optimization
- Faster Training: Clean data trains faster
- Better Accuracy: Proper preprocessing improves model performance
- Reduced Overfitting: Good data helps models generalize better
Getting the Dataset
Common Data Sources
- CSV Files: Most common format for structured data
- Databases: SQL databases, NoSQL databases
- APIs: Web APIs, REST services
- Web Scraping: Extracting data from websites
- Public Datasets: Kaggle, UCI ML Repository, government data
Sample Dataset Structure
For this lecture, we'll use a sample customer dataset:
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
Importing Libraries
Essential Python Libraries for Data Preprocessing
import pandas as pd
import numpy as np
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Machine learning preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')
Library Functions Overview
- pandas: Data manipulation, reading/writing files
- numpy: Numerical operations, array handling
- sklearn: Machine learning preprocessing tools
- matplotlib/seaborn: Data visualization
Importing Dataset
Reading Data with Pandas
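A minimal sketch, assuming the sample data above has been saved as 'Data.csv' in the working directory (the filename and the variable name df are assumptions used throughout the examples below):

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('Data.csv')  # 'Data.csv' is an assumed filename

# Preview the first five rows
print(df.head())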
Exploring the Dataset
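A few standard pandas calls for a first look at the data, continuing from the df loaded above:

# Dimensions: (number of rows, number of columns)
print(df.shape)

# Column names, data types, and non-null counts
df.info()

# Summary statistics for the numerical columns (Age, Salary)
print(df.describe())

# Count missing values per column
print(df.isnull().sum())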
Separating Features and Target
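In the sample dataset, Country, Age, and Salary are the features and Purchased is the target. A sketch using positional indexing (X and y are conventional names, not required by any library):

# All columns except the last are features
X = df.iloc[:, :-1].values

# The last column (Purchased) is the target
y = df.iloc[:, -1].values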
Handling Missing Data
Why Missing Data Occurs
- Data collection errors: Sensor failures, human errors
- Privacy concerns: Users not providing sensitive information
- System issues: Database corruption, network problems
- Survey non-response: Participants skipping questions
Types of Missing Data
- MCAR (Missing Completely at Random): Missing values are random
- MAR (Missing at Random): Missing depends on observed data
- MNAR (Missing Not at Random): Missing depends on unobserved data
Strategies for Handling Missing Data
1. Deletion Methods
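A sketch of deletion with pandas. Deletion is only reasonable when few values are missing, because it throws information away:

# Drop any row that contains a missing value
df_rows_dropped = df.dropna()

# Drop any column that contains a missing value
df_cols_dropped = df.dropna(axis=1)

# Drop rows only when a specific column is missing
df_age_known = df.dropna(subset=['Age'])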
2. Imputation Methods
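A sketch of mean imputation with SimpleImputer (imported above), applied to the numerical columns Age and Salary (positions 1 and 2 of X):

# Replace missing numerical values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])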
3. Different Imputation Strategies
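SimpleImputer supports several strategies; the median is more robust to outliers, and most_frequent or a constant placeholder suits categorical columns. A sketch:

# Median: robust choice for skewed numerical data
median_imputer = SimpleImputer(strategy='median')

# Most frequent value: usable for categorical columns such as Country
mode_imputer = SimpleImputer(strategy='most_frequent')

# Constant fill value, e.g. an explicit 'Missing' placeholder
constant_imputer = SimpleImputer(strategy='constant', fill_value='Missing')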
Encoding Categorical Data
Why Encode Categorical Data?
Most machine learning algorithms operate on numerical input, so categorical data (text labels) must be converted to a numerical format before training.
Types of Categorical Data
- Nominal: No inherent order (Country: France, Spain, Germany)
- Ordinal: Has order (Education: High School < Bachelor < Master < PhD)
1. Label Encoding
Converts categories to integers (0, 1, 2, ...). Because the integers imply an order, label encoding is best suited to ordinal features or to the target variable.
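A sketch using LabelEncoder (imported above) on the Purchased target; for ordinal features, a manual mapping that fixes the intended order is often safer:

# Encode the target: 'No' -> 0, 'Yes' -> 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Hypothetical manual mapping for an ordinal Education feature (not in the sample data)
# education_order = {'High School': 0, 'Bachelor': 1, 'Master': 2, 'PhD': 3}
# df['Education'] = df['Education'].map(education_order)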
2. One-Hot Encoding
Creates binary columns for each category (recommended for nominal data)
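A sketch that one-hot encodes the Country column (position 0 of X). ColumnTransformer is one common way to apply the encoder to a single column; it requires an extra import beyond the list above:

from sklearn.compose import ColumnTransformer

# One-hot encode column 0 (Country) and keep the remaining columns unchanged
ct = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(), [0])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X))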
3. Manual One-Hot Encoding with Pandas
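The same idea directly on the DataFrame with pandas:

# One binary column per country; drop_first=True removes one redundant column
df_encoded = pd.get_dummies(df, columns=['Country'], drop_first=True)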
Splitting Dataset into Train-Test Sets
Why Split the Data?
- Training Set: Used to train the model
- Test Set: Used to evaluate model performance on unseen data
- Validation Set: Used for hyperparameter tuning (optional)
Common Split Ratios
- 80-20: 80% training, 20% testing
- 70-30: 70% training, 30% testing
- 60-20-20: 60% training, 20% validation, 20% testing
Implementation
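A sketch of an 80-20 split with train_test_split (imported above); random_state fixes the shuffle so the split is reproducible:

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the samples go to the test set
    random_state=42    # fixed seed for reproducibility
)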
Advanced Splitting Techniques
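Two common refinements, sketched under the same assumptions: a stratified split that preserves the Yes/No ratio of y, and a 60-20-20 three-way split obtained by splitting twice:

# Stratified split: train and test keep the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 60-20-20 split: hold out 20% for test, then take 25% of the remainder for validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)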
Feature Scaling
What is Feature Scaling?
Feature scaling transforms features to similar scales to prevent features with larger ranges from dominating the learning process.
When is Feature Scaling Needed?
- Distance-based algorithms: KNN, K-means, SVM
- Gradient-based algorithms: Neural networks, logistic regression
- Regularized algorithms: Ridge, Lasso regression
When Feature Scaling is NOT Needed
- Tree-based algorithms: Decision trees, Random Forest, XGBoost
- Naive Bayes
Example: Why Scaling Matters
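A small illustration with two customers described by Age and Salary: without scaling, the salary difference dominates any distance computation, so Age contributes almost nothing:

# Two customers as (age, salary) vectors
a = np.array([44, 72000])
b = np.array([27, 48000])

# Euclidean distance: sqrt(17**2 + 24000**2) ~ 24000, driven almost entirely by salary
print(np.linalg.norm(a - b))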
Types of Feature Scaling
There are two main approaches: standardization and normalization (min-max scaling).
Standardization (Z-score Normalization)
Formula
z = (x - μ) / σ
- μ = mean of the feature
- σ = standard deviation of the feature
- Result: mean = 0, standard deviation = 1
When to Use Standardization
- Data follows normal distribution
- Features have different units (age vs salary)
- Presence of outliers (standardization is less sensitive to them than min-max scaling)
- Most common choice for feature scaling
Implementation
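A sketch with StandardScaler (imported above), following the fit-transform rule: fit on the training set only, then apply the same transform to the test set. Only the numerical columns are scaled; after one-hot encoding they are assumed to be the last two columns:

scaler = StandardScaler()

# Fit on training data, transform both sets with the training statistics
X_train[:, -2:] = scaler.fit_transform(X_train[:, -2:])
X_test[:, -2:] = scaler.transform(X_test[:, -2:])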
Manual Standardization
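The same formula applied by hand to the Age column (note that pandas' std() uses the sample standard deviation, so results can differ very slightly from StandardScaler):

# z = (x - mean) / std
age = df['Age']
age_standardized = (age - age.mean()) / age.std()
print(age_standardized.head())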
Normalization (Min-Max Scaling)
Formula
x_norm = (x - min) / (max - min)
- Result: values between 0 and 1
- Preserves the original distribution shape
- All features have the same scale [0, 1]
When to Use Normalization
- Data doesn't follow normal distribution
- You want to preserve the original distribution
- Neural networks (often prefer [0, 1] range)
- When you know the approximate upper and lower bounds
Implementation
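A sketch with MinMaxScaler (imported above), shown as an alternative to the StandardScaler above and again fit on the training data only:

min_max_scaler = MinMaxScaler()  # default range is [0, 1]

X_train[:, -2:] = min_max_scaler.fit_transform(X_train[:, -2:])
X_test[:, -2:] = min_max_scaler.transform(X_test[:, -2:])
# Test values outside the training min/max can fall slightly outside [0, 1]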
Custom Range Normalization
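MinMaxScaler accepts a feature_range argument when a range other than [0, 1] is wanted, e.g. [-1, 1]:

# Scale features to [-1, 1] instead of the default [0, 1]
custom_scaler = MinMaxScaler(feature_range=(-1, 1))
X_train[:, -2:] = custom_scaler.fit_transform(X_train[:, -2:])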
Standardization vs Normalization Comparison
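A quick side-by-side sketch on the Salary column, to see how the two transforms treat the same values:

salary = df[['Salary']].dropna()

print(StandardScaler().fit_transform(salary)[:3])  # mean 0, std 1, unbounded range
print(MinMaxScaler().fit_transform(salary)[:3])    # bounded to [0, 1], shape preserved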
Summary
Complete Data Preprocessing Pipeline
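A condensed end-to-end sketch tying the steps together on the sample dataset (filename, column positions, and the split-before-fitting order are assumptions consistent with the advice below):

from sklearn.compose import ColumnTransformer

# 1. Load data and separate features and target
df = pd.read_csv('Data.csv')                      # assumed filename
X = df.iloc[:, :-1].values
y = LabelEncoder().fit_transform(df.iloc[:, -1].values)

# 2. Split first so every transform is fit on training data only (avoids leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Impute missing Age/Salary values with training-set means
imputer = SimpleImputer(strategy='mean')
X_train[:, 1:3] = imputer.fit_transform(X_train[:, 1:3])
X_test[:, 1:3] = imputer.transform(X_test[:, 1:3])

# 4. One-hot encode Country, fitting the encoder on the training data
ct = ColumnTransformer([('onehot', OneHotEncoder(handle_unknown='ignore'), [0])],
                       remainder='passthrough')
X_train = np.array(ct.fit_transform(X_train))
X_test = np.array(ct.transform(X_test))

# 5. Standardize the numerical columns (the last two after encoding)
scaler = StandardScaler()
X_train[:, -2:] = scaler.fit_transform(X_train[:, -2:])
X_test[:, -2:] = scaler.transform(X_test[:, -2:])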
Key Takeaways
- Data Quality: Good preprocessing is crucial for model performance
- Missing Data: Choose appropriate strategy based on data type and missingness pattern
- Categorical Encoding: Use label encoding for ordinal, one-hot for nominal data
- Train-Test Split: Always split before fitting preprocessing transforms to avoid data leakage
- Feature Scaling: Choose standardization or normalization based on data distribution and algorithm requirements
- Fit-Transform Rule: Fit on training data, transform both training and test data
Best Practices
- Understand your data first (EDA - Exploratory Data Analysis)
- Handle missing data appropriately
- Choose encoding methods based on data type
- Split data before fitting any preprocessing transforms
- Scale features when necessary
- Document your preprocessing steps
- Create reusable preprocessing pipelines