Data Preprocessing in Machine Learning

Essential steps to prepare your data for machine learning algorithms

Introduction to Data Preprocessing

Data preprocessing is the process of transforming raw data into a format that can be effectively used by machine learning algorithms. It's often considered the most crucial step in the machine learning pipeline, as the quality of your data directly impacts the performance of your model.

Key Point: "Garbage in, garbage out" - Poor quality data leads to poor model performance, regardless of the algorithm used.

What is Data Preprocessing?

Data preprocessing involves several techniques to clean, transform, and prepare data:

  • Cleaning data (handling missing values, outliers)
  • Transforming data (encoding, scaling)
  • Reducing data (feature selection, dimensionality reduction)
  • Splitting data (train/test/validation sets)

Why is Data Preprocessing Needed?

1. Real-world Data is Messy

  • Missing Values: Incomplete records due to data collection issues
  • Inconsistent Formats: Different date formats, case sensitivity
  • Outliers: Extreme values that can skew results
  • Noise: Random errors in data collection

2. Algorithm Requirements

  • Numerical Input: Most ML algorithms require numerical data
  • Scale Sensitivity: Algorithms such as KNN and SVM are sensitive to feature scales (see the sketch after this list)
  • Distribution Assumptions: Some algorithms assume normal distribution
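
Scale sensitivity is easy to see with a quick distance calculation; this is a minimal sketch with made-up values (not from the customers dataset):

import numpy as np

# Two customers: 10 years apart in age, 1000 apart in salary (made-up values)
a = np.array([30.0, 50000.0])
b = np.array([40.0, 51000.0])

# The Euclidean distance is dominated by the salary difference;
# the 10-year age gap contributes almost nothing
print(np.linalg.norm(a - b))  # ~1000.05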

3. Performance Optimization

  • Faster Training: Clean data trains faster
  • Better Accuracy: Proper preprocessing improves model performance
  • Reduced Overfitting: Good data helps models generalize better
Warning: Skipping preprocessing can lead to biased models, poor performance, and unreliable predictions.

Getting the Dataset

Common Data Sources

  • CSV Files: Most common format for structured data
  • Databases: SQL databases, NoSQL databases
  • APIs: Web APIs, REST services
  • Web Scraping: Extracting data from websites
  • Public Datasets: Kaggle, UCI ML Repository, government data
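
Whichever source you use, pandas can usually load the result into a DataFrame. Below is a minimal sketch; the file name, database, table name, and URL are hypothetical placeholders:

import pandas as pd

# CSV file (hypothetical path)
df_csv = pd.read_csv('data/customers.csv')

# SQL database (hypothetical SQLite file and table)
import sqlite3
conn = sqlite3.connect('customers.db')
df_sql = pd.read_sql('SELECT * FROM customers', conn)

# Web API returning JSON (hypothetical URL)
import requests
response = requests.get('https://example.com/api/customers')
df_api = pd.DataFrame(response.json())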

Sample Dataset Structure

For this lecture, we'll use a sample customer dataset:

# Sample dataset: customers.csv
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes

Importing Libraries

Essential Python Libraries for Data Preprocessing

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

Library Functions Overview

  • pandas: Data manipulation, reading/writing files
  • numpy: Numerical operations, array handling
  • sklearn: Machine learning preprocessing tools
  • matplotlib/seaborn: Data visualization
Installation: pip install pandas numpy scikit-learn matplotlib seaborn

Importing Dataset

Reading Data with Pandas

# Import the dataset
dataset = pd.read_csv('customers.csv')

# Display basic information about the dataset
print("Dataset shape:", dataset.shape)

print("\nFirst 5 rows:")
print(dataset.head())

print("\nDataset info:")
dataset.info()  # info() prints directly and returns None

print("\nStatistical summary:")
print(dataset.describe())

Exploring the Dataset

# Check for missing values
print("Missing values per column:")
print(dataset.isnull().sum())

# Check data types
print("\nData types:")
print(dataset.dtypes)

# Check unique values in categorical columns
print("\nUnique countries:", dataset['Country'].unique())
print("Unique purchased values:", dataset['Purchased'].unique())

Separating Features and Target

# Separate independent variables (features) and dependent variable (target)
X = dataset.iloc[:, :-1].values  # All columns except the last one
y = dataset.iloc[:, -1].values   # Last column (target variable)

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeatures (first 5 rows):")
print(X[:5])
print("\nTarget (first 5 values):")
print(y[:5])

Handling Missing Data

Why Missing Data Occurs

  • Data collection errors: Sensor failures, human errors
  • Privacy concerns: Users not providing sensitive information
  • System issues: Database corruption, network problems
  • Survey non-response: Participants skipping questions

Types of Missing Data

  • MCAR (Missing Completely at Random): Missingness is unrelated to any values, observed or unobserved
  • MAR (Missing at Random): Missingness depends only on observed data
  • MNAR (Missing Not at Random): Missingness depends on the unobserved (missing) values themselves

Strategies for Handling Missing Data

1. Deletion Methods

# Remove rows with any missing values
dataset_dropna = dataset.dropna()
print("Original shape:", dataset.shape)
print("After dropping NaN:", dataset_dropna.shape)

# Remove columns with missing values
dataset_drop_cols = dataset.dropna(axis=1)
print("After dropping columns with NaN:", dataset_drop_cols.shape)

2. Imputation Methods

# Using SimpleImputer for numerical data
from sklearn.impute import SimpleImputer

# Create imputer for numerical columns (Age, Salary)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform numerical columns (columns 1 and 2: Age and Salary)
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print("After imputation (first 10 rows):")
print(X[:10])

3. Different Imputation Strategies

# Mean imputation (for numerical data)
imputer_mean = SimpleImputer(strategy='mean')

# Median imputation (robust to outliers)
imputer_median = SimpleImputer(strategy='median')

# Mode imputation (for categorical data)
imputer_mode = SimpleImputer(strategy='most_frequent')

# Constant imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)

# Example: Using median for the salary column
X_copy = X.copy()
X_copy[:, 2:3] = imputer_median.fit_transform(X_copy[:, 2:3])
Important: Always fit the imputer on training data only, then transform both training and test data to avoid data leakage.

Encoding Categorical Data

Why Encode Categorical Data?

Most machine learning algorithms work only with numerical data, so categorical data (text labels) must be converted to a numerical format.

Types of Categorical Data

  • Nominal: No inherent order (Country: France, Spain, Germany)
  • Ordinal: Has an inherent order (Education: High School < Bachelor < Master < PhD); see the sketch after this list
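
For ordinal features, scikit-learn's OrdinalEncoder can encode categories with an explicit order. A minimal sketch using the hypothetical Education example above (this column is not part of the customers dataset):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with an explicit low-to-high order
education = np.array([['Bachelor'], ['High School'], ['PhD'], ['Master']])

encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
print(encoder.fit_transform(education).ravel())  # [1. 0. 3. 2.]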

1. Label Encoding

Converts categories to integers (0, 1, 2, ...)

from sklearn.preprocessing import LabelEncoder

# Encode the target variable (Purchased: No=0, Yes=1)
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)

print("Original target values:", dataset['Purchased'].unique())
print("Encoded target values:", np.unique(y))
print("Mapping: No=0, Yes=1")

2. One-Hot Encoding

Creates binary columns for each category (recommended for nominal data)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the Country column (column 0)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X))

print("Shape after one-hot encoding:", X.shape)
print("First 5 rows after encoding:")
print(X[:5])
# The columns are now: [France, Germany, Spain, Age, Salary]

3. Manual One-Hot Encoding with Pandas

# Alternative method using pandas
df = pd.DataFrame(dataset)

# Create dummy variables for Country
country_dummies = pd.get_dummies(df['Country'], prefix='Country')
print("Country dummy variables:")
print(country_dummies.head())

# Combine with other features
df_encoded = pd.concat([country_dummies, df[['Age', 'Salary']]], axis=1)
print("\nFinal encoded dataset:")
print(df_encoded.head())
Dummy Variable Trap: When using one-hot encoding, you can drop one column to avoid multicollinearity (n categories → n-1 columns).
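
Both encoders can drop the first dummy column directly; a minimal sketch (assuming a scikit-learn version that supports drop='first'):

# scikit-learn: drop the first category of each encoded feature
from sklearn.preprocessing import OneHotEncoder
encoder_dropped = OneHotEncoder(drop='first')

# pandas: the same idea with get_dummies
country_dummies = pd.get_dummies(dataset['Country'], prefix='Country', drop_first=True)
print(country_dummies.head())  # only Country_Germany and Country_Spain remain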

Splitting Dataset into Train-Test Sets

Why Split the Data?

  • Training Set: Used to train the model
  • Test Set: Used to evaluate model performance on unseen data
  • Validation Set: Used for hyperparameter tuning (optional)

Common Split Ratios

  • 80-20: 80% training, 20% testing
  • 70-30: 70% training, 30% testing
  • 60-20-20: 60% training, 20% validation, 20% testing (sketched below)
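
A 60-20-20 split can be produced with two consecutive calls to train_test_split; a minimal sketch (the second test_size of 0.25 is 25% of the remaining 80%, i.e. 20% of the full dataset):

from sklearn.model_selection import train_test_split

# First, hold out the 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into 60% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)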

Implementation

from sklearn.model_selection import train_test_split

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,   # For reproducible results
    stratify=y         # Maintain class distribution
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Test target shape:", y_test.shape)

# Check class distribution
print("\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

print("\nClass distribution in test set:")
unique, counts = np.unique(y_test, return_counts=True)
print(dict(zip(unique, counts)))

Advanced Splitting Techniques

# Time series split (for temporal data)
from sklearn.model_selection import TimeSeriesSplit

# K-fold cross-validation split
from sklearn.model_selection import KFold

# Stratified K-fold (maintains class distribution)
from sklearn.model_selection import StratifiedKFold

# Example: 5-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(X_train):
    X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
    y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]
    # Train and validate the model here
Important: Never use test data for any preprocessing decisions. Fit preprocessing on training data only!

Feature Scaling

What is Feature Scaling?

Feature scaling transforms features to similar scales to prevent features with larger ranges from dominating the learning process.

When is Feature Scaling Needed?

  • Distance-based algorithms: KNN, K-means, SVM
  • Gradient-based algorithms: Neural networks, logistic regression
  • Regularized algorithms: Ridge, Lasso regression

When Feature Scaling is NOT Needed

  • Tree-based algorithms: Decision trees, Random Forest, XGBoost
  • Naive Bayes

Example: Why Scaling Matters

# Example data showing the scale difference
print("Sample data before scaling:")
print("Age range:", X_train[:, -2].min(), "to", X_train[:, -2].max())
print("Salary range:", X_train[:, -1].min(), "to", X_train[:, -1].max())

# Age: 27-50, Salary: 48000-83000
# Without scaling, salary will dominate distance calculations!

Types of Feature Scaling

There are two main approaches: Standardization and Normalization.

Standardization (Z-score Normalization)

Formula

z = (x - μ) / σ

  • μ = mean of the feature
  • σ = standard deviation of the feature
  • Result: mean = 0, standard deviation = 1

When to Use Standardization

  • Data approximately follows a normal distribution
  • Features have different units (e.g., age vs. salary)
  • Outliers are present (standardization is less sensitive to them than min-max scaling)
  • Most common choice for feature scaling

Implementation

from sklearn.preprocessing import StandardScaler

# Create StandardScaler object
sc = StandardScaler()

# Fit on training data and transform both training and test data
# Only scale numerical features (last 2 columns: Age and Salary)
X_train[:, -2:] = sc.fit_transform(X_train[:, -2:])
X_test[:, -2:] = sc.transform(X_test[:, -2:])

print("After standardization:")
print("Training data (first 5 rows):")
print(X_train[:5])

print("\nMean of scaled features (should be ~0):")
print("Age mean:", np.mean(X_train[:, -2]))
print("Salary mean:", np.mean(X_train[:, -1]))

print("\nStd of scaled features (should be ~1):")
print("Age std:", np.std(X_train[:, -2]))
print("Salary std:", np.std(X_train[:, -1]))

Manual Standardization

# Manual implementation for understanding
def manual_standardization(X_train, X_test):
    # Calculate mean and std from training data
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)

    # Apply to both training and test data
    X_train_scaled = (X_train - mean) / std
    X_test_scaled = (X_test - mean) / std
    return X_train_scaled, X_test_scaled

# Example usage
age_salary_train = X_train[:, -2:].astype(float)
age_salary_test = X_test[:, -2:].astype(float)
scaled_train, scaled_test = manual_standardization(age_salary_train, age_salary_test)

Normalization (Min-Max Scaling)

Formula

x_norm = (x - min) / (max - min)

  • Result: values between 0 and 1
  • Preserves the original distribution shape
  • All features have the same scale [0, 1]

When to Use Normalization

  • Data doesn't follow a normal distribution
  • You want to preserve the original distribution
  • Neural networks (often prefer [0, 1] range)
  • When you know the approximate upper and lower bounds

Implementation

from sklearn.preprocessing import MinMaxScaler

# Create MinMaxScaler object
scaler = MinMaxScaler()

# Reset data for demonstration
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()

# Fit on training data and transform both sets
X_train_norm[:, -2:] = scaler.fit_transform(X_train_norm[:, -2:])
X_test_norm[:, -2:] = scaler.transform(X_test_norm[:, -2:])

print("After normalization:")
print("Training data (first 5 rows):")
print(X_train_norm[:5])

print("\nMin values (should be ~0):")
print("Age min:", np.min(X_train_norm[:, -2]))
print("Salary min:", np.min(X_train_norm[:, -1]))

print("\nMax values (should be ~1):")
print("Age max:", np.max(X_train_norm[:, -2]))
print("Salary max:", np.max(X_train_norm[:, -1]))

Custom Range Normalization

# Normalize to a custom range [a, b]
scaler_custom = MinMaxScaler(feature_range=(-1, 1))  # Range [-1, 1]

# Or a manual implementation
def normalize_to_range(X, min_val=0, max_val=1):
    X_min = np.min(X, axis=0)
    X_max = np.max(X, axis=0)
    X_scaled = (X - X_min) / (X_max - X_min)
    return X_scaled * (max_val - min_val) + min_val

# Example: normalize to [-1, 1]
X_custom = normalize_to_range(X_train[:, -2:], -1, 1)
print("Custom range [-1, 1]:")
print("Min:", np.min(X_custom, axis=0))
print("Max:", np.max(X_custom, axis=0))

Standardization vs Normalization Comparison

# Comparison table
import pandas as pd

comparison = pd.DataFrame({
    'Aspect': ['Output Range', 'Distribution', 'Outlier Sensitivity', 'Use Case'],
    'Standardization': ['(-∞, +∞)', 'Mean 0, std 1 (shape preserved)', 'Less sensitive', 'Most algorithms'],
    'Normalization': ['[0, 1]', 'Preserves original', 'More sensitive', 'Neural networks, bounded data']
})
print(comparison.to_string(index=False))
Key Rule: Always fit the scaler on training data only, then transform both training and test data using the same scaler parameters.

Summary

Complete Data Preprocessing Pipeline

# Complete preprocessing pipeline example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def preprocess_data(file_path):
    # 1. Load dataset
    dataset = pd.read_csv(file_path)

    # 2. Separate features and target
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values

    # 3. Handle missing data
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

    # 4. Encode categorical data
    # One-hot encode the independent variable
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder='passthrough'
    )
    X = np.array(ct.fit_transform(X))

    # Label encode the dependent variable
    le = LabelEncoder()
    y = le.fit_transform(y)

    # 5. Split dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 6. Feature scaling
    sc = StandardScaler()
    X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
    X_test[:, 3:] = sc.transform(X_test[:, 3:])

    return X_train, X_test, y_train, y_test

# Usage
# X_train, X_test, y_train, y_test = preprocess_data('customers.csv')

Key Takeaways

  • Data Quality: Good preprocessing is crucial for model performance
  • Missing Data: Choose appropriate strategy based on data type and missingness pattern
  • Categorical Encoding: Use label encoding for ordinal, one-hot for nominal data
  • Train-Test Split: Always split before preprocessing to avoid data leakage
  • Feature Scaling: Choose standardization or normalization based on data distribution and algorithm requirements
  • Fit-Transform Rule: Fit on training data, transform both training and test data

Best Practices

  1. Understand your data first (EDA - Exploratory Data Analysis)
  2. Handle missing data appropriately
  3. Choose encoding methods based on data type
  4. Split data before any preprocessing
  5. Scale features when necessary
  6. Document your preprocessing steps
  7. Create reusable preprocessing pipelines (see the sketch below)
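
For point 7, scikit-learn's Pipeline and ColumnTransformer make the preprocessing steps reusable and help enforce the fit-on-train / transform-on-test rule. A minimal sketch for the customers dataset (column names taken from the sample CSV above):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

dataset = pd.read_csv('customers.csv')
X = dataset.drop(columns='Purchased')
y = dataset['Purchased']

# Numerical columns: impute then scale; categorical column: one-hot encode
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, ['Age', 'Salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country'])
])

# Fit on the training split only, then apply the same transform to the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
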
Next Steps: After preprocessing, your data is ready for machine learning algorithms. The quality of preprocessing directly impacts model performance, so invest time in this crucial step!