Data Preprocessing in Machine Learning

Essential steps to prepare your data for machine learning algorithms

Introduction to Data Preprocessing

Data preprocessing is the process of transforming raw data into a format that can be effectively used by machine learning algorithms. It's often considered the most crucial step in the machine learning pipeline, as the quality of your data directly impacts the performance of your model.

Key Point: "Garbage in, garbage out" - Poor quality data leads to poor model performance, regardless of the algorithm used.

What is Data Preprocessing?

Data preprocessing involves several techniques to clean, transform, and prepare data:

  • Cleaning data (handling missing values, outliers)
  • Transforming data (encoding, scaling)
  • Reducing data (feature selection, dimensionality reduction)
  • Splitting data (train/test/validation sets)

Why is Data Preprocessing Needed?

1. Real-world Data is Messy

  • Missing Values: Incomplete records due to data collection issues
  • Inconsistent Formats: Different date formats, case sensitivity
  • Outliers: Extreme values that can skew results
  • Noise: Random errors in data collection

2. Algorithm Requirements

  • Numerical Input: Most ML algorithms require numerical data
  • Scale Sensitivity: Algorithms such as KNN and SVM are sensitive to feature scales (see the sketch after this list)
  • Distribution Assumptions: Some algorithms assume normal distribution
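
Scale sensitivity is easy to see with a quick distance calculation; this is a minimal sketch with made-up values (not from the customers dataset):

import numpy as np

# Two customers: 10 years apart in age, 1000 apart in salary (made-up values)
a = np.array([30.0, 50000.0])
b = np.array([40.0, 51000.0])

# The Euclidean distance is dominated by the salary difference;
# the 10-year age gap contributes almost nothing
print(np.linalg.norm(a - b))  # ~1000.05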

3. Performance Optimization

  • Faster Training: Clean data trains faster
  • Better Accuracy: Proper preprocessing improves model performance
  • Reduced Overfitting: Good data helps models generalize better
Warning: Skipping preprocessing can lead to biased models, poor performance, and unreliable predictions.

Getting the Dataset

Common Data Sources

  • CSV Files: Most common format for structured data
  • Databases: SQL databases, NoSQL databases
  • APIs: Web APIs, REST services
  • Web Scraping: Extracting data from websites
  • Public Datasets: Kaggle, UCI ML Repository, government data
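
Whichever source you use, pandas can usually load the result into a DataFrame. Below is a minimal sketch; the file name, database, table name, and URL are hypothetical placeholders:

import pandas as pd

# CSV file (hypothetical path)
df_csv = pd.read_csv('data/customers.csv')

# SQL database (hypothetical SQLite file and table)
import sqlite3
conn = sqlite3.connect('customers.db')
df_sql = pd.read_sql('SELECT * FROM customers', conn)

# Web API returning JSON (hypothetical URL)
import requests
response = requests.get('https://example.com/api/customers')
df_api = pd.DataFrame(response.json())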

Sample Dataset Structure

For this lecture, we'll use a sample customer dataset:

# Sample dataset: customers.csv
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes

Importing Libraries

Essential Python Libraries for Data Preprocessing

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

Library Functions Overview

  • pandas: Data manipulation, reading/writing files
  • numpy: Numerical operations, array handling
  • sklearn: Machine learning preprocessing tools
  • matplotlib/seaborn: Data visualization
Installation: pip install pandas numpy scikit-learn matplotlib seaborn

Importing Dataset

Reading Data with Pandas

# Import the dataset
dataset = pd.read_csv('customers.csv')

# Display basic information about the dataset
print("Dataset shape:", dataset.shape)

print("\nFirst 5 rows:")
print(dataset.head())

print("\nDataset info:")
dataset.info()  # info() prints directly and returns None

print("\nStatistical summary:")
print(dataset.describe())

Exploring the Dataset

# Check for missing values
print("Missing values per column:")
print(dataset.isnull().sum())

# Check data types
print("\nData types:")
print(dataset.dtypes)

# Check unique values in categorical columns
print("\nUnique countries:", dataset['Country'].unique())
print("Unique purchased values:", dataset['Purchased'].unique())

Separating Features and Target

# Separate independent variables (features) and dependent variable (target)
X = dataset.iloc[:, :-1].values  # All columns except the last one
y = dataset.iloc[:, -1].values   # Last column (target variable)

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nFeatures (first 5 rows):")
print(X[:5])
print("\nTarget (first 5 values):")
print(y[:5])

Handling Missing Data

Why Missing Data Occurs

  • Data collection errors: Sensor failures, human errors
  • Privacy concerns: Users not providing sensitive information
  • System issues: Database corruption, network problems
  • Survey non-response: Participants skipping questions

Types of Missing Data

  • MCAR (Missing Completely at Random): Missingness is unrelated to any values, observed or unobserved
  • MAR (Missing at Random): Missingness depends only on observed data
  • MNAR (Missing Not at Random): Missingness depends on the unobserved (missing) values themselves

Strategies for Handling Missing Data

1. Deletion Methods

# Remove rows with any missing values
dataset_dropna = dataset.dropna()
print("Original shape:", dataset.shape)
print("After dropping NaN:", dataset_dropna.shape)

# Remove columns with missing values
dataset_drop_cols = dataset.dropna(axis=1)
print("After dropping columns with NaN:", dataset_drop_cols.shape)

2. Imputation Methods

# Using SimpleImputer for numerical data
from sklearn.impute import SimpleImputer

# Create imputer for numerical columns (Age, Salary)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform numerical columns (columns 1 and 2: Age and Salary)
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print("After imputation (first 10 rows):")
print(X[:10])

3. Different Imputation Strategies

# Mean imputation (for numerical data)
imputer_mean = SimpleImputer(strategy='mean')

# Median imputation (robust to outliers)
imputer_median = SimpleImputer(strategy='median')

# Mode imputation (for categorical data)
imputer_mode = SimpleImputer(strategy='most_frequent')

# Constant imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)

# Example: Using median for the salary column
X_copy = X.copy()
X_copy[:, 2:3] = imputer_median.fit_transform(X_copy[:, 2:3])
Important: Always fit the imputer on training data only, then transform both training and test data to avoid data leakage.

Encoding Categorical Data

Why Encode Categorical Data?

Most machine learning algorithms work only with numerical data, so categorical data (text labels) must be converted to a numerical format.

Types of Categorical Data

  • Nominal: No inherent order (Country: France, Spain, Germany)
  • Ordinal: Has an inherent order (Education: High School < Bachelor < Master < PhD); see the sketch after this list
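
For ordinal features, scikit-learn's OrdinalEncoder can encode categories with an explicit order. A minimal sketch using the hypothetical Education example above (this column is not part of the customers dataset):

import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with an explicit low-to-high order
education = np.array([['Bachelor'], ['High School'], ['PhD'], ['Master']])

encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
print(encoder.fit_transform(education).ravel())  # [1. 0. 3. 2.]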

1. Label Encoding

Converts categories to integers (0, 1, 2, ...)

from sklearn.preprocessing import LabelEncoder

# Encode the target variable (Purchased: No=0, Yes=1)
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)

print("Original target values:", dataset['Purchased'].unique())
print("Encoded target values:", np.unique(y))
print("Mapping: No=0, Yes=1")

2. One-Hot Encoding

Creates binary columns for each category (recommended for nominal data)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the Country column (column 0)
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X))

print("Shape after one-hot encoding:", X.shape)
print("First 5 rows after encoding:")
print(X[:5])
# The columns are now: [France, Germany, Spain, Age, Salary]

3. Manual One-Hot Encoding with Pandas

# Alternative method using pandas
df = pd.DataFrame(dataset)

# Create dummy variables for Country
country_dummies = pd.get_dummies(df['Country'], prefix='Country')
print("Country dummy variables:")
print(country_dummies.head())

# Combine with other features
df_encoded = pd.concat([country_dummies, df[['Age', 'Salary']]], axis=1)
print("\nFinal encoded dataset:")
print(df_encoded.head())
Dummy Variable Trap: When using one-hot encoding, you can drop one column to avoid multicollinearity (n categories → n-1 columns).
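
Both encoders can drop the first dummy column directly; a minimal sketch (assuming a scikit-learn version that supports drop='first'):

# scikit-learn: drop the first category of each encoded feature
from sklearn.preprocessing import OneHotEncoder
encoder_dropped = OneHotEncoder(drop='first')

# pandas: the same idea with get_dummies
country_dummies = pd.get_dummies(dataset['Country'], prefix='Country', drop_first=True)
print(country_dummies.head())  # only Country_Germany and Country_Spain remain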

Splitting Dataset into Train-Test Sets

Why Split the Data?

  • Training Set: Used to train the model
  • Test Set: Used to evaluate model performance on unseen data
  • Validation Set: Used for hyperparameter tuning (optional)

Common Split Ratios

  • 80-20: 80% training, 20% testing
  • 70-30: 70% training, 30% testing
  • 60-20-20: 60% training, 20% validation, 20% testing (sketched below)
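
A 60-20-20 split can be produced with two consecutive calls to train_test_split; a minimal sketch (the second test_size of 0.25 is 25% of the remaining 80%, i.e. 20% of the full dataset):

from sklearn.model_selection import train_test_split

# First, hold out the 20% test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into 60% train / 20% validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)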

Implementation

from sklearn.model_selection import train_test_split

# Split the dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% for testing
    random_state=42,   # For reproducible results
    stratify=y         # Maintain class distribution
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Test target shape:", y_test.shape)

# Check class distribution
print("\nClass distribution in training set:")
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

print("\nClass distribution in test set:")
unique, counts = np.unique(y_test, return_counts=True)
print(dict(zip(unique, counts)))

Advanced Splitting Techniques

# Time series split (for temporal data)
from sklearn.model_selection import TimeSeriesSplit

# K-fold cross-validation split
from sklearn.model_selection import KFold

# Stratified K-fold (maintains class distribution)
from sklearn.model_selection import StratifiedKFold

# Example: 5-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(X_train):
    X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
    y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]
    # Train and validate the model here
Important: Never use test data for any preprocessing decisions. Fit preprocessing on training data only!

Feature Scaling

What is Feature Scaling?

Feature scaling transforms features to similar scales to prevent features with larger ranges from dominating the learning process.

When is Feature Scaling Needed?

  • Distance-based algorithms: KNN, K-means, SVM
  • Gradient-based algorithms: Neural networks, logistic regression
  • Regularized algorithms: Ridge, Lasso regression

When Feature Scaling is NOT Needed

  • Tree-based algorithms: Decision trees, Random Forest, XGBoost
  • Naive Bayes

Example: Why Scaling Matters

# Example data showing the scale difference
print("Sample data before scaling:")
print("Age range:", X_train[:, -2].min(), "to", X_train[:, -2].max())
print("Salary range:", X_train[:, -1].min(), "to", X_train[:, -1].max())

# Age: 27-50, Salary: 48000-83000
# Without scaling, salary will dominate distance calculations!

Types of Feature Scaling

There are two main approaches: Standardization and Normalization.

Standardization (Z-score Normalization)

Formula

z = (x - μ) / σ

  • μ = mean of the feature
  • σ = standard deviation of the feature
  • Result: mean = 0, standard deviation = 1

When to Use Standardization

  • Data approximately follows a normal distribution
  • Features have different units (e.g., age vs. salary)
  • Outliers are present (standardization is less sensitive to them than min-max scaling)
  • Most common choice for feature scaling

Implementation

from sklearn.preprocessing import StandardScaler

# Create StandardScaler object
sc = StandardScaler()

# Fit on training data and transform both training and test data
# Only scale numerical features (last 2 columns: Age and Salary)
X_train[:, -2:] = sc.fit_transform(X_train[:, -2:])
X_test[:, -2:] = sc.transform(X_test[:, -2:])

print("After standardization:")
print("Training data (first 5 rows):")
print(X_train[:5])

print("\nMean of scaled features (should be ~0):")
print("Age mean:", np.mean(X_train[:, -2]))
print("Salary mean:", np.mean(X_train[:, -1]))

print("\nStd of scaled features (should be ~1):")
print("Age std:", np.std(X_train[:, -2]))
print("Salary std:", np.std(X_train[:, -1]))

Manual Standardization

# Manual implementation for understanding
def manual_standardization(X_train, X_test):
    # Calculate mean and std from training data
    mean = np.mean(X_train, axis=0)
    std = np.std(X_train, axis=0)

    # Apply to both training and test data
    X_train_scaled = (X_train - mean) / std
    X_test_scaled = (X_test - mean) / std
    return X_train_scaled, X_test_scaled

# Example usage
age_salary_train = X_train[:, -2:].astype(float)
age_salary_test = X_test[:, -2:].astype(float)
scaled_train, scaled_test = manual_standardization(age_salary_train, age_salary_test)

Normalization (Min-Max Scaling)

Formula

x_norm = (x - min) / (max - min)

  • Result: values between 0 and 1
  • Preserves the original distribution shape
  • All features have the same scale [0, 1]

When to Use Normalization

  • Data doesn't follow a normal distribution
  • You want to preserve the original distribution
  • Neural networks (often prefer [0, 1] range)
  • When you know the approximate upper and lower bounds

Implementation

from sklearn.preprocessing import MinMaxScaler

# Create MinMaxScaler object
scaler = MinMaxScaler()

# Reset data for demonstration
X_train_norm = X_train.copy()
X_test_norm = X_test.copy()

# Fit on training data and transform both sets
X_train_norm[:, -2:] = scaler.fit_transform(X_train_norm[:, -2:])
X_test_norm[:, -2:] = scaler.transform(X_test_norm[:, -2:])

print("After normalization:")
print("Training data (first 5 rows):")
print(X_train_norm[:5])

print("\nMin values (should be ~0):")
print("Age min:", np.min(X_train_norm[:, -2]))
print("Salary min:", np.min(X_train_norm[:, -1]))

print("\nMax values (should be ~1):")
print("Age max:", np.max(X_train_norm[:, -2]))
print("Salary max:", np.max(X_train_norm[:, -1]))

Custom Range Normalization

# Normalize to a custom range [a, b]
scaler_custom = MinMaxScaler(feature_range=(-1, 1))  # Range [-1, 1]

# Or a manual implementation
def normalize_to_range(X, min_val=0, max_val=1):
    X_min = np.min(X, axis=0)
    X_max = np.max(X, axis=0)
    X_scaled = (X - X_min) / (X_max - X_min)
    return X_scaled * (max_val - min_val) + min_val

# Example: normalize to [-1, 1]
X_custom = normalize_to_range(X_train[:, -2:], -1, 1)
print("Custom range [-1, 1]:")
print("Min:", np.min(X_custom, axis=0))
print("Max:", np.max(X_custom, axis=0))

Standardization vs Normalization Comparison

# Comparison table
import pandas as pd

comparison = pd.DataFrame({
    'Aspect': ['Output Range', 'Distribution', 'Outlier Sensitivity', 'Use Case'],
    'Standardization': ['(-∞, +∞)', 'Mean 0, std 1 (shape preserved)', 'Less sensitive', 'Most algorithms'],
    'Normalization': ['[0, 1]', 'Preserves original', 'More sensitive', 'Neural networks, bounded data']
})
print(comparison.to_string(index=False))
Key Rule: Always fit the scaler on training data only, then transform both training and test data using the same scaler parameters.

Summary

Complete Data Preprocessing Pipeline

# Complete preprocessing pipeline example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def preprocess_data(file_path):
    # 1. Load dataset
    dataset = pd.read_csv(file_path)

    # 2. Separate features and target
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values

    # 3. Handle missing data
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

    # 4. Encode categorical data
    # One-hot encode the independent variable
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), [0])],
        remainder='passthrough'
    )
    X = np.array(ct.fit_transform(X))

    # Label encode the dependent variable
    le = LabelEncoder()
    y = le.fit_transform(y)

    # 5. Split dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 6. Feature scaling
    sc = StandardScaler()
    X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
    X_test[:, 3:] = sc.transform(X_test[:, 3:])

    return X_train, X_test, y_train, y_test

# Usage
# X_train, X_test, y_train, y_test = preprocess_data('customers.csv')

Key Takeaways

  • Data Quality: Good preprocessing is crucial for model performance
  • Missing Data: Choose appropriate strategy based on data type and missingness pattern
  • Categorical Encoding: Use label encoding for ordinal, one-hot for nominal data
  • Train-Test Split: Always split before preprocessing to avoid data leakage
  • Feature Scaling: Choose standardization or normalization based on data distribution and algorithm requirements
  • Fit-Transform Rule: Fit on training data, transform both training and test data

Best Practices

  1. Understand your data first (EDA - Exploratory Data Analysis)
  2. Handle missing data appropriately
  3. Choose encoding methods based on data type
  4. Split data before any preprocessing
  5. Scale features when necessary
  6. Document your preprocessing steps
  7. Create reusable preprocessing pipelines (see the sketch below)
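
For point 7, scikit-learn's Pipeline and ColumnTransformer make the preprocessing steps reusable and help enforce the fit-on-train / transform-on-test rule. A minimal sketch for the customers dataset (column names taken from the sample CSV above):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

dataset = pd.read_csv('customers.csv')
X = dataset.drop(columns='Purchased')
y = dataset['Purchased']

# Numerical columns: impute then scale; categorical column: one-hot encode
numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_pipeline, ['Age', 'Salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Country'])
])

# Fit on the training split only, then apply the same transform to the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_prepared = preprocessor.fit_transform(X_train)
X_test_prepared = preprocessor.transform(X_test)
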
Next Steps: After preprocessing, your data is ready for machine learning algorithms. The quality of preprocessing directly impacts model performance, so invest time in this crucial step!