Before building machine learning models, we must prepare the dataset so it is clean, consistent, and ready for algorithms. This process is called Data Preprocessing.
| Library | Purpose |
|---|---|
| Pandas | Handling structured data (tables, CSV, Excel) |
| NumPy | Numerical computations, arrays, matrices |
| Matplotlib / Seaborn | Data visualization |
| Scikit-learn | Machine learning preprocessing + models |
By convention, Pandas is imported as `pd` and NumPy as `np` for consistency. Let's load the famous Titanic dataset. It is widely used for teaching data preprocessing and classification.
```python
import pandas as pd

# Load the Titanic dataset from a local CSV file
df = pd.read_csv("titanic.csv")
print(df.head())
```
Here are the top 5 rows of the dataset:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
When working with real datasets, you should check for common problems such as missing values and duplicate rows:
```python
# Count missing values per column
print(df.isnull().sum())
```
Common strategies for handling missing values include imputing numeric columns with the median, imputing categorical columns with the mode, or dropping the affected rows.
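As a minimal sketch of median imputation on a toy column (the values below are illustrative, not the real dataset):

```python
import pandas as pd

# Toy Age column with one missing value (illustrative, not the real Titanic data)
age = pd.Series([22.0, 38.0, None, 35.0])
age = age.fillna(age.median())  # fill the gap with the median of the observed values
print(age.isnull().sum())  # → 0
```

The median is preferred over the mean for skewed columns because it is robust to outliers.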
```python
# Remove exact duplicate rows
df = df.drop_duplicates()
```
Duplicates can bias your model by giving extra weight to repeated samples.
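To see the effect on a small example (made-up rows, not the actual dataset):

```python
import pandas as pd

# Toy frame where the first and last rows are exact duplicates
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.28, 7.25]})
print(toy.duplicated().sum())  # → 1 duplicate row detected
toy = toy.drop_duplicates()
print(len(toy))  # → 2 unique rows remain
```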
Many machine learning models only work with numbers. Encoding means converting text categories into numeric form.
Before Encoding:
| PassengerId | Sex | Embarked |
|---|---|---|
| 1 | male | S |
| 2 | female | C |
| 3 | female | S |
After One-Hot Encoding:
| PassengerId | Sex_male | Sex_female | Embarked_S | Embarked_C | Embarked_Q |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 |
```python
# One-hot encode the categorical columns using Pandas
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
print(df.head())
```
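The tables above can be reproduced on a toy frame. Note that `get_dummies` only creates columns for categories that actually appear in the data — there is no `Embarked_Q` below, since no toy row has `Q`:

```python
import pandas as pd

# Toy frame mirroring the "Before Encoding" table above
toy = pd.DataFrame({"PassengerId": [1, 2, 3],
                    "Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "S"]})
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
# → ['PassengerId', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```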
Some models (e.g., KNN, SVM) are sensitive to the scale of features, so numeric columns like Age and Fare should be standardized:
```python
from sklearn.preprocessing import StandardScaler

# Standardize Age and Fare to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
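Under the hood, `StandardScaler` applies the z-score formula z = (x − mean) / std to each column. A NumPy sketch on the Fare values from the five sample rows shown earlier:

```python
import numpy as np

# Fare values from the five sample rows shown earlier
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])
scaled = (fare - fare.mean()) / fare.std()  # same per-column formula StandardScaler uses
print(scaled.mean(), scaled.std())  # mean ≈ 0, std = 1 after scaling
```

Like `StandardScaler`, `np.std` here uses the population standard deviation (ddof=0), so the two computations agree exactly.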
Putting all the steps together into a single preprocessing script:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Load dataset ---
df = pd.read_csv("titanic.csv")

# --- Explore dataset ---
print(df.head())
print(df.info())
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

# --- Handle missing values ---
# Assign back instead of using inplace=True on a column,
# which is deprecated for chained calls in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# --- Remove duplicates ---
df = df.drop_duplicates()

# --- Encode categorical variables ---
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'])

# --- Drop irrelevant columns ---
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# --- Feature scaling ---
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# --- Verify clean dataset ---
print(df.head())
print(df.info())
```
After this preprocessing, the Titanic dataset is clean, encoded, scaled, and ready for feature engineering.