
Lesson 2: Python for Data Science - Data Preprocessing

1. Introduction

Before building machine learning models, we must prepare the dataset so it is clean, consistent, and ready for algorithms. This process is called Data Preprocessing.

2. Important Python Libraries/Modules for ML

  • Pandas – Handling structured data (tables, CSV, Excel)
  • NumPy – Numerical computations, arrays, matrices
  • Matplotlib / Seaborn – Data visualization
  • Scikit-learn – Machine learning preprocessing + models
Tip: Always import Pandas as pd and NumPy as np for consistency.

3. Loading the Dataset

Let’s load the famous Titanic dataset. It is widely used for teaching data preprocessing and classification.

import pandas as pd

df = pd.read_csv("titanic.csv")
print(df.head())

Here are the top 5 rows of the dataset:

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S

4. Exploring Issues in Data

When working with real datasets, you should check for common problems such as:

  • Missing values (e.g., Age or Cabin not recorded)
  • Duplicate records (same passenger entered multiple times)
  • Categorical variables (e.g., Sex, Embarked – algorithms need numbers, not strings)
  • Outliers (e.g., Fare = 512, which may skew results)
  • Inconsistent formatting (e.g., "male" vs "Male")
  • Irrelevant columns (e.g., Ticket number may not help prediction)
Tip: Before jumping to modeling, always audit your dataset for these issues.
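
A quick audit can be done in a few lines of Pandas. This sketch uses a small hand-made sample in place of titanic.csv (the values are illustrative only) to check for missing values, duplicate rows, and inconsistent formatting:

```python
import pandas as pd

# A tiny hand-made sample standing in for titanic.csv
sample = pd.DataFrame({
    "Sex": ["male", "Male", "female", "female"],
    "Age": [22, None, 26, 26],
    "Fare": [7.25, 71.28, 7.93, 7.93],
})

missing = sample.isnull().sum()        # missing values per column
dupes = sample.duplicated().sum()      # fully duplicated rows
formats = sample["Sex"].value_counts() # spots "male" vs "Male"

print(missing)
print("duplicate rows:", dupes)
print(formats)
```

Here value_counts() immediately reveals the "male" vs "Male" inconsistency, which a simple df['Sex'].str.lower() would fix.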

5. Handling Missing Values

# Check missing values
print(df.isnull().sum())

Common strategies:

  • Drop rows/columns with too many missing values
  • Fill with mean, median, or mode
  • Use advanced imputation methods (KNN, regression)
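
The two most common fill strategies can be sketched as follows, again on a small hand-made sample (median for a numeric column, mode for a categorical one):

```python
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill with the median (robust to outliers)
sample["Age"] = sample["Age"].fillna(sample["Age"].median())

# Categorical column: fill with the mode (most frequent value)
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

print(sample)
```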

6. Handling Duplicates

# Remove duplicates
df = df.drop_duplicates()

Duplicates can bias your model by giving extra weight to repeated samples.

7. Encoding Categorical Variables

Many machine learning models only work with numbers. Encoding means converting text categories into numeric form.

Why Encoding?

  • Algorithms can’t process strings like "male" or "female".
  • Encoding ensures all features are numeric.

Types of Encoding

  • Label Encoding – Converts categories into numbers (male=0, female=1)
  • One-Hot Encoding – Creates new columns (Sex_male, Sex_female)
  • Target Encoding – Replaces category with average target value (used carefully)
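
The first two types can be sketched with plain Pandas on a small hand-made sample (label encoding via map, one-hot via get_dummies):

```python
import pandas as pd

sample = pd.DataFrame({"Sex": ["male", "female", "female"],
                       "Embarked": ["S", "C", "S"]})

# Label Encoding: map each category to an integer
sample["Sex_label"] = sample["Sex"].map({"male": 0, "female": 1})

# One-Hot Encoding: one new 0/1 column per category
onehot = pd.get_dummies(sample["Embarked"], prefix="Embarked")
print(sample.join(onehot))
```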

Example:

Before Encoding:

PassengerId | Sex | Embarked
1 | male | S
2 | female | C
3 | female | S

After One-Hot Encoding:

PassengerId | Sex_male | Sex_female | Embarked_S | Embarked_C | Embarked_Q
1 | 1 | 0 | 1 | 0 | 0
2 | 0 | 1 | 0 | 1 | 0
3 | 0 | 1 | 1 | 0 | 0

# Using Pandas
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
print(df.head())
Tip: Use One-Hot Encoding when categories don’t have natural order (like city names).

8. Feature Scaling

Some models (e.g., KNN, SVM) are sensitive to scale. Use:

  • Min-Max Scaling: scales values between 0 and 1
  • Standardization: mean=0, std=1

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
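
Min-Max Scaling works the same way with MinMaxScaler. A minimal sketch on a small hand-made sample (the values are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

sample = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                       "Fare": [7.25, 71.28, 7.93]})

scaler = MinMaxScaler()  # rescales each column to the [0, 1] range
sample[["Age", "Fare"]] = scaler.fit_transform(sample[["Age", "Fare"]])
print(sample)
```

After transforming, the minimum of each column becomes 0 and the maximum becomes 1, so features measured on very different scales (Age in years, Fare in currency) become directly comparable.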

9. Consolidated Preprocessing Code (End-to-End)

import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Load dataset ---
df = pd.read_csv("titanic.csv")

# --- Explore dataset ---
print(df.head())
print(df.info())
print(df.isnull().sum())
print(df.duplicated().sum())

# --- Handle missing values ---
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# --- Remove duplicates ---
df = df.drop_duplicates()

# --- Encode categorical variables ---
df['Sex'] = df['Sex'].map({'male':0,'female':1})
df = pd.get_dummies(df, columns=['Embarked'])

# --- Drop irrelevant columns ---
df.drop(['PassengerId','Name','Ticket','Cabin'], axis=1, inplace=True)

# --- Feature scaling ---
scaler = StandardScaler()
df[['Age','Fare']] = scaler.fit_transform(df[['Age','Fare']])

# --- Verify clean dataset ---
print(df.head())
print(df.info())

10. Conclusion

After this preprocessing, the Titanic dataset is clean, encoded, scaled, and ready for feature engineering.