Before building machine learning models, we must prepare the dataset so it is clean, consistent, and ready for algorithms. This process is called Data Preprocessing.
| Library | Purpose |
|---|---|
| Pandas | Handling structured data (tables, CSV, Excel) |
| NumPy | Numerical computations, arrays, matrices |
| Matplotlib / Seaborn | Data visualization |
| Scikit-learn | Machine learning preprocessing + models |
By convention, Pandas is imported as `pd` and NumPy as `np` for consistency. Let's load the famous Titanic dataset. It is widely used for teaching data preprocessing and classification.
```python
import pandas as pd

# Load the Titanic dataset from a local CSV file
df = pd.read_csv("titanic.csv")
print(df.head())
```
Here are the top 5 rows of the dataset:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
When working with real datasets, you should check for common problems such as missing values and duplicate rows:
```python
# Count missing values per column
print(df.isnull().sum())
```
Common strategies for handling missing values include imputing numeric columns with the median, imputing categorical columns with the mode, or dropping the affected rows.
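As a minimal sketch of median imputation on a toy column (the values below are illustrative, not the real dataset):

```python
import pandas as pd

# Toy Age column with one missing value (illustrative, not the real Titanic data)
age = pd.Series([22.0, 38.0, None, 35.0])
age = age.fillna(age.median())  # fill the gap with the median of the observed values
print(age.isnull().sum())  # → 0
```

The median is preferred over the mean for skewed columns because it is robust to outliers.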
```python
# Remove exact duplicate rows
df = df.drop_duplicates()
```
Duplicates can bias your model by giving extra weight to repeated samples.
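To see the effect on a small example (made-up rows, not the actual dataset):

```python
import pandas as pd

# Toy frame where the first and last rows are exact duplicates
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.28, 7.25]})
print(toy.duplicated().sum())  # → 1 duplicate row detected
toy = toy.drop_duplicates()
print(len(toy))  # → 2 unique rows remain
```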
Many machine learning models only work with numbers. Encoding means converting text categories into numeric form.
Before Encoding:
| PassengerId | Sex | Embarked |
|---|---|---|
| 1 | male | S |
| 2 | female | C |
| 3 | female | S |
After One-Hot Encoding:
| PassengerId | Sex_male | Sex_female | Embarked_S | Embarked_C | Embarked_Q |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 |
```python
# One-hot encode the categorical columns using Pandas
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])
print(df.head())
```
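The tables above can be reproduced on a toy frame. Note that `get_dummies` only creates columns for categories that actually appear in the data — there is no `Embarked_Q` below, since no toy row has `Q`:

```python
import pandas as pd

# Toy frame mirroring the "Before Encoding" table above
toy = pd.DataFrame({"PassengerId": [1, 2, 3],
                    "Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "S"]})
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])
print(encoded.columns.tolist())
# → ['PassengerId', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
```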
Some models (e.g., KNN, SVM) are sensitive to the scale of features, so numeric columns like Age and Fare should be standardized:
```python
from sklearn.preprocessing import StandardScaler

# Standardize Age and Fare to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
```
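Under the hood, `StandardScaler` applies the z-score formula z = (x − mean) / std to each column. A NumPy sketch on the Fare values from the five sample rows shown earlier:

```python
import numpy as np

# Fare values from the five sample rows shown earlier
fare = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])
scaled = (fare - fare.mean()) / fare.std()  # same per-column formula StandardScaler uses
print(scaled.mean(), scaled.std())  # mean ≈ 0, std = 1 after scaling
```

Like `StandardScaler`, `np.std` here uses the population standard deviation (ddof=0), so the two computations agree exactly.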
Putting all the steps together into a single preprocessing script:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# --- Load dataset ---
df = pd.read_csv("titanic.csv")

# --- Explore dataset ---
print(df.head())
print(df.info())
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

# --- Handle missing values ---
# Assign back instead of using inplace=True on a column,
# which is deprecated for chained calls in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# --- Remove duplicates ---
df = df.drop_duplicates()

# --- Encode categorical variables ---
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'])

# --- Drop irrelevant columns ---
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# --- Feature scaling ---
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

# --- Verify clean dataset ---
print(df.head())
print(df.info())
```
After this preprocessing, the Titanic dataset is clean, encoded, scaled, and ready for feature engineering.