What is Data Wrangling?
Data wrangling means cleaning and preparing raw data so that it becomes useful for analysis and machine learning.
It is also called:
- Data Preprocessing
- Data Cleaning
- Data Munging
Example
Imagine you get raw data like this:
| Name | Age | Salary |
|---|---|---|
| Ram | 25 | 50000 |
| Sita | | 60000 |
| John | 300 | 70000 |
Problems
- Age missing for Sita ❌
- Age 300 for John (wrong value) ❌
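A minimal pandas sketch (the tiny frame below just copies the table above) that surfaces both problems automatically:

```python
import pandas as pd

# Raw data copied from the table above; None marks Sita's missing age
df = pd.DataFrame({
    "Name": ["Ram", "Sita", "John"],
    "Age": [25, None, 300],
    "Salary": [50000, 60000, 70000],
})

print(df.isna().sum())       # missing values per column -> Age: 1
print(df[df["Age"] > 120])   # impossible ages -> John, 300
```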
Complete Process of Data Wrangling
Step 1: Discovering (Understanding the Data)
What we do:
- Check where data comes from
- Understand columns
- Understand data types
- Ask questions about data
Example:
If you get sales data, ask:
- How many sales happened?
- Which product sold more?
- Which month had the highest sales?
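A quick way to do this discovery step in pandas (the file name `sales.csv` and its columns are assumptions, not from the original data):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical sales file

print(df.shape)       # how many rows (sales) and columns
print(df.dtypes)      # data type of every column
print(df.head())      # first few rows, to understand the columns
print(df.describe())  # quick summary statistics
```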
Step 2: Cleaning
What we fix:
- Missing values
- Wrong values
- Duplicate rows
- Very large or very small values (outliers)
- Format issues
Handling Missing Values
| Name | Marks |
|---|---|
| A | 80 |
| B | |
| C | 75 |
What to do?
If many values are missing:
- Fill with Mean
- Fill with Median
- Fill with Mode
If only one row is missing:
- Delete that row
A small code sketch of both options follows below.
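Here is that sketch in pandas (the Name/Marks frame is the table above):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [80, None, 75]})

# Option 1: fill the missing mark with the mean (median/mode work the same way)
df_filled = df.copy()
df_filled["Marks"] = df_filled["Marks"].fillna(df_filled["Marks"].mean())

# Option 2: only one row is missing, so simply drop it
df_dropped = df.dropna(subset=["Marks"])
```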
Step 3: Data Validation
What we check:
- Is data correct?
- Is data logical?
- Is data trustworthy?
Example:
- Age cannot be 300
- Salary cannot be negative
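A hedged sketch of such validation rules (the toy frame and the cut-off of 120 years are assumptions; pick limits that make sense for your own data):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Sita", "John"],
    "Age": [25, 32, 300],
    "Salary": [50000, 60000, -70000],
})

# Flag rows that break simple business rules
bad_age = df[(df["Age"] < 0) | (df["Age"] > 120)]  # age 300 is not logical
bad_salary = df[df["Salary"] < 0]                   # salary cannot be negative

print(bad_age)
print(bad_salary)
```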
Step 4: Structuring
We format data properly so that:
- Machine learning models can use it
- No confusion in columns
- Proper structure
Example:
- Convert text date into datetime format
- Rename columns properly
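Both structuring fixes are short in pandas (the column names below are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({"order date": ["2024-01-05", "2024-02-10"], "AMT": [250, 400]})

# Convert the text date into a real datetime column
df["order date"] = pd.to_datetime(df["order date"])

# Rename columns to clear, consistent names
df = df.rename(columns={"order date": "order_date", "AMT": "amount"})

print(df.dtypes)
```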
EDA (Exploratory Data Analysis)
EDA means exploring data to understand it better, like exploring a new city.
EDA Cycle
- Ask Question
- Check Data
- Find Answer
- Ask New Question
- Repeat
Example: Sales Data
- First Question: How many sales this year?
- After checking: Sales increased!
- Second Question: Why did sales increase?
Maybe:
- Marketing improved
- New product launched
- Festival season
This cycle continues.
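One pass of this cycle might look like the sketch below (the `sales.csv` file and its `month`, `product`, `amount` columns are assumptions):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical sales data

# Question 1: how many sales this year?
print(len(df))

# Question 2: which product sold more?
print(df.groupby("product")["amount"].sum().sort_values(ascending=False))

# Question 3: which month had the highest sales?
print(df.groupby("month")["amount"].sum().idxmax())
```

Each answer usually raises the next question, which is the whole point of EDA.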
ETL and ELT
ETL = Extract → Transform → Load
Used for structured data.
- Extract data from source
- Transform (clean, modify)
- Load into database
ELT = Extract → Load → Transform
Used for unstructured or semi-structured data.
- Extract
- Load directly
- Transform inside system
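A tiny ETL sketch in Python: extract from a CSV, transform in pandas, load into SQLite (the file, column, and table names are all assumptions):

```python
import sqlite3
import pandas as pd

# Extract: read raw data from the source
df = pd.read_csv("raw_sales.csv")  # hypothetical source file

# Transform: clean and modify before loading
df = df.dropna()
df["amount"] = df["amount"].astype(float)

# Load: write the cleaned data into a database table
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
```

In ELT, the load step would come first and the cleaning would happen inside the target system instead.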
Feature Engineering
Feature = Column in dataset
| Discipline | Hard Work | Smart Work | Milk | Score |
|---|---|---|---|---|
If Milk does not affect Score, remove the Milk column.
Why remove features?
- Less memory needed
- Faster model
- Less complexity
- Better accuracy
Real Example:
- Discipline ✅
- Hard Work ✅
- Smart Work ✅
- Drink Milk ❌ (Not important)
So remove "Milk".
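Dropping the unhelpful column is one line in pandas (the toy frame below is an assumed example matching the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "Discipline": [8, 6, 9],
    "Hard Work": [7, 5, 8],
    "Smart Work": [6, 7, 9],
    "Milk": [1, 0, 1],
    "Score": [82, 65, 90],
})

# Milk has no real effect on Score, so drop it
df = df.drop(columns=["Milk"])
```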
Data Normalization
Problem Example:
Data: 1, 2, 100
- 1 to 2 = 1
- 2 to 100 = 98 ❌
Model gets confused.
Solution: Normalize (0 to 1)
After min-max normalization: 0.0, 0.01, 1.0
Now all values between 0 and 1 ✅
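A minimal min-max scaling sketch in plain Python; scikit-learn's `MinMaxScaler` does the same thing on whole columns:

```python
values = [1, 2, 100]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.0101..., 1.0] -- every value now sits between 0 and 1
```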
Complete Data Life Cycle
- Discover Questions
- Acquire Data
- Clean Data
- Explore Data (EDA)
- Feature Engineering
- Build Machine Learning Model
- Data Visualization
- Repeat Cycle
It never stops. New data → Repeat again.
Simple Full Example (End-to-End)
Suppose you want to predict crop price.
Step 1: Ask Questions
- Does rainfall affect price?
- Does production affect price?
Step 2: Get Data
- Rainfall
- Production
- Min price
- Max price
- Modal price
Step 3: Clean Data
- Remove missing rainfall values
- Fix wrong prices
Step 4: Normalize Data
- Scale rainfall
- Scale production
Step 5: Feature Engineering
- Remove unnecessary column
- Create new feature (Price Range = Max - Min)
Step 6: Train Model
- Linear Regression
- Random Forest
Step 7: Visualize Results
- Graph
- Correlation heatmap
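A condensed sketch of these seven steps in Python (the `crop_prices.csv` file and every column name are assumptions, not a real dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Steps 2-3: get data and clean it
df = pd.read_csv("crop_prices.csv")       # hypothetical dataset
df = df.dropna(subset=["rainfall"])       # remove missing rainfall values
df = df[df["modal_price"] > 0]            # drop rows with wrong prices

# Step 4: normalize rainfall and production to the 0-1 range
scaler = MinMaxScaler()
df[["rainfall", "production"]] = scaler.fit_transform(df[["rainfall", "production"]])

# Step 5: feature engineering -- a new feature from existing columns
df["price_range"] = df["max_price"] - df["min_price"]

# Step 6: train a simple model
X = df[["rainfall", "production", "price_range"]]
y = df["modal_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))

# Step 7: visualize results with matplotlib/seaborn (graphs, correlation heatmap)
```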
