19/02/2026

Data Wrangling

What is Data Wrangling?

Data wrangling means cleaning and preparing raw data so that it becomes useful for analysis and machine learning.

It is also called:

  1. Data Preprocessing
  2. Data Cleaning
  3. Data Munging

Example

Imagine you get raw data like this:

Name   Age   Salary
Ram    25    50000
Sita         60000
John   300   70000

Problems

  • Age missing for Sita ❌
  • Age 300 for John (wrong value) ❌

So we clean and correct the data.
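This cleanup can be sketched in pandas. The data and the 120-year age cutoff below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with the two problems above
df = pd.DataFrame({
    "Name": ["Ram", "Sita", "John"],
    "Age": [25, np.nan, 300],
    "Salary": [50000, 60000, 70000],
})

# Treat impossible ages (here: over 120) as missing
df.loc[df["Age"] > 120, "Age"] = np.nan

# Fill missing ages with the median of the remaining valid ages
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

Since only one valid age (25) remains in this tiny example, every filled age becomes 25.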

Complete Process of Data Wrangling

Step 1: Discovering (Understanding the Data)

What we do:

  • Check where data comes from
  • Understand columns
  • Understand data type
  • Ask questions about data

Example:

If you get sales data, ask:

  • How many sales happened?
  • Which product sold more?
  • Which month had highest sales?

Step 2: Cleaning

What we fix:

  • Missing values
  • Wrong values
  • Duplicate rows
  • Very large or very small values (outliers)
  • Format issues

Handling Missing Values

Name   Marks
A      80
B
C      75

What to do?

If many rows have a missing value:

  • Fill with the Mean
  • Fill with the Median
  • Fill with the Mode

If only one or two rows are missing:

  • Delete those rows
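In pandas, both options look like this (using the toy marks table above):

```python
import numpy as np
import pandas as pd

# Toy marks table: B's marks are missing
df = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [80, np.nan, 75]})

# Option 1: fill the gap with the mean (use .median() or .mode()[0] similarly)
filled = df.assign(Marks=df["Marks"].fillna(df["Marks"].mean()))

# Option 2: delete the rows that have a missing value
dropped = df.dropna(subset=["Marks"])
```

The mean of 80 and 75 is 77.5, so B's marks become 77.5 in `filled`, while `dropped` keeps only A and C.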

Step 3: Data Validation

What we check:

  • Is data correct?
  • Is data logical?
  • Is data trustworthy?

Example:

  • Age cannot be 300
  • Salary cannot be negative

We correct such errors.
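Such checks can be written as simple filters in pandas. The rules below (age between 0 and 120, non-negative salary) are example validation rules:

```python
import pandas as pd

# Hypothetical data containing two invalid values
df = pd.DataFrame({
    "Name": ["Ram", "John"],
    "Age": [25, 300],
    "Salary": [50000, -1000],
})

# Keep only the rows that pass the validation rules
valid = df[df["Age"].between(0, 120) & (df["Salary"] >= 0)]
```

Here John's row fails both rules, so only Ram survives the filter.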

Step 4: Structuring

We format data properly so:

  • Machine learning models can use it
  • No confusion in columns
  • Proper structure

Example:

  • Convert text date into datetime format
  • Rename columns properly
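A small pandas sketch of both fixes (the column names and values are made up):

```python
import pandas as pd

# Raw data with a text date and an unclear column name
df = pd.DataFrame({
    "sale date": ["2026-01-15", "2026-02-03"],
    "AMT": [500, 750],
})

# Convert the text date into a real datetime column
df["sale date"] = pd.to_datetime(df["sale date"])

# Rename the columns so their meaning is obvious
df = df.rename(columns={"sale date": "sale_date", "AMT": "amount"})
```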

EDA (Exploratory Data Analysis)

EDA means exploring data to understand it better. Like exploring a new city.

EDA Cycle

  1. Ask Question
  2. Check Data
  3. Find Answer
  4. Ask New Question
  5. Repeat

Example: Sales Data

  • First Question: How many sales this year?
  • After checking: Sales increased!
  • Second Question: Why sales increased?

Maybe:

  • Marketing improved
  • New product launched
  • Festival season

This cycle continues.
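The question-check-answer loop can be run directly in pandas (the sales numbers below are invented):

```python
import pandas as pd

# Hypothetical sales data
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "product": ["A", "A", "A", "B", "B", "B"],
    "units": [10, 12, 30, 5, 7, 9],
})

# Question: which product sold more?
by_product = sales.groupby("product")["units"].sum()

# Follow-up question: which month had the highest sales?
by_month = sales.groupby("month")["units"].sum()
best_month = by_month.idxmax()

print(by_product)
print(best_month)
```

Each answer (product A leads, March peaks) then prompts the next question, for example: "What happened in March?"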

ETL and ELT

ETL = Extract → Transform → Load

Used for structured data.

  1. Extract data from source
  2. Transform (clean, modify)
  3. Load into database

ELT = Extract → Load → Transform

Used for unstructured or semi-structured data.

  1. Extract
  2. Load directly
  3. Transform inside system
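A tiny ETL sketch in Python, using an in-memory DataFrame as the source and SQLite as the target (both are stand-ins for real systems):

```python
import sqlite3

import pandas as pd

# Extract: read raw data (stand-in for a CSV file or an API)
raw = pd.DataFrame({"name": [" Ram ", "Sita"], "salary": ["50000", "60000"]})

# Transform: clean the data BEFORE it reaches the database
clean = raw.assign(
    name=raw["name"].str.strip(),      # remove stray whitespace
    salary=raw["salary"].astype(int),  # text -> integer
)

# Load: write the cleaned table into the database
conn = sqlite3.connect(":memory:")
clean.to_sql("employees", conn, index=False)
```

In ELT, the same transform step would instead run inside the target system (for example as SQL), after the raw data has been loaded.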

Feature Engineering

A feature is a column in the dataset.

Example columns: Discipline, Hard Work, Smart Work, Milk, Score

If drinking milk does not affect the Score, remove the Milk column.

Why remove features?

  • Less memory needed
  • Faster model
  • Less complexity
  • Better accuracy

Real Example:

  • Discipline ✅
  • Hard Work ✅
  • Smart Work ✅
  • Drink Milk ❌ (Not important)

So remove "Milk".
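A sketch of this decision in pandas. The numbers are invented so that milk has no correlation with the score:

```python
import pandas as pd

# Toy data: the score tracks discipline and hard work, not milk
df = pd.DataFrame({
    "discipline": [7, 8, 9, 6],
    "hard_work": [7, 9, 8, 6],
    "milk": [1, 1, 0, 0],
    "score": [70, 80, 90, 60],
})

# Inspect how strongly each column correlates with the score
corr_with_score = df.corr()["score"]
print(corr_with_score)  # milk's correlation is ~0 in this toy data

# Milk carries no signal, so drop the feature
df = df.drop(columns=["milk"])
```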

Data Normalization

Problem Example:

Data: 1, 2, 100

  • 1 to 2 = 1
  • 2 to 100 = 98 ❌

The model gets confused: the jump to 100 dwarfs the small gap between 1 and 2.

Solution: Normalize (scale everything to 0 to 1)

After min-max normalization: 0.00, 0.01, 1.00

Now all values are between 0 and 1 ✅
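Min-max scaling, one common way to normalize, needs only a few lines:

```python
# Min-max normalization: squeeze every value into the 0-to-1 range
values = [1, 2, 100]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print([round(v, 2) for v in normalized])  # [0.0, 0.01, 1.0]
```

The smallest value always maps to 0 and the largest to 1; everything else lands in between.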

Complete Data Life Cycle

  • Discover Questions
  • Acquire Data
  • Clean Data
  • Explore Data (EDA)
  • Feature Engineering
  • Build Machine Learning Model
  • Data Visualization
  • Repeat Cycle

It never stops. New data → Repeat again.

Simple Full Example (End-to-End)

Suppose you want to predict crop price.

Step 1: Ask Questions

  • Does rainfall affect price?
  • Does production affect price?

Step 2: Get Data

  • Rainfall
  • Production
  • Min price
  • Max price
  • Modal price

Step 3: Clean Data

  • Remove rows with missing rainfall values
  • Fix wrong prices

Step 4: Normalize Data

  • Scale rainfall
  • Scale production

Step 5: Feature Engineering

  • Remove unnecessary column
  • Create new feature (Price Range = Max - Min)

Step 6: Train Model

  • Linear Regression
  • Random Forest

Step 7: Visualize Results

  • Graph
  • Correlation heatmap
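The steps above can be strung together in one short sketch. All data values are invented, and the model is plain least squares standing in for a full ML library:

```python
import numpy as np
import pandas as pd

# Step 2: get (hypothetical) crop data
df = pd.DataFrame({
    "rainfall":   [100, 120, np.nan, 90, 110],
    "production": [50, 60, 55, 45, 52],
    "min_price":  [900, 950, 920, 880, 930],
    "max_price":  [1100, 1200, 1150, 1050, 1160],
})

# Step 3: clean - drop rows with missing rainfall
df = df.dropna(subset=["rainfall"])

# Step 4: normalize rainfall and production to the 0-to-1 range
for col in ["rainfall", "production"]:
    lo, hi = df[col].min(), df[col].max()
    df[col] = (df[col] - lo) / (hi - lo)

# Step 5: feature engineering - create the price range feature
df["price_range"] = df["max_price"] - df["min_price"]

# Step 6: fit a simple linear model (ordinary least squares)
X = np.column_stack([df["rainfall"], df["production"], np.ones(len(df))])
y = df["price_range"].to_numpy()
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
```

From here, Step 7 would plot `price_range` against rainfall, or draw a heatmap of `df.corr()`.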

[Diagram: Data Life Cycle]