What is Data Wrangling?
Data wrangling means cleaning and preparing raw data so that it becomes useful for analysis and machine learning.
It is also called:
- Data Preprocessing
- Data Cleaning
- Data Munging
Example
Imagine you get raw data like this:
| Name | Age | Salary |
|---|---|---|
| Ram | 25 | 50000 |
| Sita | | 60000 |
| John | 300 | 70000 |
Problems
- Age missing for Sita ❌
- Age 300 for John (wrong value) ❌
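A minimal pandas sketch (the tiny frame below just copies the table above) that surfaces both problems automatically:

```python
import pandas as pd

# Raw data copied from the table above; None marks Sita's missing age
df = pd.DataFrame({
    "Name": ["Ram", "Sita", "John"],
    "Age": [25, None, 300],
    "Salary": [50000, 60000, 70000],
})

print(df.isna().sum())       # missing values per column -> Age: 1
print(df[df["Age"] > 120])   # impossible ages -> John, 300
```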
Complete Process of Data Wrangling
Step 1: Discovering (Understanding the Data)
What we do:
- Check where data comes from
- Understand columns
- Understand data types
- Ask questions about data
Example:
If you get sales data, ask:
- How many sales happened?
- Which product sold more?
- Which month had the highest sales?
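A quick way to do this discovery step in pandas (the file name `sales.csv` and its columns are assumptions, not from the original data):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical sales file

print(df.shape)       # how many rows (sales) and columns
print(df.dtypes)      # data type of every column
print(df.head())      # first few rows, to understand the columns
print(df.describe())  # quick summary statistics
```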
Step 2: Cleaning
What we fix:
- Missing values
- Wrong values
- Duplicate rows
- Very large or very small values (outliers)
- Format issues
Handling Missing Values
| Name | Marks |
|---|---|
| A | 80 |
| B | |
| C | 75 |
What to do?
If many values are missing:
- Fill with Mean
- Fill with Median
- Fill with Mode
If only one row is missing:
- Delete that row
A small code sketch of both options follows below.
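Here is that sketch in pandas (the Name/Marks frame is the table above):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [80, None, 75]})

# Option 1: fill the missing mark with the mean (median/mode work the same way)
df_filled = df.copy()
df_filled["Marks"] = df_filled["Marks"].fillna(df_filled["Marks"].mean())

# Option 2: only one row is missing, so simply drop it
df_dropped = df.dropna(subset=["Marks"])
```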
Step 3: Data Validation
What we check:
- Is data correct?
- Is data logical?
- Is data trustworthy?
Example:
- Age cannot be 300
- Salary cannot be negative
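A hedged sketch of such validation rules (the toy frame and the cut-off of 120 years are assumptions; pick limits that make sense for your own data):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ram", "Sita", "John"],
    "Age": [25, 32, 300],
    "Salary": [50000, 60000, -70000],
})

# Flag rows that break simple business rules
bad_age = df[(df["Age"] < 0) | (df["Age"] > 120)]  # age 300 is not logical
bad_salary = df[df["Salary"] < 0]                   # salary cannot be negative

print(bad_age)
print(bad_salary)
```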
Step 4: Structuring
We format data properly so that:
- Machine learning models can use it
- No confusion in columns
- Proper structure
Example:
- Convert text date into datetime format
- Rename columns properly
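Both structuring fixes are short in pandas (the column names below are assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({"order date": ["2024-01-05", "2024-02-10"], "AMT": [250, 400]})

# Convert the text date into a real datetime column
df["order date"] = pd.to_datetime(df["order date"])

# Rename columns to clear, consistent names
df = df.rename(columns={"order date": "order_date", "AMT": "amount"})

print(df.dtypes)
```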
EDA (Exploratory Data Analysis)
EDA means exploring data to understand it better, like exploring a new city.
EDA Cycle
- Ask Question
- Check Data
- Find Answer
- Ask New Question
- Repeat
Example: Sales Data
- First Question: How many sales this year?
- After checking: Sales increased!
- Second Question: Why did sales increase?
Maybe:
- Marketing improved
- New product launched
- Festival season
This cycle continues.
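One pass of this cycle might look like the sketch below (the `sales.csv` file and its `month`, `product`, `amount` columns are assumptions):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical sales data

# Question 1: how many sales this year?
print(len(df))

# Question 2: which product sold more?
print(df.groupby("product")["amount"].sum().sort_values(ascending=False))

# Question 3: which month had the highest sales?
print(df.groupby("month")["amount"].sum().idxmax())
```

Each answer usually raises the next question, which is the whole point of EDA.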
ETL and ELT
ETL = Extract → Transform → Load
Used for structured data.
- Extract data from source
- Transform (clean, modify)
- Load into database
ELT = Extract → Load → Transform
Used for unstructured or semi-structured data.
- Extract
- Load directly
- Transform inside system
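A tiny ETL sketch in Python: extract from a CSV, transform in pandas, load into SQLite (the file, column, and table names are all assumptions):

```python
import sqlite3
import pandas as pd

# Extract: read raw data from the source
df = pd.read_csv("raw_sales.csv")  # hypothetical source file

# Transform: clean and modify before loading
df = df.dropna()
df["amount"] = df["amount"].astype(float)

# Load: write the cleaned data into a database table
with sqlite3.connect("warehouse.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)
```

In ELT, the load step would come first and the cleaning would happen inside the target system instead.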
Feature Engineering
Feature = Column in dataset
| Discipline | Hard Work | Smart Work | Milk | Score |
|---|---|---|---|---|
If Milk does not affect Score, remove the Milk column.
Why remove features?
- Less memory needed
- Faster model
- Less complexity
- Better accuracy
Real Example:
- Discipline ✅
- Hard Work ✅
- Smart Work ✅
- Drink Milk ❌ (Not important)
So remove "Milk".
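Dropping the unhelpful column is one line in pandas (the toy frame below is an assumed example matching the table above):

```python
import pandas as pd

df = pd.DataFrame({
    "Discipline": [8, 6, 9],
    "Hard Work": [7, 5, 8],
    "Smart Work": [6, 7, 9],
    "Milk": [1, 0, 1],
    "Score": [82, 65, 90],
})

# Milk has no real effect on Score, so drop it
df = df.drop(columns=["Milk"])
```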
Data Normalization
Problem Example:
Data: 1, 2, 100
- 1 to 2 = 1
- 2 to 100 = 98 ❌
Model gets confused.
Solution: Normalize (0 to 1)
After min-max normalization: 0.0, 0.01, 1.0
Now all values between 0 and 1 ✅
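A minimal min-max scaling sketch in plain Python; scikit-learn's `MinMaxScaler` does the same thing on whole columns:

```python
values = [1, 2, 100]

lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.0101..., 1.0] -- every value now sits between 0 and 1
```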
Complete Data Life Cycle
- Discover Questions
- Acquire Data
- Clean Data
- Explore Data (EDA)
- Feature Engineering
- Build Machine Learning Model
- Data Visualization
- Repeat Cycle
It never stops. New data → Repeat again.
Simple Full Example (End-to-End)
Suppose you want to predict crop price.
Step 1: Ask Questions
- Does rainfall affect price?
- Does production affect price?
Step 2: Get Data
- Rainfall
- Production
- Min price
- Max price
- Modal price
Step 3: Clean Data
- Remove missing rainfall values
- Fix wrong prices
Step 4: Normalize Data
- Scale rainfall
- Scale production
Step 5: Feature Engineering
- Remove unnecessary column
- Create new feature (Price Range = Max - Min)
Step 6: Train Model
- Linear Regression
- Random Forest
Step 7: Visualize Results
- Graph
- Correlation heatmap
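A condensed sketch of these seven steps in Python (the `crop_prices.csv` file and every column name are assumptions, not a real dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Steps 2-3: get data and clean it
df = pd.read_csv("crop_prices.csv")       # hypothetical dataset
df = df.dropna(subset=["rainfall"])       # remove missing rainfall values
df = df[df["modal_price"] > 0]            # drop rows with wrong prices

# Step 4: normalize rainfall and production to the 0-1 range
scaler = MinMaxScaler()
df[["rainfall", "production"]] = scaler.fit_transform(df[["rainfall", "production"]])

# Step 5: feature engineering -- a new feature from existing columns
df["price_range"] = df["max_price"] - df["min_price"]

# Step 6: train a simple model
X = df[["rainfall", "production", "price_range"]]
y = df["modal_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on test data:", model.score(X_test, y_test))

# Step 7: visualize results with matplotlib/seaborn (graphs, correlation heatmap)
```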
