What are Missing Values?
Missing values means:
- Blank values in dataset
- Data that is not recorded
- Data that is lost or not captured
Example:
If a table looks like this:
| Name | Age | Salary |
|---|---|---|
| Ram | 25 | 30000 |
| Shyam | 28000 | |
| Rita | 30 |
Here:
- Age of Shyam is missing
- Salary of Rita is missing
Why Missing Values Occur?
1. Data Corruption
- Data stored for long time
- Virus or malware attack
- File damaged
- Some part of data deleted
Example:
- Excel file corrupted
- Some rows deleted automatically
2. Failure to Record Data
Sometimes data is not captured properly.
Example:
- Sensor not working properly
- Temperature sensor not recording
- Mobile touch sensor not responding
- Face recognition not detecting face
So data is not recorded → Missing value
3. Incomplete Extraction
Sometimes data is not fully extracted from source.
Example:
- Online shopping website collecting user browsing data
- One API stops working
- Some browser data not captured
So when we download data → Some columns missing.
4. No Response
Very common in surveys.
Example:
- Survey has 10 questions
- User answers 9 questions
- Skips 10th question because it is personal
So that question becomes blank → Missing value.
Challenges of Missing Data
If dataset contains missing values, we face problems.
1. Reduced Statistical Power
Example:
- Total records = 1000
- Missing records = 200
Now only 800 records usable.
- Model accuracy reduces
- Analysis becomes weak
2. Bias in Parameter Estimation
In Machine Learning:
- Every column (feature) is important
- If values are missing
- Model cannot properly learn patterns
- Wrong prediction
- Biased results
3. Reduced Representativeness of Sample
Population = Entire data
Sample = Small part of population
Example:
- India population = 100 crore
- Survey done on 1000 people
If many answers are missing:
- Sample does not properly represent population
- Result becomes unreliable
Methods to Handle Missing Values
Method 1: Delete Rows with Missing Values
If missing values are very few:
- Less than 5% → Safe to delete
- More than 5% → Not recommended
Example:
df.dropna()
Use only when missing data is very small.
Method 2: Imputation (Substitution)
Imputation means:
Replace missing value with some other value.
There are two types:
A. Continuous Values (Numerical)
Examples:
- Age
- Salary
- Price
- Temperature
- Speed
We can replace missing value using:
- Mean (Average)
- Median
- Mode
Example:
If 100 products price = 99
One price missing
Replace missing value with average (99)
Example Code:
from sklearn.impute import SimpleImputer import numpy as np imputer = SimpleImputer(strategy="mean") X = imputer.fit_transform(X)
B. Categorical Values
Examples:
- Gender (Male/Female)
- Color (Red/Blue/Green)
- Day/Night
- First/Second/Third
We replace using:
Mode (Most frequent value)
Example:
If most people selected "Male"
Missing value → Replace with "Male"
Method 3: Use Algorithms That Support Missing Values
Some ML algorithms can handle missing values automatically.
Example:
- Certain tree-based models
So we don't need to remove or fill manually.
Method 4: Predict Missing Values
We can:
- Train model on existing data
- Predict missing values
Example:
- Predict missing salary based on age and experience
Method 5: Deep Learning Imputation
Advanced method.
Using:
- Neural Networks
- Deep Learning models
To predict and fill missing values.
Used in:
- Large datasets
- Complex data
Example of Crop Price Dataset
Your dataset may contain columns like:
- Date
- Crop_Name
- Min_Price
- Max_Price
- Modal_Price
- Rainfall
- Temperature
- Production
- Yield
This type of dataset is used for Crop Price Forecasting.
Step 1: Open Google Colab
- Go to: https://colab.research.google.com
- Click New Notebook
- Click Connect
Step 2: Import Library
import pandas as pd
Step 3: Upload Crop Dataset
- Click folder icon → Upload → Select your file (example: crop_price.csv)
Then load dataset:
data = pd.read_csv('/content/crop_price.csv')
data
Step 4: Check Dataset Information
data.info()
This shows:
- Total rows
- Total columns
- Missing values
- Data types
What are Missing Values?
Missing values are:
- NaN
- Null
- NA
- Blank values
Example in Crop Dataset
Date Min_Price Max_Price Rainfall 01-01-2023 4500 5200 23 02-01-2023 NaN 5100 25 03-01-2023 4400 NaN NaN
Step 5: Identify Missing Values
Method 1
data.isna()
Method 2
data.isnull()
Both give the same result:
- True → Missing value
- False → No missing value
Step 6: Count Missing Values (Column-wise)
data.isna().sum()
Example Output
Column Missing Values Min_Price 5 Max_Price 3 Modal_Price 4 Rainfall 6 Temperature 2
Step 7: Drop Missing Values
If missing values are very few:
data.dropna()
Before: 1000 rows
After: 950 rows
Note: Not a permanent change.
Permanent Drop
data.dropna(inplace=True)
Important Parameters in dropna()
1. axis
- axis=0 → Drop rows (default)
- axis=1 → Drop columns
data.dropna(axis=1)
2. how
- how='any' → Drop row if ANY null present
- how='all' → Drop only if ALL values are null
data.dropna(how='all')
3. thresh
Keep row if minimum non-null values are present.
data.dropna(thresh=6)
Means: Keep row only if at least 6 values are non-null.
4. subset
Drop based on specific column only.
data.dropna(subset=['Modal_Price'])
Step 8: Fill (Impute) Missing Values
Instead of dropping, we can fill them.
1. Fill with Constant Value
data.fillna(0)
2. Fill with Mean (Best for Price Columns)
data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)
3. Fill with Median
data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)
4. Forward Fill (Time Series Data)
Very important for crop price prediction:
data.fillna(method='ffill', inplace=True)
It fills missing value with previous row value.
Interpolation (Recommended for Forecasting)
data.interpolate(method='linear', inplace=True)
This is better for price forecasting.
When to Drop vs Fill?
| Situation | What to Do |
|---|---|
| Very few missing values | Drop |
| Many missing values | Fill |
| Time series data | Use forward fill / interpolation |
| Target column missing | Carefully handle |
Why Missing Value Handling is Important?
- Model gives wrong prediction
- Biased results
- Lower accuracy
- Higher MSE, RMSE
Practical: Handling Missing Values in Crop Price Dataset (Full Example)
1️⃣ Import Library
import pandas as pd
pd → Short name (alias).
2️⃣ Load Excel Dataset
data = pd.read_excel('/content/crop_price.xlsx')
data.head()
head() → Shows first 5 rows.
If error occurs:
!pip install openpyxl
3️⃣ Check Dataset Info
data.info()
• Total rows
• Total columns
• Data types
• Missing values count
4️⃣ Identify Missing Values
data.isna() data.isnull()
Both functions are same.
5️⃣ Count Missing Values Column-wise
data.isna().sum()
6️⃣ Drop Missing Values
Basic Drop
data.dropna()
Does NOT permanently change data.
Permanent Drop
data.dropna(inplace=True)
Important Parameters of dropna()
data.dropna(axis=1) data.dropna(how='all') data.dropna(thresh=6) data.dropna(subset=['Modal_Price'])
axis=1 → Drop columns
how='any' → Drop if ANY value missing
how='all' → Drop if ALL values missing
thresh → Keep row only if minimum non-null values exist
subset → Drop rows only where specific column is missing
7️⃣ Fill Missing Values (Imputation)
1️⃣ Fill with Mean
data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)
Use when data is normally distributed and no extreme outliers.
2️⃣ Fill with Median
data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)
Best for skewed data and outliers.
Difference Between Mean and Median
| Mean | Median |
|---|---|
| Average value | Middle value |
| Affected by outliers | Not affected much |
| Good for normal data | Good for skewed data |
8️⃣ Forward Fill (Time Series)
data.fillna(method='ffill', inplace=True)
Best for time-series data.
9️⃣ Interpolation
data.interpolate(method='linear', inplace=True)
Best for forecasting or continuous data.
Which Method is BEST?
It depends on data distribution.
How to Decide Scientifically?
data['Rainfall'].skew()
Skew > 1 or < -1 → Use Median
Then compare models using:
- MAE
- RMSE
- R²
That is research-level answer 🔥
