Pradip G. Vanparia: Handling Missing Data

What are Missing Values?

Missing values means:

Blank values in dataset
Data that is not recorded
Data that is lost or not captured

Example:

If a table looks like this:

Name	Age	Salary
Ram	25	30000
Shyam		28000
Rita	30

Here:

Age of Shyam is missing
Salary of Rita is missing

These blank cells are called missing values.

Why Missing Values Occur?

1. Data Corruption

Data stored for long time
Virus or malware attack
File damaged
Some part of data deleted

Example:

Excel file corrupted
Some rows deleted automatically

2. Failure to Record Data

Sometimes data is not captured properly.

Example:

Sensor not working properly
Temperature sensor not recording
Mobile touch sensor not responding
Face recognition not detecting face

So data is not recorded → Missing value

3. Incomplete Extraction

Sometimes data is not fully extracted from source.

Example:

Online shopping website collecting user browsing data
One API stops working
Some browser data not captured

So when we download data → Some columns missing.

4. No Response

Very common in surveys.

Example:

Survey has 10 questions
User answers 9 questions
Skips 10th question because it is personal

So that question becomes blank → Missing value.

Challenges of Missing Data

If dataset contains missing values, we face problems.

1. Reduced Statistical Power

Example:

Total records = 1000
Missing records = 200

Now only 800 records usable.

Model accuracy reduces
Analysis becomes weak

2. Bias in Parameter Estimation

In Machine Learning:

Every column (feature) is important
If values are missing
Model cannot properly learn patterns
Wrong prediction
Biased results

3. Reduced Representativeness of Sample

Population = Entire data
Sample = Small part of population

Example:

India population = 100 crore
Survey done on 1000 people

If many answers are missing:

Sample does not properly represent population
Result becomes unreliable

Methods to Handle Missing Values

Method 1: Delete Rows with Missing Values

If missing values are very few:

Less than 5% → Safe to delete
More than 5% → Not recommended

Example:

df.dropna()

Use only when missing data is very small.

Method 2: Imputation (Substitution)

Imputation means:

Replace missing value with some other value.

There are two types:

A. Continuous Values (Numerical)

Examples:

Age
Salary
Price
Temperature
Speed

We can replace missing value using:

Mean (Average)
Median
Mode

Example:

If 100 products price = 99
One price missing

Replace missing value with average (99)

Example Code:

from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)

B. Categorical Values

Examples:

Gender (Male/Female)
Color (Red/Blue/Green)
Day/Night
First/Second/Third

We replace using:

Mode (Most frequent value)

Example:
If most people selected "Male"
Missing value → Replace with "Male"

Method 3: Use Algorithms That Support Missing Values

Some ML algorithms can handle missing values automatically.

Example:

Certain tree-based models

So we don't need to remove or fill manually.

Method 4: Predict Missing Values

We can:

Train model on existing data
Predict missing values

Example:

Predict missing salary based on age and experience

Method 5: Deep Learning Imputation

Advanced method.

Using:

Neural Networks
Deep Learning models

To predict and fill missing values.

Used in:

Large datasets
Complex data

Example of Crop Price Dataset

Your dataset may contain columns like:

Date
Crop_Name
Min_Price
Max_Price
Modal_Price
Rainfall
Temperature
Production
Yield

This type of dataset is used for Crop Price Forecasting.

Step 1: Open Google Colab

Go to: https://colab.research.google.com
Click New Notebook
Click Connect

Step 2: Import Library

import pandas as pd

Step 3: Upload Crop Dataset

Click folder icon → Upload → Select your file (example: crop_price.csv)

Then load dataset:

data = pd.read_csv('/content/crop_price.csv')
data

Step 4: Check Dataset Information

data.info()

This shows:

Total rows
Total columns
Missing values
Data types

What are Missing Values?

Missing values are:

NaN
Null
NA
Blank values

Example in Crop Dataset

Date        Min_Price   Max_Price   Rainfall
01-01-2023  4500        5200        23
02-01-2023  NaN         5100        25
03-01-2023  4400        NaN         NaN

Step 5: Identify Missing Values

Method 1

data.isna()

Method 2

data.isnull()

Both give the same result:

True → Missing value
False → No missing value

Step 6: Count Missing Values (Column-wise)

data.isna().sum()

Example Output

Column        Missing Values
Min_Price     5
Max_Price     3
Modal_Price   4
Rainfall      6
Temperature   2

Step 7: Drop Missing Values

If missing values are very few:

data.dropna()

Before: 1000 rows
After: 950 rows

Note: Not a permanent change.

Permanent Drop

data.dropna(inplace=True)

Important Parameters in dropna()

1. axis

axis=0 → Drop rows (default)
axis=1 → Drop columns

data.dropna(axis=1)

2. how

how='any' → Drop row if ANY null present
how='all' → Drop only if ALL values are null

data.dropna(how='all')

3. thresh

Keep row if minimum non-null values are present.

data.dropna(thresh=6)

Means: Keep row only if at least 6 values are non-null.

4. subset

Drop based on specific column only.

data.dropna(subset=['Modal_Price'])

Step 8: Fill (Impute) Missing Values

Instead of dropping, we can fill them.

1. Fill with Constant Value

data.fillna(0)

2. Fill with Mean (Best for Price Columns)

data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)

3. Fill with Median

data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)

4. Forward Fill (Time Series Data)

Very important for crop price prediction:

data.fillna(method='ffill', inplace=True)

It fills missing value with previous row value.

Interpolation (Recommended for Forecasting)

data.interpolate(method='linear', inplace=True)

This is better for price forecasting.

When to Drop vs Fill?

Situation	What to Do
Very few missing values	Drop
Many missing values	Fill
Time series data	Use forward fill / interpolation
Target column missing	Carefully handle

Why Missing Value Handling is Important?

Model gives wrong prediction
Biased results
Lower accuracy
Higher MSE, RMSE

Practical: Handling Missing Values in Crop Price Dataset (Full Example)

1️⃣ Import Library

import pandas as pd

pandas → Used for data manipulation.

pd → Short name (alias).

2️⃣ Load Excel Dataset

data = pd.read_excel('/content/crop_price.xlsx')
data.head()

read_excel() → Reads Excel file.

head() → Shows first 5 rows.

If error occurs:

!pip install openpyxl

3️⃣ Check Dataset Info

data.info()

Shows:

• Total rows

• Total columns

• Data types

• Missing values count

4️⃣ Identify Missing Values

data.isna()
data.isnull()

Returns True where value is missing.

Both functions are same.

5️⃣ Count Missing Values Column-wise

data.isna().sum()

sum() counts total missing values in each column.

6️⃣ Drop Missing Values

Basic Drop

data.dropna()

Removes rows that contain missing values.

Does NOT permanently change data.

Permanent Drop

data.dropna(inplace=True)

inplace=True → Changes original dataset.

Important Parameters of dropna()

data.dropna(axis=1)
data.dropna(how='all')
data.dropna(thresh=6)
data.dropna(subset=['Modal_Price'])

axis=0 → Drop rows (default)

axis=1 → Drop columns

how='any' → Drop if ANY value missing

how='all' → Drop if ALL values missing

thresh → Keep row only if minimum non-null values exist

subset → Drop rows only where specific column is missing

7️⃣ Fill Missing Values (Imputation)

1️⃣ Fill with Mean

data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)

Calculates average and replaces missing values.

Use when data is normally distributed and no extreme outliers.

2️⃣ Fill with Median

data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)

Calculates middle value.

Best for skewed data and outliers.

Difference Between Mean and Median

Mean	Median
Average value	Middle value
Affected by outliers	Not affected much
Good for normal data	Good for skewed data

8️⃣ Forward Fill (Time Series)

data.fillna(method='ffill', inplace=True)

Replaces missing value with previous row value.

Best for time-series data.

9️⃣ Interpolation

data.interpolate(method='linear', inplace=True)

Calculates value between two points.

Best for forecasting or continuous data.

Which Method is BEST?

There is NO single best method.

It depends on data distribution.

How to Decide Scientifically?

data['Rainfall'].skew()

Skew ≈ 0 → Use Mean

Skew > 1 or < -1 → Use Median

Then compare models using:

MAE
RMSE
R²

Whichever method gives lowest RMSE is BEST.

That is research-level answer 🔥

19/02/2026

Handling Missing Data

What are Missing Values?

Why Missing Values Occur?

1. Data Corruption

2. Failure to Record Data

3. Incomplete Extraction

4. No Response

Challenges of Missing Data

1. Reduced Statistical Power

2. Bias in Parameter Estimation

3. Reduced Representativeness of Sample

Methods to Handle Missing Values

Method 1: Delete Rows with Missing Values

Method 2: Imputation (Substitution)

There are two types:

A. Continuous Values (Numerical)

B. Categorical Values

Method 3: Use Algorithms That Support Missing Values

Method 4: Predict Missing Values

Method 5: Deep Learning Imputation

Example of Crop Price Dataset

Step 1: Open Google Colab

Step 2: Import Library

Step 3: Upload Crop Dataset

Step 4: Check Dataset Information

What are Missing Values?

Example in Crop Dataset

Step 5: Identify Missing Values

Method 1

Method 2

Step 6: Count Missing Values (Column-wise)

Example Output

Step 7: Drop Missing Values

Permanent Drop

Important Parameters in dropna()

1. axis

2. how

3. thresh

4. subset

Step 8: Fill (Impute) Missing Values

1. Fill with Constant Value

2. Fill with Mean (Best for Price Columns)

3. Fill with Median

4. Forward Fill (Time Series Data)

Interpolation (Recommended for Forecasting)

When to Drop vs Fill?

Why Missing Value Handling is Important?

Practical: Handling Missing Values in Crop Price Dataset (Full Example)

1️⃣ Import Library

2️⃣ Load Excel Dataset

3️⃣ Check Dataset Info

4️⃣ Identify Missing Values

5️⃣ Count Missing Values Column-wise

6️⃣ Drop Missing Values

Basic Drop

Permanent Drop

Important Parameters of dropna()

7️⃣ Fill Missing Values (Imputation)

1️⃣ Fill with Mean

2️⃣ Fill with Median

Difference Between Mean and Median

8️⃣ Forward Fill (Time Series)

9️⃣ Interpolation

Which Method is BEST?

How to Decide Scientifically?

Popular Posts

Followers