19/02/2026

Handling Missing Data

What are Missing Values?

Missing values means:

  • Blank values in dataset
  • Data that is not recorded
  • Data that is lost or not captured

Example:

If a table looks like this:

Name Age Salary
Ram 25 30000
Shyam 28000
Rita 30

Here:

  • Age of Shyam is missing
  • Salary of Rita is missing
These blank cells are called missing values.

Why Missing Values Occur?

1. Data Corruption

  • Data stored for long time
  • Virus or malware attack
  • File damaged
  • Some part of data deleted

Example:

  • Excel file corrupted
  • Some rows deleted automatically

2. Failure to Record Data

Sometimes data is not captured properly.

Example:

  • Sensor not working properly
  • Temperature sensor not recording
  • Mobile touch sensor not responding
  • Face recognition not detecting face

So data is not recorded → Missing value

3. Incomplete Extraction

Sometimes data is not fully extracted from source.

Example:

  • Online shopping website collecting user browsing data
  • One API stops working
  • Some browser data not captured

So when we download data → Some columns missing.

4. No Response

Very common in surveys.

Example:

  • Survey has 10 questions
  • User answers 9 questions
  • Skips 10th question because it is personal

So that question becomes blank → Missing value.

Challenges of Missing Data

If dataset contains missing values, we face problems.

1. Reduced Statistical Power

Example:

  • Total records = 1000
  • Missing records = 200

Now only 800 records usable.

  • Model accuracy reduces
  • Analysis becomes weak

2. Bias in Parameter Estimation

In Machine Learning:

  • Every column (feature) is important
  • If values are missing
  • Model cannot properly learn patterns
  • Wrong prediction
  • Biased results

3. Reduced Representativeness of Sample

Population = Entire data
Sample = Small part of population

Example:

  • India population = 100 crore
  • Survey done on 1000 people

If many answers are missing:

  • Sample does not properly represent population
  • Result becomes unreliable

Methods to Handle Missing Values

Method 1: Delete Rows with Missing Values

If missing values are very few:

  • Less than 5% → Safe to delete
  • More than 5% → Not recommended

Example:

df.dropna()

Use only when missing data is very small.

Method 2: Imputation (Substitution)

Imputation means:

Replace missing value with some other value.

There are two types:

A. Continuous Values (Numerical)

Examples:

  • Age
  • Salary
  • Price
  • Temperature
  • Speed

We can replace missing value using:

  • Mean (Average)
  • Median
  • Mode

Example:

If 100 products price = 99
One price missing

Replace missing value with average (99)

Example Code:

from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(strategy="mean")
X = imputer.fit_transform(X)

B. Categorical Values

Examples:

  • Gender (Male/Female)
  • Color (Red/Blue/Green)
  • Day/Night
  • First/Second/Third

We replace using:

Mode (Most frequent value)

Example:
If most people selected "Male"
Missing value → Replace with "Male"

Method 3: Use Algorithms That Support Missing Values

Some ML algorithms can handle missing values automatically.

Example:

  • Certain tree-based models

So we don't need to remove or fill manually.

Method 4: Predict Missing Values

We can:

  • Train model on existing data
  • Predict missing values

Example:

  • Predict missing salary based on age and experience

Method 5: Deep Learning Imputation

Advanced method.

Using:

  • Neural Networks
  • Deep Learning models

To predict and fill missing values.

Used in:

  • Large datasets
  • Complex data

Example of Crop Price Dataset

Your dataset may contain columns like:

  • Date
  • Crop_Name
  • Min_Price
  • Max_Price
  • Modal_Price
  • Rainfall
  • Temperature
  • Production
  • Yield

This type of dataset is used for Crop Price Forecasting.

Step 1: Open Google Colab

  • Go to: https://colab.research.google.com
  • Click New Notebook
  • Click Connect

Step 2: Import Library

import pandas as pd

Step 3: Upload Crop Dataset

  • Click folder icon → Upload → Select your file (example: crop_price.csv)

Then load dataset:

data = pd.read_csv('/content/crop_price.csv')
data

Step 4: Check Dataset Information

data.info()

This shows:

  • Total rows
  • Total columns
  • Missing values
  • Data types

What are Missing Values?

Missing values are:

  • NaN
  • Null
  • NA
  • Blank values

Example in Crop Dataset

Date        Min_Price   Max_Price   Rainfall
01-01-2023  4500        5200        23
02-01-2023  NaN         5100        25
03-01-2023  4400        NaN         NaN

Step 5: Identify Missing Values

Method 1

data.isna()

Method 2

data.isnull()

Both give the same result:

  • True → Missing value
  • False → No missing value

Step 6: Count Missing Values (Column-wise)

data.isna().sum()

Example Output

Column        Missing Values
Min_Price     5
Max_Price     3
Modal_Price   4
Rainfall      6
Temperature   2

Step 7: Drop Missing Values

If missing values are very few:

data.dropna()

Before: 1000 rows
After: 950 rows

Note: Not a permanent change.

Permanent Drop

data.dropna(inplace=True)

Important Parameters in dropna()

1. axis

  • axis=0 → Drop rows (default)
  • axis=1 → Drop columns
data.dropna(axis=1)

2. how

  • how='any' → Drop row if ANY null present
  • how='all' → Drop only if ALL values are null
data.dropna(how='all')

3. thresh

Keep row if minimum non-null values are present.

data.dropna(thresh=6)

Means: Keep row only if at least 6 values are non-null.

4. subset

Drop based on specific column only.

data.dropna(subset=['Modal_Price'])

Step 8: Fill (Impute) Missing Values

Instead of dropping, we can fill them.

1. Fill with Constant Value

data.fillna(0)

2. Fill with Mean (Best for Price Columns)

data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)

3. Fill with Median

data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)

4. Forward Fill (Time Series Data)

Very important for crop price prediction:

data.fillna(method='ffill', inplace=True)

It fills missing value with previous row value.

Interpolation (Recommended for Forecasting)

data.interpolate(method='linear', inplace=True)

This is better for price forecasting.

When to Drop vs Fill?

Situation What to Do
Very few missing values Drop
Many missing values Fill
Time series data Use forward fill / interpolation
Target column missing Carefully handle

Why Missing Value Handling is Important?

  • Model gives wrong prediction
  • Biased results
  • Lower accuracy
  • Higher MSE, RMSE

Practical: Handling Missing Values in Crop Price Dataset (Full Example)

1️⃣ Import Library

import pandas as pd
pandas → Used for data manipulation.
pd → Short name (alias).

2️⃣ Load Excel Dataset

data = pd.read_excel('/content/crop_price.xlsx')
data.head()
read_excel() → Reads Excel file.
head() → Shows first 5 rows.

If error occurs:

!pip install openpyxl

3️⃣ Check Dataset Info

data.info()
Shows:
• Total rows
• Total columns
• Data types
• Missing values count

4️⃣ Identify Missing Values

data.isna()
data.isnull()
Returns True where value is missing.
Both functions are same.

5️⃣ Count Missing Values Column-wise

data.isna().sum()
sum() counts total missing values in each column.

6️⃣ Drop Missing Values

Basic Drop

data.dropna()
Removes rows that contain missing values.
Does NOT permanently change data.

Permanent Drop

data.dropna(inplace=True)
inplace=True → Changes original dataset.

Important Parameters of dropna()

data.dropna(axis=1)
data.dropna(how='all')
data.dropna(thresh=6)
data.dropna(subset=['Modal_Price'])
axis=0 → Drop rows (default)
axis=1 → Drop columns

how='any' → Drop if ANY value missing
how='all' → Drop if ALL values missing

thresh → Keep row only if minimum non-null values exist

subset → Drop rows only where specific column is missing

7️⃣ Fill Missing Values (Imputation)

1️⃣ Fill with Mean

data['Min_Price'].fillna(data['Min_Price'].mean(), inplace=True)
Calculates average and replaces missing values.
Use when data is normally distributed and no extreme outliers.

2️⃣ Fill with Median

data['Rainfall'].fillna(data['Rainfall'].median(), inplace=True)
Calculates middle value.
Best for skewed data and outliers.

Difference Between Mean and Median

Mean Median
Average value Middle value
Affected by outliers Not affected much
Good for normal data Good for skewed data

8️⃣ Forward Fill (Time Series)

data.fillna(method='ffill', inplace=True)
Replaces missing value with previous row value.
Best for time-series data.

9️⃣ Interpolation

data.interpolate(method='linear', inplace=True)
Calculates value between two points.
Best for forecasting or continuous data.

Which Method is BEST?

There is NO single best method.
It depends on data distribution.

How to Decide Scientifically?

data['Rainfall'].skew()
Skew ≈ 0 → Use Mean
Skew > 1 or < -1 → Use Median

Then compare models using:

  • MAE
  • RMSE
Whichever method gives lowest RMSE is BEST.
That is research-level answer 🔥
Share This
Previous Post
Next Post