19/02/2026

Normalizing the Data

1️⃣ What is Normalization?

Normalization means bringing all values to the same scale.


2️⃣ Why is Normalization Needed?

Suppose a column has values:

1, 2, 100

Problem:

  • 100 is very large
  • 1 and 2 are very small
  • A scale-sensitive ML algorithm will give more weight to 100

So:

  • ❌ Large values dominate
  • ❌ Small values become less important

This can reduce the performance of models that rely on distances or gradients.
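
One way to see the effect is with a distance calculation across two features. The sketch below is illustrative only: the three rows, the month range of 1 to 12, and the price range of 1800 to 2200 are assumed values, not taken from a real dataset.

```python
import math

# Hypothetical rows of (month, price); the numbers are assumptions for illustration.
a = (1, 1900)
b = (12, 1910)   # very different month, similar price
c = (1, 1950)    # same month, somewhat different price

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def rescale(row):
    """Min-max scale a row using the assumed ranges: month 1-12, price 1800-2200."""
    month, price = row
    return ((month - 1) / 11, (price - 1800) / 400)

raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={scaled_ab:.2f}  d(a,c)={scaled_ac:.2f}")
```

On the raw values, b appears closer to a than c does, because the price column swamps the month column. After rescaling, the order flips and the month difference counts again.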


3️⃣ Solution → Rescaling

Convert values into a common range like:

  • ✔ 0 to 1
  • ✔ -1 to 1
  • ✔ Centered around 0 (mean 0, standard deviation 1)

This process is called Normalization / Rescaling.


4️⃣ Example of Rescaling

Original Values:

1, 2, 100

After Min-Max Scaling (0 to 1 range):

1   → 0.0000
2   → 0.0101
100 → 1.0000

  • ✔ All values are between 0 and 1
  • ✔ The column now shares a scale with other columns scaled the same way

5️⃣ Types of Normalization Techniques

There are 3 main techniques:

  • Maximum Absolute Scaling
  • Min-Max Scaling
  • Standardization (Z-score)

1️⃣ Maximum Absolute Scaling

Formula:

Xnew = X / max(|X|)

Meaning:

  • Divide every value by the largest absolute value in the column
  • Ignore the + or – sign only when finding that maximum; each value keeps its own sign
  • Final range = -1 to 1

Example:

Values: 8, 9, 10
Maximum = 10

8/10  = 0.8
9/10  = 0.9
10/10 = 1

Here the results fall between 0.8 and 1. In general the range is -1 to 1.
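
The same arithmetic can be sketched in plain Python. The function name max_abs_scale and the mixed-sign input are assumptions added for illustration; the second call shows how negative values keep their sign.

```python
def max_abs_scale(values):
    """Divide each value by the largest absolute value in the list."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        # All values are zero, so there is nothing to scale.
        return [0.0 for _ in values]
    return [v / max_abs for v in values]

scaled_positive = max_abs_scale([8, 9, 10])    # the example above
scaled_mixed = max_abs_scale([-20, 5, 10])     # hypothetical mixed-sign column

print(scaled_positive)
print(scaled_mixed)
```

With the mixed-sign input, -20 has the largest absolute value, so it maps to -1 and the other values land between -1 and 1.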

✅ When to Use?

  • ✔ When data contains positive and negative values
  • ✔ When you want zero values unchanged
  • ✔ Sparse data (text data, TF-IDF)
  • ✔ Values already centered around 0

❌ Not Good When:

  • Data has extreme outliers
  • Minimum value is important

2️⃣ Min-Max Scaling (Most Common)

Formula:

Xnew = (X - Xmin) / (Xmax - Xmin)

Range: 0 to 1

Example:

Values: 8, 9, 10

Minimum = 8
Maximum = 10
Range = 2

8  → (8-8)/2  = 0
9  → (9-8)/2  = 0.5
10 → (10-8)/2 = 1
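
A minimal plain-Python sketch of the same calculation follows. The function name min_max_scale and the handling of a constant column (returning zeros) are assumptions made for this example.

```python
def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no range; returning zeros is one possible convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([8, 9, 10])
print(scaled)
```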

✅ When to Use?

  • ✔ Neural Networks
  • ✔ KNN (distance-based models)
  • ✔ Features have different ranges
  • ✔ No extreme outliers
  • ✔ Image pixel data
  • ✔ Crop price data

❌ Not Good When:

  • Data has large outliers (compresses other values)
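
The compression effect can be checked with a short NumPy sketch. The value 1000 is a made-up outlier added to the 8, 9, 10 example, and the helper name min_max is an assumption.

```python
import numpy as np

def min_max(values):
    """Min-Max scale a 1-D sequence to the 0-1 range."""
    arr = np.asarray(values, dtype=float)
    return (arr - arr.min()) / (arr.max() - arr.min())

without_outlier = min_max([8, 9, 10])
with_outlier = min_max([8, 9, 10, 1000])   # 1000 is a hypothetical outlier

print(without_outlier)
print(with_outlier)
```

Without the outlier, the three values spread across the whole 0 to 1 range. With it, they are squeezed into a tiny interval near 0 and become hard to tell apart.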

3️⃣ Standardization (Z-Score)

Formula:

Z = (X - Mean) / Standard Deviation

Steps:

  • Find mean
  • Subtract mean from each value
  • Divide by standard deviation

Example:

Values: 8, 9, 10
Mean = 9
SD = 1 (sample standard deviation; the population SD is ≈ 0.816)

8  → (8-9)/1  = -1
9  → (9-9)/1  = 0
10 → (10-9)/1 = 1

  • ✔ Mean becomes 0
  • ✔ Standard deviation becomes 1
  • ✔ No fixed range
  • ✔ Data centered around 0
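
The result depends on which standard deviation is used. The sketch below uses Python's statistics module to compute both variants for 8, 9, 10; the function name z_scores and its sample flag are assumptions for this example.

```python
import statistics

def z_scores(values, sample=True):
    """Standardize values using the sample SD (n-1) or the population SD (n)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if sample else statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

z_sample = z_scores([8, 9, 10])                    # sample SD = 1
z_population = z_scores([8, 9, 10], sample=False)  # population SD is about 0.816

print(z_sample)
print([round(z, 2) for z in z_population])
```

Either way the standardized values have mean 0. The sample version gives -1, 0, 1, and the population version gives roughly -1.22, 0, 1.22.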

✅ When to Use?

  • ✔ Data follows normal distribution
  • ✔ Linear Regression
  • ✔ Logistic Regression
  • ✔ SVM
  • ✔ PCA
  • ✔ Data contains mild outliers (less compression than Min-Max, though the mean and SD are still affected)

6️⃣ Main Purpose of Normalization

  • ✔ Reduce difference between large and small values
  • ✔ Improve model accuracy
  • ✔ Faster training
  • ✔ Better distance calculation (KNN, SVM)

7️⃣ When to Use Which?

Technique         Range       When to Use
Max Absolute      -1 to 1     Sparse data
Min-Max           0 to 1      Neural Networks, KNN
Standardization   Around 0    Regression, SVM


8️⃣ Is Normalization Required for the Dataset?

Step 1️⃣: Analyze Columns

  • Commodity → Categorical (No normalization needed ❌)
  • Year → 2005 (Same value in every row here, usually not normalized ❌)
  • Month → 1–12 (Small range, optional ❌)
  • Min Price → ~1800–2000 ✅
  • Max Price → ~2000–2200 ✅
  • Modal Price → ~1900–2100 ✅

👉 Since price columns are numerical and may have different ranges across years, normalization is required for some ML models.

✔ Required If Using:

  • KNN
  • SVM
  • Neural Networks
  • Gradient Descent based models

❌ Not Compulsory For:

  • Decision Tree
  • Random Forest
  • XGBoost

Step 2️⃣: Create Dataset

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Commodity': ['Cotton', 'Cotton', 'Cotton', 'Cotton'],
    'Year': [2005, 2005, 2005, 2005],
    'Month': [9, 10, 11, 12],
    'Min Price (Rs./Quintal)': [1838, np.nan, 1922, 1955],
    'Max Price (Rs./Quintal)': [2080, 2136, 1997, 2038],
    'Modal Price (Rs./Quintal)': [1992, 1997, 1964, 2005]
})

print(df)

Step 3️⃣: Handle Missing Values

You have one missing value in Min Price.

df['Min Price (Rs./Quintal)'] = df['Min Price (Rs./Quintal)'].fillna(
    df['Min Price (Rs./Quintal)'].mean()
)

print(df)

Why mean?

  • Keeps the column mean unchanged
  • Suitable for numeric continuous data
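
As a quick check of what the fill does, the sketch below repeats it on the Min Price column alone. The variable names are assumptions; the values are the ones from the dataset above.

```python
import numpy as np
import pandas as pd

min_price = pd.Series([1838, np.nan, 1922, 1955], name="Min Price (Rs./Quintal)")

# Series.mean() skips NaN by default, so this averages the three known prices.
fill_value = min_price.mean()
filled = min_price.fillna(fill_value)

print(fill_value)        # mean of 1838, 1922, 1955
print(filled.tolist())
print(filled.mean())     # same as fill_value: the column mean is preserved
```

The missing month is filled with 1905.0, the average of the three observed prices. If the column had strong outliers, the median (min_price.median()) would be a possible alternative.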

Step 4️⃣: Select Numeric Columns

We do NOT normalize:

  • Commodity
  • Year
  • Month

numeric_cols = [
    'Min Price (Rs./Quintal)',
    'Max Price (Rs./Quintal)',
    'Modal Price (Rs./Quintal)'
]

Step 5️⃣: Apply Normalization Methods

1️⃣ Maximum Absolute Scaling

df_maxabs = df.copy()

for col in numeric_cols:
    df_maxabs[col] = df_maxabs[col] / df_maxabs[col].abs().max()

print(df_maxabs)

✔ Values fall between 0 and 1 because all prices are positive


2️⃣ Min-Max Scaling (Most Recommended)

df_minmax = df.copy()

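for col in numeric_cols:
    # Compute the column minimum and maximum once, then rescale to 0-1
    col_min = df_minmax[col].min()
    col_max = df_minmax[col].max()
    df_minmax[col] = (df_minmax[col] - col_min) / (col_max - col_min)

print(df_minmax)

  • ✔ Values between 0 and 1
  • ✔ A common choice for crop price prediction with neural networks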

3️⃣ Z-Score Standardization

df_zscore = df.copy()

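for col in numeric_cols:
    # pandas .std() uses the sample standard deviation (ddof=1) by default
    col_mean = df_zscore[col].mean()
    col_std = df_zscore[col].std()
    df_zscore[col] = (df_zscore[col] - col_mean) / col_std

print(df_zscore)

  • ✔ Mean ≈ 0
  • ✔ Some values negative
  • ✔ Good for normally distributed data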

Using scikit-learn (Recommended)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df_scaled)
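
The other two techniques also have scikit-learn equivalents, MaxAbsScaler and StandardScaler. The sketch below is a hedged comparison rather than a fixed recipe: it rebuilds only the three price columns (with the missing Min Price already filled with 1905, the mean of the observed values), and the helper apply_scaler is an assumption for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Price columns from the example dataset, missing value already filled.
prices = pd.DataFrame({
    "Min Price (Rs./Quintal)": [1838.0, 1905.0, 1922.0, 1955.0],
    "Max Price (Rs./Quintal)": [2080.0, 2136.0, 1997.0, 2038.0],
    "Modal Price (Rs./Quintal)": [1992.0, 1997.0, 1964.0, 2005.0],
})

def apply_scaler(scaler, frame):
    """Fit the scaler on the frame and return a DataFrame with the same labels."""
    return pd.DataFrame(scaler.fit_transform(frame),
                        columns=frame.columns, index=frame.index)

maxabs_sk = apply_scaler(MaxAbsScaler(), prices)
minmax_sk = apply_scaler(MinMaxScaler(), prices)
zscore_sk = apply_scaler(StandardScaler(), prices)

# Manual equivalents for comparison.
maxabs_manual = prices / prices.abs().max()
zscore_sample = (prices - prices.mean()) / prices.std()            # ddof=1 (pandas default)
zscore_population = (prices - prices.mean()) / prices.std(ddof=0)  # population SD

print(np.allclose(maxabs_sk, maxabs_manual))
print(np.allclose(zscore_sk, zscore_population))
print(np.allclose(zscore_sk, zscore_sample))
```

StandardScaler divides by the population standard deviation, so it matches the ddof=0 calculation and differs slightly from a manual z-score built on the pandas default .std(). On a four-row sample the gap is noticeable; on large datasets it becomes negligible.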

🔎 Final Conclusion for Your Crop Price Dataset

  • ✔ Normalize Min, Max, and Modal Price when using scale-sensitive models
  • ❌ Not required for Commodity, Year, Month
  • ✔ Use MinMaxScaler for LSTM, ANN, KNN
  • ✔ Use StandardScaler for Linear Regression, Logistic Regression, SVM
  • ✔ Not required for Random Forest



