Pradip G. Vanparia: Encoding Categorical Data

What is Categorical Data?

Categorical data refers to data that is divided into groups or categories rather than numerical values.

Examples

A, B, C
Male / Female
Kids / Middle Age / Senior
Sweet / Salted
Red / Yellow

These values are labels or groups, not measurable quantities.

Data of this kind is known as Categorical Data.

Types of Categorical Data

Categorical data is mainly of two types:

Nominal Data
Ordinal Data

1. Nominal Data

Nominal data has:

No order
No ranking
No mathematical meaning

Examples

Male / Female
Cat / Dog / Horse
Red / Blue / Green

You cannot perform mathematical operations on them.
Male + Female ❌
Cat – Dog ❌

The categories are names only, with no quantity behind them. This is why it is called Nominal Data (nominal means "relating to names").

2. Ordinal Data

Ordinal data has a clear order or ranking.

Examples

Strongly Agree → Neutral → Strongly Disagree
First Rank → Second Rank → Third Rank

The categories follow a logical sequence. Therefore, it is called Ordinal Data.
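A ranking like this can be stated explicitly in pandas. The following is a minimal sketch, assuming a small made-up list of survey responses; the variable names `levels`, `responses`, and `ordered` are illustrative only.

```python
import pandas as pd

# The ranking is written out by hand, from lowest to highest.
levels = ["Strongly Disagree", "Neutral", "Strongly Agree"]

# Hypothetical survey responses (not from a real dataset).
responses = pd.Series(["Neutral", "Strongly Agree", "Strongly Disagree"])

# ordered=True tells pandas that the categories have a ranking.
ordered = pd.Categorical(responses, categories=levels, ordered=True)

print(list(ordered.codes))           # position of each response in `levels`: [1, 2, 0]
print(ordered.min(), ordered.max())  # lowest and highest response present
```

Because the order is declared, comparisons such as minimum and maximum are meaningful here, which would not be the case for nominal data.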

Why Do We Need Encoding?

Most Machine Learning models cannot work with text values directly.
They operate on numerical values such as 0, 1, 2, etc.
Therefore, we must convert categorical data into numbers before training.

This conversion process is called Encoding.
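As a small illustration of why the conversion is needed, the sketch below tries to turn category names into floating-point numbers with NumPy. The array `animals` and the flag `converted` are names chosen for this example only.

```python
import numpy as np

animals = np.array(["Cat", "Dog", "Horse"])

try:
    # Numeric algorithms expect numbers; text cannot be cast to float.
    animals.astype(float)
    converted = True
except ValueError:
    converted = False

print(converted)  # False: the text values have no numeric meaning
```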

Encoding Techniques

The two most commonly used encoding techniques are:

One Hot Encoding
Label Encoding

Other methods include Ordinal Encoding and Target Encoding.

One Hot Encoding

In One Hot Encoding:

Each category gets a separate column.
Values are represented using 0 and 1.
It creates a binary representation.

Example: Animal Dataset

Original Data

Index	Animal
1	Cat
2	Dog
3	Horse

After One Hot Encoding

Index	Cat	Dog	Horse
1	1	0	0
2	0	1	0
3	0	0	1

If there are n categories, One Hot Encoding creates n new columns.
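To make the mechanics concrete, here is a minimal sketch of One Hot Encoding written in plain Python, without any library. It is only meant to show the idea; the names `categories` and `one_hot` are illustrative.

```python
animals = ["Cat", "Dog", "Horse"]

# One column per distinct category, in sorted order.
categories = sorted(set(animals))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
one_hot = [[1 if value == cat else 0 for cat in categories] for value in animals]

print(categories)
for row in one_hot:
    print(row)
```

With three categories the result has three columns, matching the table above.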

Label Encoding

Label Encoding converts each category into a unique number.

Original Data

Index	Animal
1	Cat
2	Dog
3	Horse

After Label Encoding

Index	Animal
1	0
2	1
3	2

Cat → 0
Dog → 1
Horse → 2
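The same mapping can be built by hand. This is a minimal sketch in plain Python, assuming categories are numbered in sorted order (which is also how scikit-learn's LabelEncoder assigns its codes); the extra "Dog" row is added only to show that repeated values share one code.

```python
animals = ["Cat", "Dog", "Horse", "Dog"]

# Sort the distinct categories, then number them starting from 0.
mapping = {cat: code for code, cat in enumerate(sorted(set(animals)))}

# Look up the code for every value in the column.
encoded = [mapping[a] for a in animals]

print(mapping)  # {'Cat': 0, 'Dog': 1, 'Horse': 2}
print(encoded)  # [0, 1, 2, 1]
```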

Difference Between One Hot Encoding and Label Encoding

Feature	One Hot Encoding	Label Encoding
Number of Columns	Multiple columns	Single column
Values	0 and 1	0, 1, 2...
Best Used For	Nominal Data	Ordinal Data (codes must follow the rank order)
Order Problem	No order issue	May create false ranking

Important Note

If Label Encoding is used for nominal data, the model may assume:

Horse (2) > Dog (1) > Cat (0)

But in reality, there is no ranking.

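For Nominal Data → Use One Hot Encoding
For Ordinal Data → Use Label Encoding, with codes assigned in rank order (scikit-learn's LabelEncoder numbers categories alphabetically, so OrdinalEncoder with an explicit category order is the safer choice)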
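One caveat when applying this rule with scikit-learn: LabelEncoder assigns codes in alphabetical order, which may not match the real ranking. The sketch below compares it with OrdinalEncoder, where the ranking is passed in by hand. The DataFrame `survey` and its `Response` column are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical survey column using the ordinal example from earlier.
survey = pd.DataFrame({"Response": ["Neutral", "Strongly Agree", "Strongly Disagree"]})

# LabelEncoder numbers categories alphabetically, ignoring their meaning.
alphabetical = LabelEncoder().fit_transform(survey["Response"])

# OrdinalEncoder accepts the intended ranking, lowest to highest.
ranking = [["Strongly Disagree", "Neutral", "Strongly Agree"]]
oe = OrdinalEncoder(categories=ranking)
ranked = oe.fit_transform(survey[["Response"]]).ravel()

print(alphabetical)  # [0 1 2]: Neutral comes first only because of spelling
print(ranked)        # [1. 2. 0.]: codes follow the stated ranking
```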

Practical Implementation in Python

One Hot Encoding

import pandas as pd

data = {'Animal': ['Cat', 'Dog', 'Horse']}
df = pd.DataFrame(data)

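# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
print(pd.get_dummies(df, dtype=int))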

Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Replace the text column with integer codes (categories are numbered in sorted order)
df['Animal'] = le.fit_transform(df['Animal'])
print(df)

Sample Dataset

import pandas as pd

data = {
    'Commodity': ['Cotton', 'Rice', 'Wheat', 'Cotton'],
    'Year': [2005, 2005, 2006, 2007],
    'Month': [9, 10, 11, 12],
    'Price': [1992, 1500, 1800, 2100]
}

df = pd.DataFrame(data)

print(df)

Output

Commodity	Year	Month	Price
Cotton	2005	9	1992
Rice	2005	10	1500
Wheat	2006	11	1800
Cotton	2007	12	2100

🔹 PART 1: Label Encoding

Step 1: Import Library

from sklearn.preprocessing import LabelEncoder

Step 2: Create Encoder Object

le = LabelEncoder()

Step 3: Apply Encoding

df['Commodity_Label'] = le.fit_transform(df['Commodity'])

Function Meaning

Function	Meaning
fit()	Learns unique categories
transform()	Converts categories to numbers
fit_transform()	Performs both steps
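The difference between the three functions is easier to see when fit() and transform() are called separately. The following is a minimal sketch; the lists `train` and `new` and the object `le_demo` are names invented for this example.

```python
from sklearn.preprocessing import LabelEncoder

train = ["Cotton", "Rice", "Wheat", "Cotton"]
new = ["Wheat", "Cotton"]

le_demo = LabelEncoder()
le_demo.fit(train)                        # learn the categories once

codes = le_demo.transform(new)            # reuse the same mapping on other data
names = le_demo.inverse_transform(codes)  # convert the numbers back to text

print(list(codes))  # [2, 0]
print(list(names))  # ['Wheat', 'Cotton']
```

Calling fit() once and transform() afterwards keeps the mapping identical for every dataset that is encoded later.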

Mapping

print(dict(zip(le.classes_, le.transform(le.classes_))))

Important: Label Encoding implies an order (0 < 1 < 2) that the commodities do not have. Models that treat the codes as quantities, such as Linear Regression or KNN, may be misled by it.
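The effect on a distance-based model such as KNN can be checked with a few lines of NumPy. This is a sketch under the assumption that the codes are Cotton=0, Rice=1, Wheat=2 (the sorted order LabelEncoder produces for this column); the helper `dist` is written for this example only.

```python
import numpy as np

# Label codes: one number per commodity.
label = {
    "Cotton": np.array([0.0]),
    "Rice": np.array([1.0]),
    "Wheat": np.array([2.0]),
}

# One-hot codes: one column per commodity.
one_hot = {
    "Cotton": np.array([1.0, 0.0, 0.0]),
    "Rice": np.array([0.0, 1.0, 0.0]),
    "Wheat": np.array([0.0, 0.0, 1.0]),
}

def dist(codes, a, b):
    """Euclidean distance between two encoded categories."""
    return float(np.linalg.norm(codes[a] - codes[b]))

# With label codes, Wheat looks twice as far from Cotton as Rice does.
print(dist(label, "Cotton", "Rice"), dist(label, "Cotton", "Wheat"))      # 1.0 2.0

# With one-hot codes, every pair of categories is equally far apart.
print(dist(one_hot, "Cotton", "Rice"), dist(one_hot, "Cotton", "Wheat"))
```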

🔹 PART 2: One-Hot Encoding (Using Pandas)

# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
df_onehot = pd.get_dummies(df, columns=['Commodity'], dtype=int)

Function Meaning

Function	Meaning
get_dummies()	Creates binary columns
columns=['Commodity']	Apply encoding to this column

Example Representation

Commodity	Commodity_Cotton	Commodity_Rice	Commodity_Wheat
Cotton	1	0	0
Rice	0	1	0
Wheat	0	0	1

🔹 PART 3: Using Sklearn OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)

encoded = ohe.fit_transform(df[['Commodity']])

encoded_df = pd.DataFrame(encoded, 
                          columns=ohe.get_feature_names_out())

print(encoded_df)

Function Explanation

Function	Meaning
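OneHotEncoder()	Creates encoder object
sparse_output=False	Returns a dense NumPy array instead of a sparse matrix (scikit-learn 1.2 or newer; older versions use sparse=False)
fit_transform()	Learns the categories and converts them in one step
get_feature_names_out()	Gets new column names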

🎯 Final Comparison

Feature	Label Encoding	One-Hot Encoding
Output	1 column	Multiple columns
Memory Usage	Low	Higher
Adds Order	Yes ❌	No ✅
Good For	Tree Models	Linear, KNN, SVM

🚀 When To Use What?

Model	Best Encoding
Linear Regression	One-Hot
Logistic Regression	One-Hot
KNN	One-Hot
Random Forest	Label Encoding
XGBoost	Label Encoding
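For the linear-model rows of the table, the encoding step and the model can be combined so that the same encoding is always applied before prediction. The following is one possible sketch using the sample dataset from this post with ColumnTransformer and Pipeline from scikit-learn; the step names "commodity", "encode", and "regress" are arbitrary labels chosen here, and four rows are far too few for a meaningful model.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame({
    "Commodity": ["Cotton", "Rice", "Wheat", "Cotton"],
    "Year": [2005, 2005, 2006, 2007],
    "Month": [9, 10, 11, 12],
    "Price": [1992, 1500, 1800, 2100],
})

X = df_demo[["Commodity", "Year", "Month"]]
y = df_demo["Price"]

# One-hot encode Commodity; pass Year and Month through unchanged.
preprocess = ColumnTransformer(
    transformers=[("commodity", OneHotEncoder(handle_unknown="ignore"), ["Commodity"])],
    remainder="passthrough",
)

model = Pipeline([("encode", preprocess), ("regress", LinearRegression())])
model.fit(X, y)

# 3 one-hot columns + Year + Month = 5 features per row.
encoded_shape = model.named_steps["encode"].transform(X).shape
print(encoded_shape)  # (4, 5)
```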

19/02/2026
