19/02/2026

Encoding Categorical Data

What is Categorical Data?

Categorical data refers to data that is divided into groups or categories rather than numerical values.

Examples

  • A, B, C
  • Male / Female
  • Kids / Middle Age / Senior
  • Sweet / Salted
  • Red / Yellow

These are labels or groups, not measured quantities.

Data of this kind is known as Categorical Data.

Types of Categorical Data

Categorical data is mainly of two types:

  • Nominal Data
  • Ordinal Data

1. Nominal Data

Nominal data has:

  • No order
  • No ranking
  • No mathematical meaning

Examples

  • Male / Female
  • Cat / Dog / Horse
  • Red / Blue / Green

You cannot perform meaningful arithmetic on them.
Male + Female ❌
Cat – Dog ❌

Because the values act only as names, this type is called Nominal Data (from the Latin nomen, "name").

2. Ordinal Data

Ordinal data has a clear order or ranking.

Examples

  • Strongly Agree → Neutral → Strongly Disagree
  • First Rank → Second Rank → Third Rank

The categories follow a logical sequence. Therefore, it is called Ordinal Data.
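The difference between the two types can be made concrete with pandas, which lets a categorical column be declared ordered or unordered. The following is a minimal sketch; the variable names are illustrative, and the age groups reuse the Kids / Middle Age / Senior example from earlier.

```python
import pandas as pd

# Nominal: the categories are just names, so no order is declared.
animals = pd.Categorical(["Cat", "Dog", "Horse"], ordered=False)

# Ordinal: the category list states the order, lowest to highest.
age_groups = pd.Categorical(
    ["Senior", "Kids", "Middle Age"],
    categories=["Kids", "Middle Age", "Senior"],
    ordered=True,
)

print(animals.ordered)                       # False
print(age_groups.ordered)                    # True
print(age_groups.min(), age_groups.max())    # Kids Senior
print(list(age_groups.codes))                # [2, 0, 1]
```

Only the ordered column supports comparisons such as min and max, because only there does "smaller" or "larger" mean anything.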

Why Do We Need Encoding?

  • Most Machine Learning models cannot work directly with text values.
  • They operate on numerical input such as 0, 1, 2, etc.
  • Therefore, we must convert categorical data into numbers before training.

This conversion process is called Encoding.
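To see why encoding is needed, here is a small sketch that tries to fit a scikit-learn model on a raw text column. The target values are made up for illustration, and the exact error message depends on the library version, so it is printed rather than assumed.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"Animal": ["Cat", "Dog", "Horse"]})
y = [4.0, 20.0, 500.0]  # hypothetical target values, for illustration only

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except (ValueError, TypeError) as err:
    # The model cannot turn 'Cat', 'Dog', 'Horse' into numbers on its own.
    fit_failed = True
    print("Fit failed:", err)
```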

Encoding Techniques

The two most commonly used encoding techniques are:

  • One Hot Encoding
  • Label Encoding

Other methods include Ordinal Encoding and Target Encoding.

One Hot Encoding

In One Hot Encoding:

  • Each category gets a separate column.
  • Values are represented using 0 and 1.
  • It creates a binary representation.

Example: Animal Dataset

Original Data

Index Animal
1 Cat
2 Dog
3 Horse

After One Hot Encoding

Index Cat Dog Horse
1 1 0 0
2 0 1 0
3 0 0 1

If there are n categories, One Hot Encoding creates n new columns.
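The mechanism is simple enough to write by hand. The sketch below is not how any particular library implements it; the function name one_hot and the choice to sort the categories are assumptions made for illustration.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 list with one slot per category."""
    categories = sorted(set(values))              # one column per distinct category
    column_of = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)               # start with all zeros
        row[column_of[value]] = 1                 # mark the matching column
        rows.append(row)
    return categories, rows

categories, rows = one_hot(["Cat", "Dog", "Horse"])
print(categories)   # ['Cat', 'Dog', 'Horse']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 1, 0]
# [0, 0, 1]
```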

Label Encoding

Label Encoding converts each category into a unique number.

Original Data

Index Animal
1 Cat
2 Dog
3 Horse

After Label Encoding

Index Animal
1 0
2 1
3 2

Cat → 0
Dog → 1
Horse → 2
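The mapping above can be reproduced in a few lines of plain Python. This is a sketch only: the helper name label_encode is hypothetical, and numbering the categories in sorted order is an assumption chosen to match the Cat, Dog, Horse example.

```python
def label_encode(values):
    """Return (mapping, codes): one integer per distinct category."""
    categories = sorted(set(values))                      # fixed, repeatable order
    mapping = {cat: code for code, cat in enumerate(categories)}
    codes = [mapping[value] for value in values]
    return mapping, codes

mapping, codes = label_encode(["Cat", "Dog", "Horse", "Dog"])
print(mapping)   # {'Cat': 0, 'Dog': 1, 'Horse': 2}
print(codes)     # [0, 1, 2, 1]
```

The numbers are identifiers only; nothing about the data says Horse is "twice" Dog.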

Difference Between One Hot Encoding and Label Encoding

Feature | One Hot Encoding | Label Encoding
Number of Columns | Multiple columns | Single column
Values | 0 and 1 | 0, 1, 2...
Best Used For | Nominal Data | Ordinal Data (codes assigned in rank order)
Order Problem | No order issue | May create false ranking

Important Note

If Label Encoding is used for nominal data, the model may assume:

Horse (2) > Dog (1) > Cat (0)

But in reality, there is no ranking.

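  • For Nominal Data → Use One Hot Encoding
  • For Ordinal Data → Use Label Encoding, with the codes assigned in the true rank order (scikit-learn's LabelEncoder numbers categories alphabetically, so check the mapping or use OrdinalEncoder with an explicit order)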
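One caveat when applying this rule in scikit-learn: LabelEncoder assigns codes in sorted (alphabetical) order, which need not match the real ranking. The sketch below uses the agreement scale from the ordinal examples and compares LabelEncoder with OrdinalEncoder given an explicit order. The column name Answer and the sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

survey = pd.DataFrame(
    {"Answer": ["Neutral", "Strongly Agree", "Strongly Disagree", "Neutral"]}
)

# LabelEncoder sorts alphabetically:
# Neutral=0, Strongly Agree=1, Strongly Disagree=2 (not the real ranking).
alphabetical = LabelEncoder().fit_transform(survey["Answer"])

# OrdinalEncoder accepts the intended order, lowest to highest.
rank_order = [["Strongly Disagree", "Neutral", "Strongly Agree"]]
encoder = OrdinalEncoder(categories=rank_order)
ranked = encoder.fit_transform(survey[["Answer"]]).ravel()

print(alphabetical)   # [0 1 2 0]
print(ranked)         # [1. 2. 0. 1.]
```

With the explicit order, a larger code always means stronger agreement, which is the property an ordinal encoding is supposed to preserve.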

Practical Implementation in Python

One Hot Encoding

import pandas as pd

data = {'Animal': ['Cat', 'Dog', 'Horse']}
df = pd.DataFrame(data)

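# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
print(pd.get_dummies(df, dtype=int))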

Label Encoding

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Animal'] = le.fit_transform(df['Animal'])
print(df)


Sample Dataset

import pandas as pd

data = {
    'Commodity': ['Cotton', 'Rice', 'Wheat', 'Cotton'],
    'Year': [2005, 2005, 2006, 2007],
    'Month': [9, 10, 11, 12],
    'Price': [1992, 1500, 1800, 2100]
}

df = pd.DataFrame(data)

print(df)

Output

   Commodity  Year  Month  Price
0     Cotton  2005      9   1992
1       Rice  2005     10   1500
2      Wheat  2006     11   1800
3     Cotton  2007     12   2100

🔹 PART 1: Label Encoding

Step 1: Import Library

from sklearn.preprocessing import LabelEncoder

Step 2: Create Encoder Object

le = LabelEncoder()

Step 3: Apply Encoding

df['Commodity_Label'] = le.fit_transform(df['Commodity'])

Function Meaning

Function | Meaning
fit() | Learns unique categories
transform() | Converts categories to numbers
fit_transform() | Performs both steps
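Keeping fit() and transform() separate matters when new data arrives later: the encoder is fitted once and then reused. A minimal sketch follows; the lists new_rows and unseen_rows, and the category 'Maize', are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

train_rows = ["Cotton", "Rice", "Wheat", "Cotton"]
new_rows = ["Wheat", "Cotton"]     # hypothetical later data
unseen_rows = ["Maize"]            # hypothetical category never seen in fit()

demo_le = LabelEncoder()
demo_le.fit(train_rows)                      # learn the categories once
new_codes = demo_le.transform(new_rows)      # reuse the same mapping
print(list(demo_le.classes_))                # ['Cotton', 'Rice', 'Wheat']
print(new_codes)                             # [2 0]
print(demo_le.inverse_transform([1]))        # ['Rice']

try:
    demo_le.transform(unseen_rows)
    unseen_rejected = False
except ValueError as err:
    # transform() only knows the categories it saw during fit()
    unseen_rejected = True
    print("Unseen category:", err)
```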

Mapping

print(dict(zip(le.classes_, le.transform(le.classes_))))

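Important: Label Encoding assigns integer codes (0, 1, 2), which imply an order. Models that rely on numeric magnitude or distance, such as Linear Regression or KNN, may treat that order as real.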
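The effect on a distance-based model such as KNN can be checked with simple arithmetic. This sketch compares Euclidean distances between the animal categories under both encodings; the dictionary and function names are illustrative.

```python
import math

# Label-encoded: each animal is a single number.
label_codes = {"Cat": [0], "Dog": [1], "Horse": [2]}
# One-hot encoded: each animal is a 0/1 vector.
one_hot_codes = {"Cat": [1, 0, 0], "Dog": [0, 1, 0], "Horse": [0, 0, 1]}

def distance(codes, a, b):
    """Euclidean distance between the encodings of categories a and b."""
    return math.dist(codes[a], codes[b])

for name, codes in [("label", label_codes), ("one-hot", one_hot_codes)]:
    print(name,
          round(distance(codes, "Cat", "Dog"), 3),
          round(distance(codes, "Cat", "Horse"), 3))
# label 1.0 2.0
# one-hot 1.414 1.414
```

Under label encoding, Cat looks twice as far from Horse as from Dog, purely because of the numbering. Under one-hot encoding every pair is equally far apart.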

🔹 PART 2: One-Hot Encoding (Using Pandas)

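# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
df_onehot = pd.get_dummies(df, columns=['Commodity'], dtype=int)
print(df_onehot)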

Function Meaning

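Function | Meaning
get_dummies() | Creates binary columns
columns=['Commodity'] | Apply encoding to this column only; the new columns are named Commodity_Cotton, Commodity_Rice, Commodity_Wheat and replace the original column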

Example Representation

Commodity Cotton Rice Wheat
Cotton 1 0 0
Rice 0 1 0
Wheat 0 0 1
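For linear models, one of the n columns is redundant, since the columns in each row always sum to 1. pandas offers drop_first=True to omit the first category. The sketch below shows the difference; df_demo is a fresh illustrative frame, not the df used above.

```python
import pandas as pd

df_demo = pd.DataFrame({"Commodity": ["Cotton", "Rice", "Wheat", "Cotton"]})

full = pd.get_dummies(df_demo, columns=["Commodity"], dtype=int)
reduced = pd.get_dummies(df_demo, columns=["Commodity"], dtype=int, drop_first=True)

print(list(full.columns))      # ['Commodity_Cotton', 'Commodity_Rice', 'Commodity_Wheat']
print(list(reduced.columns))   # ['Commodity_Rice', 'Commodity_Wheat']
print(reduced.iloc[0].tolist())  # [0, 0] -> Cotton is the row with all zeros
```

Whether to drop a column is a judgment call: it helps unregularized linear models, while tree models and KNN generally work fine with all n columns.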

🔹 PART 3: Using Sklearn OneHotEncoder

from sklearn.preprocessing import OneHotEncoder

# sparse_output requires scikit-learn 1.2 or later (older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)

encoded = ohe.fit_transform(df[['Commodity']])

encoded_df = pd.DataFrame(encoded, 
                          columns=ohe.get_feature_names_out())

print(encoded_df)

Function Explanation

Function | Meaning
OneHotEncoder() | Creates encoder object
sparse_output=False | Returns a dense NumPy array instead of a sparse matrix
fit_transform() | Learn + convert
get_feature_names_out() | Gets new column names

🎯 Final Comparison

Feature | Label Encoding | One-Hot Encoding
Output | 1 column | Multiple columns
Memory Usage | Low | Higher
Adds Order | Yes ❌ | No ✅
Good For | Tree Models | Linear, KNN, SVM

🚀 When To Use What?

Model | Best Encoding
Linear Regression | One-Hot
Logistic Regression | One-Hot
KNN | One-Hot
Random Forest | Label Encoding
XGBoost | Label Encoding