## What is Categorical Data?
Categorical data refers to data that is divided into groups or categories rather than numerical values.
Examples
- A, B, C
- Male / Female
- Kids / Middle Age / Senior
- Sweet / Salted
- Red / Yellow
These are labels or groups — not measurable numbers.
This type of grouped data is known as Categorical Data.
## Types of Categorical Data
Categorical data is mainly of two types:
- Nominal Data
- Ordinal Data
### 1. Nominal Data
Nominal data has:
- No order
- No ranking
- No mathematical meaning
Examples
- Male / Female
- Cat / Dog / Horse
- Red / Blue / Green
You cannot perform mathematical operations on them.
Male + Female ❌
Cat – Dog ❌
The word *nominal* comes from "name" — these values are only names or labels, which is why this is called Nominal Data.
### 2. Ordinal Data
Ordinal data has a clear order or ranking.
Examples
- Strongly Agree → Neutral → Strongly Disagree
- First Rank → Second Rank → Third Rank
The categories follow a logical sequence. Therefore, it is called Ordinal Data.
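As a small illustration, pandas (used throughout this article) can store an ordinal variable with its ranking made explicit via an ordered `Categorical`. The survey responses here are illustrative:

```python
import pandas as pd

# Ordinal data sketch: declare the category order explicitly
responses = pd.Categorical(
    ["Neutral", "Strongly Agree", "Strongly Disagree"],
    categories=["Strongly Disagree", "Neutral", "Strongly Agree"],
    ordered=True,
)

# .codes gives each value's position in the declared order
print(responses.codes.tolist())  # → [1, 2, 0]
```

Because the categorical is ordered, comparisons such as sorting respect the declared ranking instead of alphabetical order.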
## Why Do We Need Encoding?
- Machine Learning models do not understand text values.
- They only understand numerical values such as 0, 1, 2, etc.
- Therefore, we must convert categorical data into numbers.
This conversion process is called Encoding.
## Encoding Techniques
The two most commonly used encoding techniques are:
- One Hot Encoding
- Label Encoding
Other methods include Ordinal Encoding and Target Encoding.
### One Hot Encoding
In One Hot Encoding:
- Each category gets a separate column.
- Values are represented using 0 and 1.
- It creates a binary representation.
Example: Animal Dataset
Original Data
| Index | Animal |
|---|---|
| 1 | Cat |
| 2 | Dog |
| 3 | Horse |
After One Hot Encoding
| Index | Cat | Dog | Horse |
|---|---|---|---|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
If there are n categories, One Hot Encoding creates n new columns.
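The n-column behaviour can be checked directly with pandas `get_dummies` (introduced in the implementation section below); `drop_first=True` is a common variant that keeps only n − 1 columns, since the dropped category is implied when all the others are 0:

```python
import pandas as pd

animals = pd.DataFrame({"Animal": ["Cat", "Dog", "Horse"]})

# n = 3 categories -> 3 dummy columns
full = pd.get_dummies(animals["Animal"])
print(full.shape)     # (3, 3)

# drop_first=True keeps n - 1 columns; an all-zero row implies the dropped category
reduced = pd.get_dummies(animals["Animal"], drop_first=True)
print(reduced.shape)  # (3, 2)
```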
### Label Encoding
Label Encoding converts each category into a unique number.
Original Data
| Index | Animal |
|---|---|
| 1 | Cat |
| 2 | Dog |
| 3 | Horse |
After Label Encoding
| Index | Animal |
|---|---|
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
Cat → 0
Dog → 1
Horse → 2
## Difference Between One Hot Encoding and Label Encoding
| Feature | One Hot Encoding | Label Encoding |
|---|---|---|
| Number of Columns | Multiple columns | Single column |
| Values | 0 and 1 | 0, 1, 2... |
| Best Used For | Nominal Data | Ordinal Data |
| Order Problem | No order issue | May create false ranking |
### Important Note
If Label Encoding is used for nominal data, the model may assume:
Horse (2) > Dog (1) > Cat (0)
But in reality, there is no ranking.
- For Nominal Data → Use One Hot Encoding
- For Ordinal Data → Use Label Encoding
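For ordinal data it is often safest to write the ranking out by hand rather than rely on whatever numeric order an encoder happens to assign. A minimal sketch (the `Size` column and its order are illustrative, not from the dataset used below):

```python
import pandas as pd

df = pd.DataFrame({"Size": ["Small", "Large", "Medium", "Small"]})

# Explicit rank mapping: alphabetical order (Large=0, Medium=1, Small=2) would be wrong here
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["Size_Encoded"] = df["Size"].map(size_order)
print(df["Size_Encoded"].tolist())  # → [0, 2, 1, 0]
```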
## Practical Implementation in Python

### One Hot Encoding

```python
import pandas as pd

data = {'Animal': ['Cat', 'Dog', 'Horse']}
df = pd.DataFrame(data)
print(pd.get_dummies(df))
```
### Sample Dataset

The same dataset is used for the label-encoding and one-hot-encoding examples below:

```python
import pandas as pd

data = {
    'Commodity': ['Cotton', 'Rice', 'Wheat', 'Cotton'],
    'Year': [2005, 2005, 2006, 2007],
    'Month': [9, 10, 11, 12],
    'Price': [1992, 1500, 1800, 2100]
}
df = pd.DataFrame(data)
print(df)
```
Output:

| | Commodity | Year | Month | Price |
|---|---|---|---|---|
| 0 | Cotton | 2005 | 9 | 1992 |
| 1 | Rice | 2005 | 10 | 1500 |
| 2 | Wheat | 2006 | 11 | 1800 |
| 3 | Cotton | 2007 | 12 | 2100 |
### 🔹 PART 1: Label Encoding

Step 1: Import the library

```python
from sklearn.preprocessing import LabelEncoder
```

Step 2: Create the encoder object

```python
le = LabelEncoder()
```

Step 3: Apply the encoding

```python
df['Commodity_Label'] = le.fit_transform(df['Commodity'])
```
Function Meaning
| Function | Meaning |
|---|---|
| fit() | Learns unique categories |
| transform() | Converts categories to numbers |
| fit_transform() | Performs both steps |
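The split between fit() and transform() matters once training and test data are separate: the encoder should learn the mapping from the training data only and then reuse it on new data. A sketch with illustrative values:

```python
from sklearn.preprocessing import LabelEncoder

train = ["Cotton", "Rice", "Wheat"]
test = ["Wheat", "Cotton"]

le = LabelEncoder()
le.fit(train)                      # learn categories from training data only
train_codes = le.transform(train)  # Cotton=0, Rice=1, Wheat=2 (alphabetical)
test_codes = le.transform(test)    # same mapping reused on new data

print(train_codes.tolist(), test_codes.tolist())  # → [0, 1, 2] [2, 0]
```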
Mapping:

```python
print(dict(zip(le.classes_, le.transform(le.classes_))))
```

Important: Label Encoding imposes an order (0, 1, 2). This can mislead models that treat numbers as magnitudes or distances, such as Linear Regression or KNN.
### 🔹 PART 2: One-Hot Encoding (Using Pandas)

```python
df_onehot = pd.get_dummies(df, columns=['Commodity'])
```
Function Meaning
| Function | Meaning |
|---|---|
| get_dummies() | Creates binary columns |
| columns=['Commodity'] | Encodes only this column (the original column is dropped) |
Example Representation (the encoded columns are prefixed with the original column name; Year, Month, and Price are kept unchanged):

| Commodity (original) | Commodity_Cotton | Commodity_Rice | Commodity_Wheat |
|---|---|---|---|
| Cotton | 1 | 0 | 0 |
| Rice | 0 | 1 | 0 |
| Wheat | 0 | 0 | 1 |
| Cotton | 1 | 0 | 0 |
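Note that recent pandas versions return boolean (True/False) dummy columns by default; passing `dtype=int` reproduces the 0/1 representation shown above:

```python
import pandas as pd

df = pd.DataFrame({"Commodity": ["Cotton", "Rice", "Wheat", "Cotton"]})

# dtype=int forces 0/1 output instead of the boolean default
df_onehot = pd.get_dummies(df, columns=["Commodity"], dtype=int)
print(df_onehot)
```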
### 🔹 PART 3: Using Sklearn OneHotEncoder

```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)
encoded = ohe.fit_transform(df[['Commodity']])
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())
print(encoded_df)
```
Function Explanation
| Function | Meaning |
|---|---|
| OneHotEncoder() | Creates encoder object |
| sparse_output=False | Returns a dense NumPy array instead of a sparse matrix (this parameter replaced `sparse` in scikit-learn 1.2) |
| fit_transform() | Learn + convert |
| get_feature_names_out() | Gets new column names |
### 🎯 Final Comparison
| Feature | Label Encoding | One-Hot Encoding |
|---|---|---|
| Output | 1 column | Multiple columns |
| Memory Usage | Low | Higher |
| Adds Order | Yes ❌ | No ✅ |
| Good For | Tree Models | Linear, KNN, SVM |
### 🚀 When To Use What?
| Model | Best Encoding |
|---|---|
| Linear Regression | One-Hot |
| Logistic Regression | One-Hot |
| KNN | One-Hot |
| Random Forest | Label Encoding |
| XGBoost | Label Encoding |

