19/02/2026

Splitting Dataset

1️⃣ Why Do We Split Data?

When we build a Machine Learning model, we must:

  • Train the model → So it can learn patterns
  • Test the model → So we can check how well it learned

If we test on the same data used for training:

  • The model may show 100% accuracy
  • But this is not its real performance
  • It has simply memorized the data

This memorization problem is called Overfitting.

So we divide data into:

  • Training Set → To teach the model
  • Testing Set → To evaluate the model

2️⃣ Simple Human Example

Imagine:

  • We study 100 math problems → (Training Data)
  • In the exam, the teacher gives 30 new problems → (Testing Data)

Our exam marks show how much we actually learned.

Same way:

  • Model learns from old data
  • Predicts on new unseen data
  • We calculate accuracy in percentage

3️⃣ Standard Data Split Ratio

  • 70% – 80% → Training Data
  • 20% – 30% → Testing Data

Example:

If dataset has 1000 rows:

  • 800 rows → Training
  • 200 rows → Testing

Training data is always larger because the model needs more data to learn properly.
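The 1000-row arithmetic above can be checked with a quick sketch (the row count and ratio are just the example values from this section):

```python
# Split-size arithmetic for a hypothetical 1000-row dataset with a 20% test split.
n_rows = 1000
test_size = 0.2

n_test = int(n_rows * test_size)  # 200 rows reserved for testing
n_train = n_rows - n_test         # 800 rows left for training

print(n_train, n_test)  # 800 200
```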

4️⃣ Practical Example (Crop Price Dataset)

Year | Rainfall | WPI
2012 | 800      | 150
2013 | 750      | 160

Where:

  • Rainfall → Input (X)
  • WPI (Wholesale Price Index) → Output (y)

5️⃣ Method 1: Using SKLearn (Recommended)

from sklearn.model_selection import train_test_split

Step 1️⃣: Load Dataset

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("Cotton.csv")

Step 2️⃣: Define Features (X) and Target (y)

X = data[['Rainfall']]
y = data['WPI']

Step 3️⃣: Split Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

Parameter Explanation

Parameter       | Meaning
X, y            | Input and output data
test_size=0.2   | 20% of the data for testing
random_state=42 | Keeps the result the same every time

Step 4️⃣: Check Shape

print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

🔹 6️⃣ What Happens Internally?

  • Data is shuffled randomly
  • 80% goes to training
  • 20% goes to testing
  • Model learns from training
  • Model predicts on testing
  • Accuracy is calculated
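The shuffling and the role of random_state can be seen on a toy DataFrame (assumed data, not the crop dataset):

```python
# train_test_split shuffles rows before splitting; the same random_state
# reproduces exactly the same shuffle on every run.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Rainfall": range(10), "WPI": range(100, 110)})

a_train, a_test = train_test_split(df, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=42)

# Identical random_state -> identical splits.
print(a_test.index.tolist() == b_test.index.tolist())  # True
print(len(a_train), len(a_test))  # 8 2
```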

🔹 7️⃣ Manual Splitting (Without SKLearn)

train_size = int(0.8 * len(data))

train = data[:train_size]
test = data[train_size:]

Problem:

  • No automatic shuffling
  • May cause biased split

So SKLearn method is better.
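If manual splitting is still needed, the bias problem can be reduced by shuffling first. A minimal sketch on toy data (the DataFrame here is assumed for illustration):

```python
# Manual split with shuffling added: df.sample(frac=1) shuffles all rows,
# reset_index(drop=True) renumbers them cleanly afterwards.
import pandas as pd

data = pd.DataFrame({"Rainfall": range(10), "WPI": range(100, 110)})

shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

train_size = int(0.8 * len(shuffled))
train = shuffled[:train_size]
test = shuffled[train_size:]

print(len(train), len(test))  # 8 2
```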

🔹 8️⃣ Accuracy Check Example

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("R2 Score:", r2_score(y_test, y_pred))

This shows how well the model generalized: an R² score close to 1 means good predictions on unseen data.
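To see what the R² score actually measures, it can be computed by hand on toy numbers (assumed values, not the crop dataset): R² = 1 − SS_res / SS_tot.

```python
# Manual R² on toy predictions, compared against sklearn's r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([150.0, 160.0, 170.0, 180.0])
y_pred = np.array([152.0, 158.0, 171.0, 179.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)                                          # 0.98
print(np.isclose(manual_r2, r2_score(y_true, y_pred)))    # True
```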

🔹 9️⃣ Important Terminologies

  • Generalization → Model performs well on unseen data
  • Overfitting → Good on training, poor on testing
  • Underfitting → Poor on both training and testing
  • Data Leakage → Testing data used during training

🔹 🔟 Why Is the Train-Test Split Necessary?

  • To measure real performance
  • To avoid overfitting
  • To test model reliability
  • To simulate real-world prediction
  • Required for research publication

Without proper split:

  • Results are misleading
  • Model is not trustworthy

🔹 1️⃣1️⃣ Important Case: Time Series Data

In time-series models like ARIMA, LSTM, GRU:

Do NOT randomly split data.

Correct Chronological Split Example:

Year Price
2001 100
2002 120
2003 140
2004 150

Correct split:

  • 2001–2003 → Training
  • 2004 → Testing
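The chronological split above can be sketched with simple filtering (toy yearly prices assumed, taken from the table in this section):

```python
# Chronological split for time-series data: no shuffling,
# past years go to training and the latest year goes to testing.
import pandas as pd

ts = pd.DataFrame({"Year": [2001, 2002, 2003, 2004],
                   "Price": [100, 120, 140, 150]})

train = ts[ts["Year"] <= 2003]
test = ts[ts["Year"] == 2004]

print(train["Year"].tolist())  # [2001, 2002, 2003]
print(test["Year"].tolist())   # [2004]
```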

🔹 1️⃣2️⃣ When Not to Use a Simple Train-Test Split?

Use one of these alternatives instead:

  • Cross Validation → Small datasets
  • K-Fold Cross Validation → More reliable evaluation
  • TimeSeriesSplit → Forecasting
  • Walk Forward Validation → Advanced time-series
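K-Fold cross-validation can be sketched in a few lines. The data here is synthetic (assumed for illustration, not the crop dataset):

```python
# 5-fold cross-validation: the data is split into 5 parts; each part
# takes a turn as the test set, giving 5 separate R² scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

print(len(scores))  # 5 R² values, one per fold
```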



Train–Test Splitting on Crop Price Dataset

🔹 Step 1: Import Libraries


import pandas as pd
from sklearn.model_selection import train_test_split

✅ Explanation

1️⃣ import pandas as pd

  • pandas is used for handling structured data (tables).
  • pd is a short name (alias).

2️⃣ from sklearn.model_selection import train_test_split

  • Imports function used for automatic data splitting.
  • Part of the scikit-learn library.

Step 2: Load Excel File


data = pd.read_excel("/content/drive/MyDrive/cropdata/monthly_average_prices_with_month_year_2025_test.xlsx")

✅ Explanation

pd.read_excel()

  • Reads Excel file into DataFrame.
  • Converts Excel sheet into table format.

Now check:


data.head()

  • Shows the first 5 rows.
  • Used to check whether the data loaded correctly.

Step 3: Handle Missing Values


data['Min Price (Rs./Quintal)'] = data['Min Price (Rs./Quintal)'].fillna(
    data['Min Price (Rs./Quintal)'].mean()
)

✅ Explanation

  • data['column'] → Selects specific column.
  • .mean() → Calculates average of that column.
  • .fillna(value) → Replaces missing values (NaN).

👉 Replace missing Min Price with average Min Price.

Then check:


data.isnull().sum()

  • isnull() → Checks for missing values.
  • sum() → Counts them per column.
  • If all 0 → No missing data ✔
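The fillna-with-mean step can be seen on a toy column (assumed values): the NaN is skipped when computing the mean, then replaced by it.

```python
# Replacing a missing value with the column average on toy data.
import numpy as np
import pandas as pd

prices = pd.Series([100.0, np.nan, 200.0], name="Min Price (Rs./Quintal)")

filled = prices.fillna(prices.mean())  # mean of [100, 200] is 150

print(filled.tolist())        # [100.0, 150.0, 200.0]
print(filled.isnull().sum())  # 0
```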

Step 4: Convert Categorical to Numeric


from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Commodity'] = le.fit_transform(data['Commodity'])

✅ Explanation

Machine learning models cannot work with text like "Cotton" directly.

  • LabelEncoder() converts each text label into a number, assigned in alphabetical order of the labels.

Commodity | After Encoding
Cotton    | 0
Rice      | 1
Wheat     | 2
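A quick sketch on toy commodity names shows the encoding and its alphabetical ordering:

```python
# LabelEncoder sorts the unique labels alphabetically and numbers them.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Cotton", "Wheat", "Rice"])

print(list(le.classes_))  # ['Cotton', 'Rice', 'Wheat']
print(list(codes))        # [0, 2, 1]  (Cotton=0, Wheat=2, Rice=1)
```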

Step 5: Separate X and Y


X = data[['Commodity', 'Year', 'Month',
          'Min Price (Rs./Quintal)',
          'Max Price (Rs./Quintal)']]

Y = data['Modal Price (Rs./Quintal)']

✅ Explanation

  • X → Features / Input variables.
  • Y → Target / Output variable.

Step 6: Manual Splitting


X_train = X.iloc[:80]
X_test = X.iloc[80:]

Y_train = Y.iloc[:80]
Y_test = Y.iloc[80:]

✅ Explanation

  • iloc → Index-based selection.
  • :80 → Rows 0 to 79.
  • 80: → Rows 80 to end.

Variable | Meaning
X_train  | First 80 rows (training input)
X_test   | Remaining rows (testing input)
Y_train  | First 80 outputs
Y_test   | Remaining outputs

🔴 Problem: Manual splitting may cause bias if data is sorted.


Step 7: Automatic Splitting (Recommended)


X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,
    random_state=42
)

✅ Explanation

  • train_test_split() splits data randomly.
  • test_size=0.2 → 20% testing, 80% training.
  • random_state=42 → Keeps result consistent.

Step 8: Check Shapes


print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

Example Output:


(228, 5)
(57, 5)
(228,)
(57,)

  • 228 training rows
  • 57 testing rows
  • 5 features
  • Y has a single column
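A small sanity check on the example output above: the training and testing rows must add up to the full dataset, and the test fraction should match test_size.

```python
# Shape sanity check using the row counts from the example output.
n_train, n_test = 228, 57
n_total = n_train + n_test

print(n_total)                       # 285 rows in total
print(round(n_test / n_total, 2))    # 0.2 -> matches test_size=0.2
```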

🔥 Manual vs Automatic Split

Manual Split              | Automatic Split
Sequential                | Random
Risk of bias              | Balanced
Fixed row count           | Percentage based
Not generally recommended | Industry standard

🎯 Important Concept

Variable | Meaning
X_train  | Input used to train the model
Y_train  | Correct answers for training
X_test   | New input for testing
Y_test   | Real answers for checking accuracy

