19/02/2026

Splitting Dataset

1️⃣ Why Do We Split Data?

When we build a Machine Learning model, we must:

  • Train the model → So it can learn patterns
  • Test the model → So we can check how well it learned

If we test on the same data used for training:

  • The model may show 100% accuracy
  • But this is not its real performance
  • It has simply memorized the data

This memorization problem is called Overfitting.

So we divide data into:

  • Training Set → To teach the model
  • Testing Set → To evaluate the model

2️⃣ Simple Human Example

Imagine:

  • We study 100 math problems → (Training Data)
  • In the exam, the teacher gives 30 new problems → (Testing Data)

Our exam marks show how much we actually learned.

Same way:

  • Model learns from old data
  • Predicts on new unseen data
  • We calculate accuracy in percentage

3️⃣ Standard Data Split Ratio

  • 70% – 80% → Training Data
  • 20% – 30% → Testing Data

Example:

If dataset has 1000 rows:

  • 800 rows → Training
  • 200 rows → Testing

Training data is always larger because the model needs more data to learn properly.
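The 1000-row arithmetic above can be checked with a quick sketch (the row count and ratio are just the example values from this section):

```python
# Split-size arithmetic for a hypothetical 1000-row dataset with a 20% test split.
n_rows = 1000
test_size = 0.2

n_test = int(n_rows * test_size)  # 200 rows reserved for testing
n_train = n_rows - n_test         # 800 rows left for training

print(n_train, n_test)  # 800 200
```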

4️⃣ Practical Example (Crop Price Dataset)

Year | Rainfall | WPI
2012 | 800      | 150
2013 | 750      | 160

Where:

  • Rainfall → Input (X)
  • WPI (Wholesale Price Index) → Output (y)

5️⃣ Method 1: Using SKLearn (Recommended)

from sklearn.model_selection import train_test_split

Step 1️⃣: Load Dataset

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("Cotton.csv")

Step 2️⃣: Define Features (X) and Target (y)

X = data[['Rainfall']]
y = data['WPI']

Step 3️⃣: Split Data

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

Parameter Explanation

Parameter       | Meaning
X, y            | Input and output data
test_size=0.2   | 20% of the data for testing
random_state=42 | Keeps the result the same every time

Step 4️⃣: Check Shape

print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

🔹 6️⃣ What Happens Internally?

  • Data is shuffled randomly
  • 80% goes to training
  • 20% goes to testing
  • Model learns from training
  • Model predicts on testing
  • Accuracy is calculated
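The shuffling and the role of random_state can be seen on a toy DataFrame (assumed data, not the crop dataset):

```python
# train_test_split shuffles rows before splitting; the same random_state
# reproduces exactly the same shuffle on every run.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Rainfall": range(10), "WPI": range(100, 110)})

a_train, a_test = train_test_split(df, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=42)

# Identical random_state -> identical splits.
print(a_test.index.tolist() == b_test.index.tolist())  # True
print(len(a_train), len(a_test))  # 8 2
```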

🔹 7️⃣ Manual Splitting (Without SKLearn)

train_size = int(0.8 * len(data))

train = data[:train_size]
test = data[train_size:]

Problem:

  • No automatic shuffling
  • May cause biased split

So SKLearn method is better.
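If manual splitting is still needed, the bias problem can be reduced by shuffling first. A minimal sketch on toy data (the DataFrame here is assumed for illustration):

```python
# Manual split with shuffling added: df.sample(frac=1) shuffles all rows,
# reset_index(drop=True) renumbers them cleanly afterwards.
import pandas as pd

data = pd.DataFrame({"Rainfall": range(10), "WPI": range(100, 110)})

shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

train_size = int(0.8 * len(shuffled))
train = shuffled[:train_size]
test = shuffled[train_size:]

print(len(train), len(test))  # 8 2
```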

🔹 8️⃣ Accuracy Check Example

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("R2 Score:", r2_score(y_test, y_pred))

This shows how well the model generalized: an R² score close to 1 means good predictions on unseen data.
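To see what the R² score actually measures, it can be computed by hand on toy numbers (assumed values, not the crop dataset): R² = 1 − SS_res / SS_tot.

```python
# Manual R² on toy predictions, compared against sklearn's r2_score.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([150.0, 160.0, 170.0, 180.0])
y_pred = np.array([152.0, 158.0, 171.0, 179.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2)                                          # 0.98
print(np.isclose(manual_r2, r2_score(y_true, y_pred)))    # True
```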

🔹 9️⃣ Important Terminologies

  • Generalization → Model performs well on unseen data
  • Overfitting → Good on training, poor on testing
  • Underfitting → Poor on both training and testing
  • Data Leakage → Testing data used during training

🔹 🔟 Why Is the Train-Test Split Necessary?

  • To measure real performance
  • To avoid overfitting
  • To test model reliability
  • To simulate real-world prediction
  • Required for research publication

Without proper split:

  • Results are misleading
  • Model is not trustworthy

🔹 1️⃣1️⃣ Important Case: Time Series Data

In time-series models like ARIMA, LSTM, GRU:

Do NOT randomly split data.

Correct Chronological Split Example:

Year Price
2001 100
2002 120
2003 140
2004 150

Correct split:

  • 2001–2003 → Training
  • 2004 → Testing
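The chronological split above can be sketched with simple filtering (toy yearly prices assumed, taken from the table in this section):

```python
# Chronological split for time-series data: no shuffling,
# past years go to training and the latest year goes to testing.
import pandas as pd

ts = pd.DataFrame({"Year": [2001, 2002, 2003, 2004],
                   "Price": [100, 120, 140, 150]})

train = ts[ts["Year"] <= 2003]
test = ts[ts["Year"] == 2004]

print(train["Year"].tolist())  # [2001, 2002, 2003]
print(test["Year"].tolist())   # [2004]
```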

🔹 1️⃣2️⃣ When Not to Use a Simple Train-Test Split?

Use one of these alternatives instead:

  • Cross Validation → Small datasets
  • K-Fold Cross Validation → More reliable evaluation
  • TimeSeriesSplit → Forecasting
  • Walk Forward Validation → Advanced time-series
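K-Fold cross-validation can be sketched in a few lines. The data here is synthetic (assumed for illustration, not the crop dataset):

```python
# 5-fold cross-validation: the data is split into 5 parts; each part
# takes a turn as the test set, giving 5 separate R² scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")

print(len(scores))  # 5 R² values, one per fold
```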



Train–Test Splitting on Crop Price Dataset

🔹 Step 1: Import Libraries


import pandas as pd
from sklearn.model_selection import train_test_split

✅ Explanation

1️⃣ import pandas as pd

  • pandas is used for handling structured data (tables).
  • pd is a short name (alias).

2️⃣ from sklearn.model_selection import train_test_split

  • Imports function used for automatic data splitting.
  • Part of the scikit-learn library.

Step 2: Load Excel File


data = pd.read_excel("/content/drive/MyDrive/cropdata/monthly_average_prices_with_month_year_2025_test.xlsx")

✅ Explanation

pd.read_excel()

  • Reads Excel file into DataFrame.
  • Converts Excel sheet into table format.

Now check:


data.head()

  • Shows the first 5 rows.
  • Used to check whether the data loaded correctly.

Step 3: Handle Missing Values


data['Min Price (Rs./Quintal)'] = data['Min Price (Rs./Quintal)'].fillna(
    data['Min Price (Rs./Quintal)'].mean()
)

✅ Explanation

  • data['column'] → Selects specific column.
  • .mean() → Calculates average of that column.
  • .fillna(value) → Replaces missing values (NaN).

👉 Replace missing Min Price with average Min Price.

Then check:


data.isnull().sum()

  • isnull() → Checks for missing values.
  • sum() → Counts them per column.
  • If all 0 → No missing data ✔
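The fillna-with-mean step can be seen on a toy column (assumed values): the NaN is skipped when computing the mean, then replaced by it.

```python
# Replacing a missing value with the column average on toy data.
import numpy as np
import pandas as pd

prices = pd.Series([100.0, np.nan, 200.0], name="Min Price (Rs./Quintal)")

filled = prices.fillna(prices.mean())  # mean of [100, 200] is 150

print(filled.tolist())        # [100.0, 150.0, 200.0]
print(filled.isnull().sum())  # 0
```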

Step 4: Convert Categorical to Numeric


from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Commodity'] = le.fit_transform(data['Commodity'])

✅ Explanation

Machine learning models cannot work with text like "Cotton" directly.

  • LabelEncoder() converts each text label into a number, assigned in alphabetical order of the labels.

Commodity | After Encoding
Cotton    | 0
Rice      | 1
Wheat     | 2
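A quick sketch on toy commodity names shows the encoding and its alphabetical ordering:

```python
# LabelEncoder sorts the unique labels alphabetically and numbers them.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Cotton", "Wheat", "Rice"])

print(list(le.classes_))  # ['Cotton', 'Rice', 'Wheat']
print(list(codes))        # [0, 2, 1]  (Cotton=0, Wheat=2, Rice=1)
```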

Step 5: Separate X and Y


X = data[['Commodity', 'Year', 'Month',
          'Min Price (Rs./Quintal)',
          'Max Price (Rs./Quintal)']]

Y = data['Modal Price (Rs./Quintal)']

✅ Explanation

  • X → Features / Input variables.
  • Y → Target / Output variable.

Step 6: Manual Splitting


X_train = X.iloc[:80]
X_test = X.iloc[80:]

Y_train = Y.iloc[:80]
Y_test = Y.iloc[80:]

✅ Explanation

  • iloc → Index-based selection.
  • :80 → Rows 0 to 79.
  • 80: → Rows 80 to end.

Variable | Meaning
X_train  | First 80 rows (training input)
X_test   | Remaining rows (testing input)
Y_train  | First 80 outputs
Y_test   | Remaining outputs

🔴 Problem: Manual splitting may cause bias if data is sorted.


Step 7: Automatic Splitting (Recommended)


X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,
    random_state=42
)

✅ Explanation

  • train_test_split() splits data randomly.
  • test_size=0.2 → 20% testing, 80% training.
  • random_state=42 → Keeps result consistent.

Step 8: Check Shapes


print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

Example Output:


(228, 5)
(57, 5)
(228,)
(57,)

  • 228 training rows
  • 57 testing rows
  • 5 features
  • Y has a single column
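A small sanity check on the example output above: the training and testing rows must add up to the full dataset, and the test fraction should match test_size.

```python
# Shape sanity check using the row counts from the example output.
n_train, n_test = 228, 57
n_total = n_train + n_test

print(n_total)                       # 285 rows in total
print(round(n_test / n_total, 2))    # 0.2 -> matches test_size=0.2
```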

🔥 Manual vs Automatic Split

Manual Split              | Automatic Split
Sequential                | Random
Risk of bias              | Balanced
Fixed row count           | Percentage based
Not generally recommended | Industry standard

🎯 Important Concept

Variable | Meaning
X_train  | Input used to train the model
Y_train  | Correct answers for training
X_test   | New input for testing
Y_test   | Real answers for checking accuracy

