1️⃣ Why Do We Split Data?
When we build a Machine Learning model, we must:
- Train the model → So it can learn patterns
- Test the model → So we can check how well it learned
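If we test on the same data used for training:
- The model may show close to 100% accuracy
- But this does not reflect real performance
- The model may simply have memorized the data
Memorizing the training data instead of learning general patterns is called Overfitting, and testing on the training data hides it.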
So we divide data into:
- Training Set → To teach the model
- Testing Set → To evaluate the model
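The gap between training and testing performance can be seen in a small experiment. The sketch below uses made-up data (evenly spaced inputs with about 30% of the labels flipped at random) and a decision tree chosen only for illustration; none of these values come from the crop dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): one numeric feature, labels with random noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 300).reshape(-1, 1)
clean_labels = (X[:, 0] > 5).astype(int)
flip = rng.random(300) < 0.3              # flip about 30% of the labels
y = np.where(flip, 1 - clean_labels, clean_labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unrestricted tree can memorize every training row, noise included
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on rows the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on unseen rows

print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```

The training accuracy should come out perfect while the testing accuracy is clearly lower, which is exactly the situation a separate testing set is meant to reveal.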
2️⃣ Simple Human Example
Imagine:
- We study 100 math problems → (Training Data)
- In the exam, the teacher gives 30 new problems → (Testing Data)
Our exam marks show how much we actually learned.
Same way:
- Model learns from old data
- Predicts on new unseen data
- We calculate accuracy in percentage
3️⃣ Standard Data Split Ratio
- 70% – 80% → Training Data
- 20% – 30% → Testing Data
Example:
If the dataset has 1000 rows and we use an 80/20 split:
- 800 rows → Training
- 200 rows → Testing
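The training set is usually the larger part because the model needs enough examples to learn the patterns, while the testing set only needs to be large enough to give a stable performance estimate.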
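The row counts for any ratio can be worked out with a few lines of arithmetic. In this sketch, split_sizes is a hypothetical helper (not part of any library), and it assumes the testing count is rounded up, which is how train_test_split handles a fractional test_size.

```python
import math

def split_sizes(n_rows, test_size=0.2):
    """Return (n_train, n_test) for a fractional test size.

    Assumption: the testing count is rounded up and the
    remaining rows go to training.
    """
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

print(split_sizes(1000, 0.2))  # 80/20 split of 1000 rows
print(split_sizes(1000, 0.3))  # 70/30 split of 1000 rows
```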
4️⃣ Practical Example (Crop Price Dataset)
| Year | Rainfall | WPI |
|---|---|---|
| 2012 | 800 | 150 |
| 2013 | 750 | 160 |
Where:
- Rainfall → Input (X)
- WPI → Output (y)
5️⃣ Method 1: Using SKLearn (Recommended)
from sklearn.model_selection import train_test_split
Step 1️⃣: Load Dataset
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv("Cotton.csv")
Step 2️⃣: Define Features (X) and Target (y)
X = data[['Rainfall']]
y = data['WPI']
Step 3️⃣: Split Data
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42
)
Parameter Explanation
| Parameter | Meaning |
|---|---|
| X, y | Input and output data |
| test_size=0.2 | 20% of the rows go to the testing set |
| random_state=42 | Fixes the random seed so the same split is produced on every run |
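The effect of random_state is easy to check directly. The sketch below uses ten placeholder row numbers instead of the real dataset and splits them twice with the same seed and once without a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical row numbers standing in for a dataset
rows = np.arange(10)

# Same seed twice: the two splits should be identical
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)

# No seed: the split may change from run to run
train_c, test_c = train_test_split(rows, test_size=0.2)

print("Seed 42, first call :", test_a)
print("Seed 42, second call:", test_b)
print("No seed             :", test_c)
```

The value 42 itself has no special meaning; any fixed integer gives a repeatable split.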
Step 4️⃣: Check Shape
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
🔹 6️⃣ What Happens Internally?
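- Data is shuffled randomly (train_test_split shuffles by default)
- 80% of the rows go to training
- 20% of the rows go to testing
- The model then learns from the training set
- The model predicts on the testing set
- The predictions are compared with the true testing values to get a score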
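The shuffle-and-cut part of the steps above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's actual implementation; simple_split and the year values are made up for the example.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row positions, then cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)     # step 1: shuffle randomly
    n_test = round(len(rows) * test_size)    # step 2: decide the cut point
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]     # step 3: remaining rows train
    test = [rows[i] for i in test_idx]       # step 4: held-out rows test
    return train, test

rows = list(range(2001, 2011))               # ten hypothetical records
train_rows, test_rows = simple_split(rows)
print("Train:", train_rows)
print("Test :", test_rows)
```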
🔹 7️⃣ Manual Splitting (Without SKLearn)
train_size = int(0.8 * len(data))
train = data[:train_size]
test = data[train_size:]
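Problem:
- No automatic shuffling
- May cause a biased split if the rows are sorted (for example, grouped by commodity)
So the SKLearn method is usually the better choice when the row order carries no meaning.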
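To see the bias, consider a small table that is sorted by commodity. The data below is invented for illustration; the sketch compares a plain sequential cut with a cut made after shuffling the rows using pandas' sample method.

```python
import pandas as pd

# Hypothetical data sorted by commodity: 8 Cotton rows, then 2 Wheat rows
data = pd.DataFrame({
    "Commodity": ["Cotton"] * 8 + ["Wheat"] * 2,
    "Price": [100, 120, 140, 150, 170, 180, 200, 210, 900, 950],
})

train_size = int(0.8 * len(data))

# Sequential split: training sees only Cotton, testing sees only Wheat
train_seq = data[:train_size]
test_seq = data[train_size:]

# Shuffled split: reorder the rows first, then cut at the same position
shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)
train_shuf = shuffled[:train_size]
test_shuf = shuffled[train_size:]

print("Sequential test rows:", test_seq["Commodity"].tolist())
print("Shuffled test rows  :", test_shuf["Commodity"].tolist())
```

With the sequential cut the model would be tested on a commodity it never saw during training, so the score would say little about real performance.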
🔹 8️⃣ Accuracy Check Example
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("R2 Score:", r2_score(y_test, y_pred))
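The R2 score tells how much of the variation in the testing targets is explained by the model's predictions: 1.0 is a perfect fit, 0 is no better than always predicting the mean, and the score can be negative for a worse model. Because WPI is a continuous value, this is a regression metric rather than a classification accuracy.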
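The score can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. In the sketch below, r2_manual is a hypothetical helper and the WPI values are made up.

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical WPI values for four testing rows and a model's predictions
y_true = [150, 160, 170, 180]
y_pred = [152, 158, 171, 177]

r2 = r2_manual(y_true, y_pred)
print("R2 Score:", r2)

# Always predicting the mean of the true values gives a score of 0
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
baseline = r2_manual(y_true, mean_only)
print("Mean-only baseline:", baseline)
```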
🔹 9️⃣ Important Terminologies
- Generalization → Model performs well on unseen data
- Overfitting → Good on training, poor on testing
- Underfitting → Poor on both training and testing
- Data Leakage → Information from the testing data reaches the model during training
🔹 🔟 Why Train-Test Split is Necessary?
- To measure real performance
- To detect overfitting
- To test model reliability
- To simulate real-world prediction
- Required for research publication
Without proper split:
- Results are misleading
- Model is not trustworthy
🔹 1️⃣1️⃣ Important Case: Time Series Data
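For time-series models like ARIMA, LSTM, GRU:
Do NOT randomly split the data. A random split lets the model train on future values and be tested on past ones, which leaks information and inflates the score.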
Correct Chronological Split Example:
| Year | Price |
|---|---|
| 2001 | 100 |
| 2002 | 120 |
| 2003 | 140 |
| 2004 | 150 |
Correct split:
- 2001–2003 → Training
- 2004 → Testing
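A chronological split of the four-row table can be sketched with pandas. The choice of holding out exactly one year is taken from the example above; n_test is just a name used for this illustration.

```python
import pandas as pd

# The four-row table from the example
prices = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004],
    "Price": [100, 120, 140, 150],
})

# Sort by time first, then cut: no shuffling
prices = prices.sort_values("Year").reset_index(drop=True)
n_test = 1                          # hold out the most recent year
train = prices.iloc[:-n_test]
test = prices.iloc[-n_test:]

print("Train years:", train["Year"].tolist())
print("Test years :", test["Year"].tolist())
```

If the data is already in time order, calling train_test_split with shuffle=False should give the same kind of ordered cut.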
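🔹 1️⃣2️⃣ When Not to Use Simple Train-Test Split?
A single split can give an unreliable estimate in the cases below, so these alternatives are used instead:
- Cross Validation (small dataset)
- K-Fold Cross Validation
- TimeSeriesSplit (forecasting)
- Walk Forward Validation (advanced time-series)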
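Two of these alternatives are available in scikit-learn as KFold and TimeSeriesSplit. The sketch below runs both on ten placeholder rows (assumed to be in time order) and prints which rows land in each fold; the fold counts are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# Ten hypothetical rows, already in time order
X = np.arange(10).reshape(-1, 1)

# K-Fold: every row is used for testing exactly once across the folds
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_splits = list(kfold.split(X))
for fold, (train_idx, test_idx) in enumerate(kfold_splits, start=1):
    print(f"K-Fold {fold}: test rows {test_idx}")

# TimeSeriesSplit: training rows always come before the testing rows
tss = TimeSeriesSplit(n_splits=3)
ts_splits = list(tss.split(X))
for fold, (train_idx, test_idx) in enumerate(ts_splits, start=1):
    print(f"TimeSeries {fold}: train rows {train_idx}, test rows {test_idx}")
```

In both cases the model is trained and scored once per fold, and the scores are usually averaged.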
Train–Test Splitting on Crop Price Dataset
🔹 Step 1: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
✅ Explanation
1️⃣ import pandas as pd
- pandas is used for handling structured data (tables).
- pd is a short name (alias).
2️⃣ from sklearn.model_selection import train_test_split
- Imports function used for automatic data splitting.
- Part of the scikit-learn library.
Step 2: Load Excel File
data = pd.read_excel("/content/drive/MyDrive/cropdata/monthly_average_prices_with_month_year_2025_test.xlsx")
✅ Explanation
pd.read_excel()
- Reads Excel file into DataFrame.
- Converts Excel sheet into table format.
Now check:
data.head()
- Shows first 5 rows.
- Used to check whether data loaded correctly.
Step 3: Handle Missing Values
data['Min Price (Rs./Quintal)'] = data['Min Price (Rs./Quintal)'].fillna(
data['Min Price (Rs./Quintal)'].mean()
)
✅ Explanation
- data['column'] → Selects specific column.
- .mean() → Calculates average of that column.
- .fillna(value) → Replaces missing values (NaN).
👉 Replace missing Min Price with average Min Price.
Then check:
data.isnull().sum()
- isnull() → Checks missing values.
- sum() → Counts missing values.
- If all 0 → No missing data ✔
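One caveat: a mean computed over all rows also uses rows that later end up in the testing set, which is a mild form of data leakage. A stricter variant, sketched below on made-up prices with a simple positional split, computes the mean on the training rows only and reuses it for the testing rows.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing values
col = "Min Price (Rs./Quintal)"
data = pd.DataFrame({col: [1000.0, np.nan, 1200.0, 1400.0, np.nan, 5000.0]})

# Split first (a simple positional split, just for illustration)
train = data.iloc[:4].copy()
test = data.iloc[4:].copy()

# Compute the mean on the training rows only ...
train_mean = train[col].mean()

# ... and use that same value to fill gaps in both parts
train[col] = train[col].fillna(train_mean)
test[col] = test[col].fillna(train_mean)

print("Training mean:", train_mean)
print("Missing after fill:", train[col].isnull().sum(), test[col].isnull().sum())
```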
Step 4: Convert Categorical to Numeric
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Commodity'] = le.fit_transform(data['Commodity'])
✅ Explanation
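Most machine learning algorithms need numeric input and cannot work directly with text like "Cotton".
- LabelEncoder() converts each distinct text value into an integer code, assigned in sorted (alphabetical) order.
| Commodity | After Encoding |
|---|---|
| Cotton | 0 |
| Rice | 1 |
| Wheat | 2 |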
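The mapping can be checked rather than assumed by inspecting classes_ after fitting. LabelEncoder sorts the labels before numbering them, so the codes follow alphabetical order and not the order in which values appear. The commodity list below is made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical commodity column
commodities = ["Cotton", "Wheat", "Rice", "Cotton"]

le = LabelEncoder()
codes = le.fit_transform(commodities)

# classes_ lists the original labels; a label's position is its code
print("Classes:", list(le.classes_))
print("Codes  :", list(codes))

# inverse_transform maps the codes back to text
decoded = list(le.inverse_transform(codes))
print("Decoded:", decoded)
```

Keep in mind that these integer codes imply an order (Cotton < Rice < Wheat) that some models may treat as meaningful.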
Step 5: Separate X and Y
X = data[['Commodity', 'Year', 'Month',
'Min Price (Rs./Quintal)',
'Max Price (Rs./Quintal)']]
Y = data['Modal Price (Rs./Quintal)']
✅ Explanation
- X → Features / Input variables.
- Y → Target / Output variable.
Step 6: Manual Splitting
X_train = X.iloc[:80]
X_test = X.iloc[80:]
Y_train = Y.iloc[:80]
Y_test = Y.iloc[80:]
✅ Explanation
- iloc → Index-based selection.
- :80 → Rows 0 to 79.
- 80: → Rows 80 to end.
| Variable | Meaning |
|---|---|
| X_train | First 80 rows (training input) |
| X_test | Remaining rows (testing input) |
| Y_train | First 80 outputs |
| Y_test | Remaining outputs |
🔴 Problem: Manual splitting may cause bias if data is sorted. Also note that :80 selects a fixed 80 rows, not 80% of the data.
Step 7: Automatic Splitting (Recommended)
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y,
test_size=0.2,
random_state=42
)
✅ Explanation
- train_test_split() splits data randomly.
- test_size=0.2 → 20% testing, 80% training.
- random_state=42 → Fixes the random seed so the split is the same on every run.
Step 8: Check Shapes
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
Example Output:
(228, 5)
(57, 5)
(228,)
(57,)
- 228 training rows
- 57 testing rows
- 5 features
- Y is one-dimensional (a single target column)
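These shapes can be reproduced without the Excel file. The sketch below assumes the dataset has 285 rows in total (228 + 57, inferred from the output above) and uses arrays of zeros as placeholders for the real values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the assumed size: 285 rows, 5 feature columns
X = np.zeros((285, 5))
Y = np.zeros(285)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(Y_train.shape, Y_test.shape)
```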
🔥 Manual vs Automatic Split
| Manual Split | Automatic Split |
|---|---|
| Sequential | Random |
| Risk of bias if data is sorted | Representative on average |
| Fixed row count | Percentage based |
| Not recommended generally (except for time series) | Common default |
🎯 Important Concept
| Variable | Meaning |
|---|---|
| X_train | Input used to train model |
| Y_train | Correct answers for training |
| X_test | New input for testing |
| Y_test | Real answers for checking accuracy |

