The Importance of Train/Test Split in Machine Learning Models
If you have ever laid eyes on a machine learning model script written in Python, you have most likely seen a line similar to the following:
from sklearn.model_selection import train_test_split

# test_size defines the proportion of the data to be used as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
And you are likely using one in your model.py even if you have not noticed it yet. This guide is meant to give beginners an introduction to what the train/test split is, how it is used, and why it is important to pay attention to it.
In machine learning, the goal is to create models that can generalize well to unseen data. To evaluate how well a model performs, we need a way to test it on data it hasn’t encountered during training. If we trained and tested the model on the same data, we could not confidently assess how well it would perform on new data. This is where the train/test split becomes crucial.
As the name suggests, this operation splits the dataset into two parts, a training set and a test set: one for training the model and another for testing it. Now let’s look at the function again and see what each part of it means:
- X_train, X_test, y_train, y_test — these are the outputs of the function based on the defined function parameters
- train_test_split(X, y, test_size=0.2, random_state=42) — this is the function call, which has the following parts:
– X, y is the original dataset before splitting (the features and the labels, respectively)
– test_size=0.2 specifies the size of the test set as a proportion of the original dataset (here, 20% for testing and 80% for training)
– random_state=42 is the seed for the random shuffling of the split; by specifying the random_state you can be sure that you get the same training set and test set every time given the same original dataset. Conversely, you can force a different split by changing the seed number.
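To make this concrete, here is a minimal, self-contained sketch with a made-up toy dataset (the values are illustrative only), showing the split sizes and how random_state makes the split reproducible:

from sklearn.model_selection import train_test_split
import numpy as np

# Toy dataset: 10 samples with a single feature, plus binary labels
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# 80/20 split; random_state=42 makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)  # (8, 1) -> 80% of the samples
print(X_test.shape)   # (2, 1) -> 20% of the samples

# Re-running with the same random_state yields the identical split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train, X_train2))  # True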
There is no one-size-fits-all ratio for splitting a dataset into training and test sets. Still, common practices have emerged based on dataset size, the nature of the problem, and the computational resources available.
- 80/20 Split: The most widely used ratio is an 80/20 split. This split strikes a good balance between having enough data to train the model and keeping a sufficient portion for an unbiased evaluation. It’s ideal for most datasets, especially when the data size is reasonably large.
- 70/30 Split: In cases where more testing is needed to evaluate the model’s performance, a 70/30 split can be used. This split might be beneficial when the dataset size is smaller, or when we need more confidence in testing results.
- 90/10 Split: For extremely large datasets, where even 10% of the data constitutes a substantial test set, a 90/10 split is often sufficient. This leaves the majority of the data for training, which can be useful for training complex models such as deep learning architectures.
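In code, the ratio is simply the test_size argument. A quick sketch of the three conventions above, assuming X and y are already defined:

# 80/20 split (the most common default)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 70/30 split (e.g., for smaller datasets where a larger test set is wanted)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 90/10 split (e.g., for very large datasets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)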
The goal of splitting the original dataset into two parts is to address overfitting and underfitting, two common issues in machine learning model development.
Overfitting occurs when a model performs well on training data but poorly on test data because it has learned patterns specific to the training set rather than generalizing to unseen data. This can be caused by several factors:
- A poorly sized train/test split (e.g., a training set that is too small or a test set that is not representative).
- A model that is too complex (e.g., with too many parameters relative to the amount of data).
- Data leakage, where the model inadvertently learns from the test set (e.g., when data overlaps between training and testing).
Underfitting happens when a model is too simplistic to capture the patterns in the data, leading to poor performance on both the training and test sets. It is often caused by:
- A model that is not complex enough to learn from the data (e.g., linear models for highly non-linear problems).
- Not enough training data, meaning the model doesn’t have sufficient examples to learn from.
A well-balanced train/test split can help mitigate over/underfitting by providing the model with enough data to learn, while the test set allows for early detection of these issues.
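Because the test set is held out, comparing training and test scores is a simple way to spot these issues early. Here is a minimal sketch using a synthetic scikit-learn dataset (the dataset and model choice are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, used purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree can memorize the training data
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print(f"train accuracy: {model.score(X_train, y_train):.2f}")
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")
# A large gap (e.g., 1.00 on train vs. noticeably lower on test) signals overfitting;
# low scores on both sets would instead point to underfitting.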
This section is the heart and soul of this article, inspired by the point campaign currently ongoing at Allora, a self-improving decentralized intelligence built by the community.
Read more about the point campaign here: https://app.allora.network/points/overview
Most topics on the campaign are related to price forecasting, which makes them time-series forecasting problems. Preventing data leakage in the train/test split can be vital to how well a time-series model generalizes, and the common random split is not a suitable method because it introduces data leakage. To understand why, we have to go back to the purpose of time-series forecasting: predicting the future using data from the past. A training set from a random split can contain data from the latest “future” available in the original dataset; the model is therefore trained with foresight of what the future may look like, and this fits the definition of data leakage.
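To see the leak concretely, here is a small sketch where a made-up ordered index stands in for timestamps; a random split ends up training on rows from the test set’s future:

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up time index: row i is observed at time i
t = np.arange(100)
t_train, t_test = train_test_split(t, test_size=0.2, random_state=42)

# Some training timestamps come after the earliest test timestamps,
# i.e., the model sees data from the "future" of the test set
print(t_train.max() > t_test.min())  # True for essentially any random split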
Time-series train/test split
A suitable method to split the dataset in the case of a time-series forecast is to “split the future and the past”: train the model on the earlier portion of the data and hold out the most recent portion as the test set, preserving chronological order.
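Here is a minimal sketch of such a chronological split, assuming the rows of X and y are already ordered from oldest to newest (an assumption for this sketch):

import numpy as np
from sklearn.model_selection import train_test_split

# Assume the data is sorted by time, oldest first
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Option 1: shuffle=False keeps chronological order,
# so the last 20% (the most recent data) becomes the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Option 2: an explicit cutoff index achieves the same thing
cutoff = int(len(X) * 0.8)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

For cross-validation on time series, scikit-learn’s TimeSeriesSplit applies the same past/future principle across multiple folds.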