The `train_test_split` function shuffles the rows of the original data, keeps a proportion of them for the training dataset, and uses the rest for testing. So if `train_size=0.7`, the function will shuffle your data and keep 70 per cent of the shuffled rows for training and 30 per cent for testing.
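For example, here is a minimal sketch with made-up data (ten rows, two features, chosen only to make the proportions easy to see):

```
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 70% of the shuffled rows go to training, the remaining 30% to testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
print(x_train.shape, x_test.shape)  # (7, 2) (3, 2)
```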
If you run `train_test_split(x, y, train_size=0.7)` without declaring a random state, the resulting split will (almost) always be different from one run to the next. We set a random state to tell the function to shuffle the data identically every time, which makes our results reproducible. In other words, if you run `train_test_split(x, y, train_size=0.7, random_state=123)`, you will always get the same split.
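You can check this with a short sketch (again with made-up data): two calls that share a `random_state` produce identical splits, while a call without one usually does not:

```
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)  # made-up data
y = np.arange(10)

# Same random_state -> the shuffle, and therefore the split, is identical
a_train, _, _, _ = train_test_split(x, y, train_size=0.7, random_state=123)
b_train, _, _, _ = train_test_split(x, y, train_size=0.7, random_state=123)
print(np.array_equal(a_train, b_train))  # True

# No random_state -> a different shuffle (almost) every time
c_train, _, _, _ = train_test_split(x, y, train_size=0.7)
print(np.array_equal(a_train, c_train))  # almost always False
```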
As for your code, note that the data you are splitting also changes in the second call. Here is your code with comments:
```
# Divide `x` and `y` into 70% train and 30% test
# Note that you are splitting `x` and `y`           ▼  ▼
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                     train_size=0.7,
                                                     random_state=123)

# Split the 70% into 80% train and 20% validation
# Note that you are not splitting `x` and `y` anymore       ▼        ▼
part_x_train, x_val, part_y_train, y_val = train_test_split(x_train, y_train,
                                                             test_size=0.2,
                                                             random_state=2)
```
Note that in the second split you are splitting `x_train` and `y_train`.
This means your code takes 70% of the original data to create a training dataset and then splits that subset into 80% for training and 20% for validation. Relative to the original data, that leaves 56% for training (0.7 × 0.8), 14% for validation (0.7 × 0.2), and 30% for testing.
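As a quick check, here is a sketch with made-up data (1,000 rows, a size chosen only so the percentages come out as whole numbers) that reproduces your two splits and prints the resulting sizes:

```
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1,000 rows, 5 features
x = np.random.rand(1000, 5)
y = np.random.rand(1000)

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=123)
part_x_train, x_val, part_y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=2)

print(len(part_x_train))  # 560 -> 56% of the original data for training
print(len(x_val))         # 140 -> 14% for validation
print(len(x_test))        # 300 -> 30% for testing
```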