Best practice for train, validation and test set

Question

I want to assign a sample class to each instance in a dataframe: 'train', 'validation' or 'test'. If I call sklearn's train_test_split() twice, I can get the indices for a train, validation and test set like this:

from sklearn.model_selection import train_test_split

X = df.drop(['target'], axis=1)
y=df[['target']]

X_train, X_test, y_train, y_test, indices_train, indices_test=train_test_split(X, y, df.index, 
                                                                             test_size=0.2, 
                                                                             random_state=10, 
                                                                             stratify=y, 
                                                                             shuffle=True)
# indices_train holds index labels (taken from df.index), so select with .loc rather than .iloc
df_=df.loc[indices_train]

X_ = df_.drop(['target'], axis=1)
y_=df_[['target']]

X_train, X_val, y_train, y_val, indices_train, indices_val=train_test_split(X_, y_, df_.index, 
                                                                             test_size=0.15, 
                                                                             random_state=10, 
                                                                             stratify=y_, 
                                                                             shuffle=True)

df['sample']=['train' if i in indices_train else 'test' if i in indices_test else 'val' for i in df.index]

What is the best practice for getting a train, validation and test set? Are there any problems with my approach above, and can it be phrased better?

You are splitting your data the wrong way. Give me a couple of minutes to elaborate. – Luis Alejandro Vargas Ramos, Sep 05 '22 at 05:11

murari prasad · Answer 1 · 2022-09-05T05:10:12.047

2

If the dataset is large, a faster option is to use numpy, as described in:

How to split data into 3 sets (train, validation and test)?

Or, more simply, keep your solution but feed the X_train and y_train you obtained in the first step straight into the train/validation split. Storing the indices and then slicing the rows out of the df again feels unnecessary.
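
For what it's worth, here is a minimal sketch of the numpy route that still ends up with a 'sample' column. The toy dataframe, the column names and the 70/15/15 proportions are assumptions for illustration, and note that this plain shuffle is not stratified the way the question's code is:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are assumptions for illustration.
df = pd.DataFrame({"feature": np.arange(100), "target": np.tile([0, 1], 50)})

# Shuffle the row positions 0..n-1 with a fixed seed for reproducibility.
rng = np.random.default_rng(10)
positions = rng.permutation(len(df))

# Cut points for an assumed 70/15/15 split.
n_train = round(0.70 * len(df))
n_val = round(0.15 * len(df))
train_pos, val_pos, test_pos = np.split(positions, [n_train, n_train + n_val])

# Build the label column by position, then attach it to the frame.
labels = np.empty(len(df), dtype=object)
labels[train_pos] = "train"
labels[val_pos] = "val"
labels[test_pos] = "test"
df["sample"] = labels
```

Because the labels are written by position, this works regardless of what the dataframe's index looks like.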

edited Sep 05 '22 at 05:10

answered Sep 05 '22 at 05:09

murari prasad


Thanks for your answer. The indices are important to be able to create the feature 'sample' in the original dataframe. – Henri Sep 05 '22 at 05:23
Is it necessary to store which record went into which set using that sample class? Wouldn't it be simpler to store your three sets separately and just access them when required? Why manipulate the source data? I am curious now xD – murari prasad Sep 05 '22 at 05:45
It's not necessary, it's just what I want. Your way is fine as well. I have no argument against that. :) – Henri Sep 05 '22 at 05:49

score 1 · Accepted Answer · answered Sep 05 '22 at 05:24

1

So, I made a dummy dataset of 100 points. I separated the features from the target and did the first split:

X = df.drop('target', axis=1)
y = df['target']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

If you have a look, my test size is 0.3, which means 70 data points go to training and the remaining 30 are held out, to be shared between test and validation.

X_train.shape # Output (70, 3)
X_test.shape # Output (30, 3)

Now you need to split again for validation, so you can do it like this:

X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5)

Notice how I name the groups, and that test_size is now 0.5. This means I take the 30 held-out points and split them in half, one half for validation and one half for testing. So the shapes of the validation and test sets will be:

X_val.shape # Output (15, 3)
X_test.shape # Output (15, 3)

In the end you have 70 points for training, 15 for validation and 15 for testing. Now, think of validation as a "double check" of your training. There are a lot of messy concepts related to that, but it is just a way to be sure of your training.
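
If you also want the 'sample' column from the question, one possible sketch is to split the index labels instead of the data and then write the labels back with .loc. The toy dataframe, its non-default index, and the use of stratify and random_state here are my assumptions, not something the steps above require:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a non-default index, to show that labels are handled.
df = pd.DataFrame(
    {"feature": np.arange(100), "target": np.tile([0, 1], 50)},
    index=np.arange(1000, 1100),
)

# First split: 70% train, 30% held out, stratified on the target.
idx_train, idx_rest = train_test_split(
    df.index.to_numpy(), test_size=0.3, random_state=10, stratify=df["target"]
)

# Second split: halve the held-out part into validation and test.
idx_val, idx_test = train_test_split(
    idx_rest, test_size=0.5, random_state=10, stratify=df.loc[idx_rest, "target"]
)

# Everything starts as 'train'; the two held-out groups are then overwritten by label.
df["sample"] = "train"
df.loc[idx_val, "sample"] = "val"
df.loc[idx_test, "sample"] = "test"
```

The second test_size is relative to the held-out part, so 0.5 of 30% gives 15% of the full data for each of validation and test.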

answered Sep 05 '22 at 05:24

Luis Alejandro Vargas Ramos


Thanks. It looks quite similar to my approach, except for the size of the splits. Is it important that the validation and test sets are the same size, do you reckon? – Henri Sep 05 '22 at 05:33
1

Up to you. But in my opinion, the best approach is to go for cross-validation. It helps you guard against things like overfitting. But if your dataset is huge, go for train-test-split. – Luis Alejandro Vargas Ramos Sep 05 '22 at 05:39
:) Thanks for the input. Do you mean I don't need a validation set if my dataset is huge? – Henri Sep 05 '22 at 05:51
1

Cross-validation will take forever. In that case, go for train-test-split. – Luis Alejandro Vargas Ramos Sep 05 '22 at 05:54
Oh yes, you're right. However, I need a validation set for XGBoost, even if it will take forever :) – Henri Sep 05 '22 at 05:55
1

The validation set really doesn't matter too much. As I said, it is like a "double check of your training", so if you can go with cross-validation, you will get a better training. – Luis Alejandro Vargas Ramos Sep 05 '22 at 05:58
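
To make the cross-validation suggestion in these comments concrete, here is a rough sketch of what it could look like: hold out a test set once, then cross-validate on the remaining data instead of carving out a fixed validation set. The synthetic data from make_classification, the LogisticRegression model and the 5 folds are all assumptions picked for illustration, not anything stated in the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data standing in for the real dataframe (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=10)

# Hold out a test set once; it is not touched during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

# 5-fold cross-validation on the training part: each fold takes a turn as validation data.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)

# Final check: refit on all training data and score once on the held-out test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
```

The model is fitted once per fold plus once at the end, which is where the extra cost on a huge dataset comes from.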
