Thoughts about train_test_split for machine learning

Question

I just noticed that many people use train_test_split even before handling missing data; they seem to split the data at the very beginning,

and there are also plenty of people who split the data right before the model-building step, after they have done all the data cleaning, feature engineering, and feature selection.

The people who split the data at the very beginning say that it is to prevent data leakage.

Right now I am just so confused about the pipeline for building a model. Why do we need to split the data at the very beginning, and then clean the train set and test set separately, when we could do all the data cleaning and feature engineering (for example, transforming categorical variables into dummy variables) on the whole dataset together for convenience?

Please help me with this. I really want to know a convenient and scientific pipeline.

For a discussion (including a code demonstration) of the issue regarding feature selection, see the two answers in [Should Feature Selection be done before Train-Test Split or after?](https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after) (disclaimer: mine). — desertnaut, Apr 16 '20 at 11:18

mcskinner · Answer 1 · 2020-04-16T06:10:06.793

You should split the data as early as possible.

To put it simply, your data engineering pipeline builds models too.

Consider the simple idea of filling in missing values. To do this you need to "train" a mini-model to generate the mean or mode or some other average to use. Then you use this model to "predict" missing values.
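To make the "mini-model" idea concrete, here is a minimal sketch in plain Python. The `MeanImputer` class and the toy `train` / `test` lists are hypothetical names made up for illustration, not part of any library; the point is only that filling in missing values has a "train" step (learn the mean) and a "predict" step (use it).

```python
import math


class MeanImputer:
    """Hypothetical mini-model: it learns one number (the mean) and fills gaps with it."""

    def fit(self, values):
        # "Training": compute the mean of the observed (non-NaN) values.
        observed = [v for v in values if not math.isnan(v)]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        # "Prediction": replace every missing value with the learned mean.
        return [self.mean_ if math.isnan(v) else v for v in values]


# Toy data, invented for this example.
train = [1.0, float("nan"), 3.0]
test = [float("nan"), 10.0]

imputer = MeanImputer().fit(train)        # fitted on the training data only
train_filled = imputer.transform(train)   # [1.0, 2.0, 3.0]
test_filled = imputer.transform(test)     # [2.0, 10.0]
```

Note that the test set is only ever passed to `transform`, never to `fit`.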

If you include the test data in the training process for these mini-models, then you are letting the training process peek at that data and cheat a little bit because of that. When it fills in the missing data, with values built using the test data, it is leaving little hints about what the test set is like. This is what "data leakage" means in practice. In an ideal world you could ignore it, and instead just use all data for training and use the training score to decide which model is best.
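A small numeric illustration of that leak, as a sketch with made-up numbers (the lists and the `observed` helper are assumptions for the example): when the fill value is computed from all the data, the training rows end up containing a value that was shaped by the test set.

```python
from statistics import mean

# Toy data, invented for this example; None marks a missing value.
train = [1.0, 2.0, None, 3.0]
test = [100.0, None]


def observed(values):
    """Hypothetical helper: keep only the non-missing values."""
    return [v for v in values if v is not None]


# Leaky: the fill value is computed from train AND test together.
leaky_fill = mean(observed(train + test))   # 26.5, pulled up by the test value 100.0

# Clean: the fill value is computed from the training data only.
clean_fill = mean(observed(train))          # 2.0

leaky_train = [leaky_fill if v is None else v for v in train]
clean_train = [clean_fill if v is None else v for v in train]
```

In `leaky_train` the imputed entry is 26.5, which tells the model something about the test set before it has ever been evaluated on it; in `clean_train` it is 2.0.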

But that won't work, because in practice a model is only useful once it is able to predict any new data, and not just the data available at training time. Google Translate needs to work on whatever you and I type in today, not just what it was trained with earlier.

So, in order to ensure that the model will continue to work well when that happens, you should test it on some new data in a more controlled way. Using a test set, which has been split out as early as possible and then hidden away, is the standard way to do that.

Yes, it means some inconvenience to split the data engineering up for training vs testing. But many tools like scikit-learn, which separates the fit and transform stages, make it convenient to build an end-to-end data engineering and modeling pipeline with the right train/test separation.
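As a rough sketch of what such a pipeline could look like in scikit-learn (the synthetic data, the choice of mean imputation, scaling, and logistic regression, and all variable names are assumptions made for this example, not a recommendation for any particular dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Split first, before any cleaning or feature engineering.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every preprocessing step lives inside the pipeline, so calling fit()
# learns the imputation means and scaling statistics from the training data only.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# At test time the pipeline only applies transform() with the statistics
# learned during fit(); nothing is re-estimated from the test set.
test_accuracy = model.score(X_test, y_test)
```

Because the preprocessing and the model are one object, the same `model` can also be passed to cross-validation helpers such as `cross_val_score`, which refit the whole pipeline on each training fold.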

Well, kudos for the "*data engineering pipeline builds models*", which is exactly the case and beautifully put (+1). I had a *really* hard time trying to convince the OP in [Should Feature Selection be done before Train-Test Split or after?](https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after), who was insisting that "*this is preprocessing, not modeling*" (I didn't succeed) :( — desertnaut, Apr 16 '20 at 11:22
@YOUWANG I am glad to hear this was helpful :) if you think others will also find it helpful, please consider accepting it. — mcskinner, Apr 16 '20 at 23:09
