I have noticed that many people call train_test_split before they even handle missing data, so they split the data at the very beginning.
Many others split the data right before the model-building step, after all the data cleaning, feature engineering, and feature selection are done.
Those who split at the very beginning say they do it to prevent data leakage.
I am confused about the right pipeline for building a model. Why do we need to split the data at the very beginning and then clean the training set and test set separately, when it would be more convenient to do all the data cleaning and feature engineering (for example, converting categorical variables to dummy variables) on the whole dataset at once?
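To make the question concrete, here is a rough sketch of what I understand the split-first workflow to look like with scikit-learn. The toy data, the column names (age, city), and the choice of mean imputation and logistic regression are all made up for illustration, and I am not sure this is the recommended way to do it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data: one numeric column with missing values, one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
y = rng.integers(0, 2, n)  # random labels, only so the model has something to fit

# Split first: the test rows are held out before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

# Cleaning and encoding live inside the pipeline, so they are fitted on the
# training rows only and then merely applied to the test rows.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# The imputation value the pipeline learned comes from the training rows only.
train_only_mean = model.named_steps["prep"].named_transformers_["num"].statistics_[0]

# Cleaning everything before splitting would instead use all rows,
# so the test rows would influence the value used to fill the gaps.
all_rows_mean = df["age"].mean()

print("imputation mean from training rows only:", train_only_mean)
print("imputation mean from all rows:", all_rows_mean)
print("test accuracy:", test_accuracy)
```

If I read this correctly, the difference between the two printed means is exactly the information that would flow from the test set into preprocessing if I cleaned everything together, and wrapping the steps in a Pipeline seems to keep the split-first approach about as convenient as the clean-first one.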
Please help me with this. I would really like to know a pipeline that is both convenient and scientifically sound.