In Kaggle competitions, we have a train and a test dataset, so we usually develop a model on the training data and evaluate it on the test data, which the algorithm has never seen. I was wondering what the best validation method is for a regression problem when we are given only a single dataset, with no separate test set. I think there are two possible approaches:
In the first approach, right after importing the dataset we split it into train and test sets, so the test set is not seen by the algorithm until the very last step. After preprocessing and feature engineering, we can use cross-validation (or a further train/validation split) on the training set to tune the model and reduce its error. Finally, the model's quality is checked on the unseen test data.
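To make the first approach concrete, here is a minimal sketch assuming scikit-learn; the synthetic dataset (`make_regression`) and the `Ridge` model are illustrative placeholders, not part of the question:

```python
# Sketch of approach 1: hold out a test set first, validate on the rest.
# Dataset and model are illustrative assumptions (synthetic data + Ridge).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Step 1: hold out a test set the model never sees during development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: tune/validate on the training set only, via 5-fold cross-validation.
model = Ridge(alpha=1.0)
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
)
print("CV MSE:", -cv_scores.mean())

# Step 3: refit on all training data, then evaluate once on the held-out set.
model.fit(X_train, y_train)
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
print("Test MSE:", test_mse)
```

The key point is that `X_test`/`y_test` are touched exactly once, at the end, so the final error estimate is not biased by the tuning loop.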
I have also seen that, for regression problems, some data scientists use the whole dataset for both training and validation, i.e. they use all the data at the same time without holding anything out.
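For comparison, the second approach might look like this minimal sketch (again assuming scikit-learn, with the same illustrative synthetic data and model): cross-validation over the entire dataset, which leaves no truly unseen test set for a final check.

```python
# Sketch of approach 2: cross-validate over ALL the data at once.
# Dataset and model are illustrative assumptions (synthetic data + Ridge).
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Every row serves as both training and validation data across the folds,
# so the CV score is the only error estimate available.
scores = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error"
)
print("Mean CV MSE:", -scores.mean())
```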
Could you please tell me which strategy is better, especially when a recruiter gives us just one dataset and asks us to develop a model to predict the target variable?
Thanks, Med