
In Kaggle competitions we are given a train dataset and a test dataset, so we usually develop a model on the training data and evaluate it on the test data, which the algorithm has never seen. I was wondering what the best validation strategy is for a regression problem when we are given just one dataset, without any separate test set. I think there might be two approaches:

  1. As a first step, right after importing the dataset, split it into train and test sets; with this approach the test set is not seen by the algorithm until the very last step. After performing preprocessing and feature engineering, we can use cross-validation techniques on the training set (or a further train-test split) to improve the model's error. Finally, the quality of the model can be checked on the unseen data.

  2. Alternatively, I have seen that for regression problems some data scientists use the whole dataset for testing and validation; I mean they use all the data at the same time.
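Approach 1 above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn and a synthetic dataset (the model and parameters are placeholders, not a recommendation):

```python
# Approach 1: carve off a test set first, then validate with
# cross-validation on the training portion only.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the single dataset we are given.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Step 1: hold out the test set before any modelling decisions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: tune and validate using cross-validation on the training set only.
model = Ridge(alpha=1.0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Step 3: one final quality check on the truly unseen test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
```

The key point is that `X_test`/`y_test` never influence preprocessing, feature engineering, or model selection.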

Could you please advise which strategy is better, especially when a recruiter gives us just one dataset and asks us to develop a model to predict the target variable?

Thanks, Med


2 Answers


You must divide the dataset into two parts: a training dataset and a validation dataset.

Then train your model on the training dataset and validate it on the validation dataset. The more data you have, the better your model can be fitted. The quality of the model can be checked against the validation dataset split off earlier, using scoring metrics; for regression these are metrics such as MAE, RMSE, or R² (accuracy applies to classification, not regression).
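The quality check described here can be sketched like this, assuming scikit-learn and a synthetic dataset; the metrics shown (MAE, RMSE, R²) are common regression choices, not the answerer's specific ones:

```python
# Train on the training split, then score the held-out validation
# split with standard regression metrics.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_val)

mae = mean_absolute_error(y_val, pred)          # mean absolute error
rmse = mean_squared_error(y_val, pred) ** 0.5   # root mean squared error
r2 = r2_score(y_val, pred)                      # coefficient of determination
```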

When checking the quality of the model, you can also create your own custom dataset with values similar to those of the original dataset.

On Kaggle, when the competition is about to close, the actual test dataset is released, and the model's results are ranked on it.

The reason is that with more data the algorithm has more feature-label pairs to train and validate on, which makes the model perform better.

Approach 2 described in the question is better:

> Also, I saw that for regression problems, some data scientists use the whole dataset for testing and validation, I mean they use all the data at the same time.

Approach 1 is not preferred because on a competitive platform your model has to perform as well as possible, so having less training and validation data can hurt its accuracy.

Aagam Sheth
  • Thanks, Aagam. Would you please let me know your advice regarding the size of the validation dataset? The dataset has around 1000 records. – Med Sep 04 '20 at 04:49
  • About 20% should be used for validation and 80% for training. – Aagam Sheth Sep 04 '20 at 04:50
  • That is really dependent on the dataset and the model; no single answer can be right. – Aagam Sheth Sep 04 '20 at 04:51
  • [Rule for Train Validation split](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio#:~:text=Roughly%2017.7%25%20should%20be%20reserved,validation%20and%2082.3%25%20for%20training.&text=Well%20you%20should%20think%20about,tell%20that%20model%20works%20fine.) – Aagam Sheth Sep 04 '20 at 04:53
  1. Divide your one dataset into a training dataset and a testing dataset.
  2. While training your model, divide the training dataset into training, validation, and testing parts, run the model, check the accuracy, and save the model.
  3. Load the saved model and predict on the testing dataset.
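The three steps above can be sketched end to end like this. It is a minimal illustration assuming scikit-learn and joblib on a synthetic dataset; the model, split sizes, and file name are placeholders:

```python
# Split, train with an inner validation split, save the model,
# then reload it to score the final test set.
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)

# 1. Split the single dataset into training and final testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# 2. Within the training part, hold out a validation split, fit, and save.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
val_score = model.score(X_val, y_val)

path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)

# 3. Reload the saved model and predict on the final test set.
loaded = joblib.load(path)
test_score = loaded.score(X_test, y_test)
```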
Rina
  • Thanks, Rina. Do you have any idea about the size of the testing dataset? My original dataset has 1000 data points. – Med Sep 04 '20 at 04:47
  • You can divide the data into 90% training and 10% testing. You can also try different ratios, like 90%-10%, 80%-20%, and 70%-30%, and then compare your model's accuracy. – Rina Sep 04 '20 at 04:52
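Comparing the split ratios suggested in the comment can be sketched like this, assuming scikit-learn on a synthetic dataset (the scores here are R², as a stand-in for whatever metric you use):

```python
# Try several train/test ratios and record the test-set score for each.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

scores = {}
for test_size in (0.1, 0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores[test_size] = model.score(X_te, y_te)
```

Note that with only ~1000 rows, scores from a single small test split can be noisy, which is one reason cross-validation is often preferred.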