9

I was reading about cross-validation and how it is used to select the best model and estimate parameters, but I did not really understand what it actually does.

Suppose I build a linear regression model and go for 10-fold cross-validation. Each of the 10 fits will have different coefficient values, so from these 10 different models, which one should I pick as my final model or parameter estimates?

Or do we use cross-validation only for the purpose of finding an average error (the average over the 10 models in our case) and comparing it against another model?

Gambit1614
av abhishiek
  • 4
    My understanding is that CV gives you an estimate of the error for a model trained on *all* the data. So I think after you have made the 10 models as you have described, you would still need to train an 11th model, but using all 10 folds for training. You then use the average CV error as an estimate of the error of this 11th model. – Dan Jan 04 '18 at 16:08

5 Answers

8

If you build a linear regression model and go for 10-fold cross-validation, then indeed each of the 10 fits will have different coefficient values. The reason you use cross-validation is to get a robust idea of the error of your linear model, rather than evaluating it on a single train/test split, which could be unfortunate or too lucky. CV is more robust because it is very unlikely that all ten splits are lucky or all ten unfortunate.

Your final model is then trained on the whole training set - this is where your final coefficients come from.
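A minimal sketch of that workflow (assuming scikit-learn; the data and names here are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, just for illustration
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression()

# 10-fold CV: ten fits on ten train/validation splits, used only
# to estimate the generalization error of this modelling approach
scores = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
print("Estimated MSE:", -scores.mean())

# Final model: one refit on ALL the data - these are the
# coefficients you actually report or deploy
model.fit(X, y)
print("Final coefficients:", model.coef_)
```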

shosaco
  • Do you have any book or paper I can cite for "Your final model is then trained on the whole training set - this is where your final coefficients come from"? – Murilo Jul 19 '23 at 09:34
2

Cross-validation is used to see how good your model's predictions are. It is a clever way to run multiple tests on the same data by splitting it, as you probably know (i.e. it is particularly useful when you don't have much training data).

As an example, it can be used to make sure you aren't overfitting. So basically, when you have finished building your model, you evaluate it with cross-validation, and if you see that the error grows a lot somewhere, you go back to tweaking the parameters.
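As a sketch of that idea (assuming scikit-learn; the polynomial degrees and noise level are arbitrary choices): a model that fits the training data almost perfectly but has a much larger CV error is overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine curve, just for illustration
rng = np.random.RandomState(1)
X = np.sort(rng.rand(40, 1), axis=0)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)

# If the CV error is much larger than the training error,
# the model is overfitting and needs tweaking
for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
    print(f"degree={degree}: train MSE={train_mse:.3f}, CV MSE={cv_mse:.3f}")
```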

Edit: Read the Wikipedia article for a deeper understanding of how it works: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29

  • 1
    my confusion stems from the fact that when we do k-fold cross-validation, we are in essence building k separate models, so to check model efficiency, i.e. estimate the error, we take the average of all the errors from the k folds – av abhishiek Aug 03 '17 at 10:08
  • Maybe Mohammad Kashif is correct that you are confusing this with grid search. Please see his answer. – Hampus Londögård Aug 03 '17 at 10:52
1

You are basically confusing grid search with cross-validation. The idea behind cross-validation is to check how well a model will perform in, say, a real-world application. So we repeatedly split the data at random and validate the model's performance on each split. It should be noted that the parameters of the model remain the same throughout the cross-validation process.

In grid search, we try to find the parameter values that give the best results over a specific split of the data (say 70% train and 30% test). So in this case, the dataset remains constant while different parameter combinations of the same model are tried.
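In practice the two are often combined: scikit-learn's GridSearchCV, for instance, scores every candidate parameter combination by cross-validation. A small sketch (the model and the alpha grid are arbitrary examples):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Grid search over the regularization strength; each candidate
# value of alpha is scored by 5-fold cross-validation
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```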

Read more about cross-validation here.

Gambit1614
1

Cross-validation is mainly used for the comparison of different models. For each model, you get the average generalization error over the k validation sets. You can then choose the model with the lowest average generalization error as your optimal model.
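For instance, a sketch of such a comparison (assuming scikit-learn; the two candidate models are arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

# Score each candidate model by its average 10-fold CV error
# and keep the one with the lowest error
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    mse = -cross_val_score(model, X, y, cv=10,
                           scoring="neg_mean_squared_error").mean()
    print(type(model).__name__, "average CV MSE:", round(mse, 2))
```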

1

Cross-Validation or CV allows us to compare different machine learning methods and get a sense of how well they will work in practice.

Scenario-1 (Directly related to the question)

  • Yes, CV can be used to find out which method (SVM, Random Forest, etc.) will perform best, and we can pick that method to work with further.

(For each method, several models are generated and evaluated on the folds; an average metric is calculated per method, and the best average metric helps in selecting the method.)

  • After getting the information about the best method or best parameters, we can retrain our model on the whole training dataset.
  • Such parameters can be determined by grid search techniques. See grid search

Scenario-2:

Suppose you have a small amount of data and want to perform training, validation, and testing. Dividing such a small dataset into three sets drastically reduces the number of training samples, and the result will depend on the particular choice of training and validation sets. CV comes to the rescue here. In this case we don't need a separate validation set, but we still need to hold out the test data. A model is trained on k-1 folds of the training data and the remaining fold is used for validation. The mean and standard deviation of the metric then show how well the model is likely to perform in practice.
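A sketch of that setup (assuming scikit-learn; the sizes and the model are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

# Hold out a test set; thanks to CV, no separate validation set is needed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 5-fold CV on the training data: mean and spread of the score
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=5)
print(f"CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final fit on all training data, checked once on the untouched test set
final = LinearRegression().fit(X_train, y_train)
print("Test R^2:", round(final.score(X_test, y_test), 3))
```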