In simple words, what is the difference between cross-validation and grid search? How does grid search work? Should I do first a cross-validation and then a grid search?
6 Answers
Cross-validation is when you reserve part of your data to use in evaluating your model. There are different cross-validation methods. The simplest, conceptually, is to just take 70% (just making up a number here; it doesn't have to be 70%) of your data and use that for training, and then use the remaining 30% of the data to evaluate the model's performance. The reason you need different data for training and evaluating the model is to protect against overfitting. There are other (slightly more involved) cross-validation techniques, of course, like k-fold cross-validation, which is often used in practice.
Grid search is a method to perform hyper-parameter optimisation, that is, it is a method to find the best combination of hyper-parameters (an example of a hyper-parameter is the learning rate of the optimiser) for a given model (e.g. a CNN) and dataset. In this scenario, you have several models, each with a different combination of hyper-parameters. Each of these combinations of hyper-parameters, which corresponds to a single model, can be said to lie on a point of a "grid". The goal is then to train each of these models and evaluate them, e.g. using cross-validation. You then select the one that performed best.
To give a concrete example, if you're using a support vector machine, you could use different values for `gamma` and `C`. So, for example, you could have a grid with the following values for `(gamma, C)`: `(1, 1)`, `(0.1, 1)`, `(1, 10)`, `(0.1, 10)`. It's a grid because it's like a product of `[1, 0.1]` for `gamma` and `[1, 10]` for `C`. Grid search would basically train an SVM for each of these four pairs of `(gamma, C)` values, then evaluate each using cross-validation, and select the one that did best.
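As a rough sketch of that procedure in scikit-learn (the dataset and the exact parameter values are just placeholders for illustration), the loop below evaluates each `(gamma, C)` pair with cross-validation and keeps the best one:

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

best_score, best_params = -1.0, None
# The "grid": every combination of the candidate gamma and C values.
for gamma, C in product([1, 0.1], [1, 10]):
    model = SVC(gamma=gamma, C=C)
    # Score this (gamma, C) pair with 5-fold cross-validation.
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, {"gamma": gamma, "C": C}

print(best_params, best_score)
```

In practice, `GridSearchCV` does this loop (and a final refit of the best model on the full training data) for you.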
- If I have completed GridSearchCV and got the desired estimator, do I need to run cross_validate again? Would the cross_validate result be the same as the best estimator from GridSearchCV? – EBDS Dec 29 '21 at 07:38
Cross-validation is a method for robustly estimating test-set performance (generalization) of a model. Grid-search is a way to select the best of a family of models, parametrized by a grid of parameters.
Here, by "model", I don't mean a trained instance, more the algorithms together with the parameters, such as SVC(C=1, kernel='poly')
.
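A minimal sketch of that distinction (the dataset and parameter values are illustrative placeholders): `cross_val_score` estimates how well one fixed model generalizes, while `GridSearchCV` picks the best member of a family of models defined by a parameter grid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: estimate the generalization of one fixed model.
print(cross_val_score(SVC(C=1, kernel='poly'), X, y, cv=5).mean())

# Grid search: pick the best model from a family parametrized by a grid,
# using cross-validation internally to score each candidate.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["poly", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```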

- Well, I understand that. But in the example from scikit-learn there is first a split of the dataset by doing `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)` and then in the grid search there is `clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)`. So does this mean that the first step splits e.g. a dataset of 1000 into 500 train and 500 test objects, and after that the grid search splits the training set of 500 via "cv=5" 5-fold cross-validation? So the 500 objects are split into maybe 250 and 250, or 400 and 100, and so on?! – Linda Oct 13 '13 at 07:59
- Yes, that's right. Half the data is reserved for evaluation **after** the grid-search model selection (which uses 5-fold cross-validation). The reason is that they don't just want to select the best model, but also to have a good estimate of how well it generalizes (how well it performs on new data). You can't just use the score from the grid-search cross-validation, because you chose the model that scored highest on that, so there may be some kind of selection bias built into its score. That's why they keep part of the data to test on after the grid search is over. – Or Neeman Oct 14 '13 at 17:14
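Putting the workflow from this comment thread into a short sketch (the dataset and the `tuned_parameters` values are illustrative placeholders): half the data is held out, the grid search cross-validates only on the training half, and the held-out half gives an unbiased final score.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Step 1: hold out half the data for the final, unbiased evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Step 2: grid search with 5-fold cross-validation on the training half only.
tuned_parameters = {"C": [1, 10], "gamma": [1, 0.1]}
clf = GridSearchCV(SVC(), tuned_parameters, cv=5)
clf.fit(X_train, y_train)

# Step 3: score the selected (refitted) model on the untouched test half.
print(clf.best_params_, clf.score(X_test, y_test))
```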
Cross-validation simply means separating your data into training and test sets and validating the training results against the test data. There are two cross-validation techniques that I know of.
First, train/test cross-validation: splitting the data into a training set and a test set.
Second, k-fold cross-validation: split your data into k bins, use each bin in turn as the test data and the rest of the data as training data, and validate against the test data. Repeat the process k times and take the average performance. k-fold cross-validation is especially useful for small datasets, since it maximizes both the test and training data.
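A rough sketch of k-fold cross-validation, assuming a generic scikit-learn classifier and a placeholder dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

scores = []
# Split the data into k = 5 bins; each bin is used once as the test data.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on the other bins
    scores.append(model.score(X[test_idx], y[test_idx]))   # validate on the held-out bin

print(np.mean(scores))  # average performance over the k folds
```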
Grid search: systematically working through multiple combinations of parameter values, cross-validating each one, and determining which gives the best performance. You can work through many combinations, changing the parameters only a bit at a time.

Cross-validation is a method of reserving a particular subset of your dataset on which you do not train the model. Later, you test your model on this subset before finalizing it.
The main steps you need to perform to do cross-validation are:
- Split the whole dataset into training and test datasets (e.g. 80% of the whole dataset is the training dataset and the remaining 20% is the test dataset)
- Train the model using the training dataset
- Test your model on the test dataset. If your model performs well on the test dataset, continue the training process
There are other cross-validation methods, for example (a short sketch follows this list):
- Leave-one-out cross-validation (LOOCV)
- K-fold cross-validation
- Stratified K-fold cross-validation
- Adversarial cross-validation strategies (used when the training and test datasets differ greatly from each other)
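A minimal sketch of two of these strategies in scikit-learn (the classifier and dataset are placeholder assumptions): the same model is scored with leave-one-out splitting and with stratified 5-fold splitting.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Leave-one-out: every sample serves as the test set exactly once.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

# Stratified k-fold: every fold preserves the class proportions.
skf_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

print(loo_scores.mean(), skf_scores.mean())
```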

- This does not answer the original question. You're not explaining the difference between cross-validation and grid search. – nbro Jan 16 '19 at 11:57
It is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test`, `y_test` (you can decide how much, mostly between 20-30%). In cross-validation you then run several runs, each with a different training set: for example, with 5-fold cross-validation you run 5 folds (5 runs), and in every fold you pick a different training set (`X_train`, `y_train`). The test data is always held out. This is used to avoid overfitting, that is, to avoid that the model is fitted to one specific problem and then fails to give a good result when new data arrives.
The best parameters can be determined by grid-search techniques. Most machine learning models have parameters you can adjust to find the best results; for example, in a decision tree you can adjust the number of nodes via the parameter list.
Normally, if you want to develop a good machine learning model, you use a combination of both techniques: cross-validation with grid search.
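As a sketch of that combination (the classifier, its parameter grid, and the dataset are just illustrative assumptions): a decision tree is tuned by grid search with 5-fold cross-validation on the training split, then checked on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Tune tree-size parameters by grid search, scored with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 8], "max_leaf_nodes": [4, 16, 64]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # final check on data the search never saw
```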

In simple terms, consider making pasta as building a model:
- Cross validation - choosing the quantity of pasta
- Grid search - choosing the right proportion of ingredients.
