29

I've always thought, from what I've read, that cross-validation is performed like this:

In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds then can be averaged (or otherwise combined) to produce a single estimation.

So k models are built and the final one is the average of those. But the Weka guide says that each model is always built using ALL of the dataset. So how does cross-validation in Weka work? Is the model built from all the data, and does "cross-validation" mean that k folds are created, the model is evaluated on each of them, and the final output is simply the averaged result from the folds?
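
To make my assumption concrete, this is roughly the procedure I have in mind, written against Weka's own API (just a sketch; `cv.arff` is a made-up file name):

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("cv.arff");      // made-up file name
data.setClassIndex(data.numAttributes() - 1);
data.randomize(new Random(1));                    // shuffle before partitioning

int k = 10;
double sum = 0;
for (int fold = 0; fold < k; fold++) {
    Instances train = data.trainCV(k, fold);      // the other k-1 folds
    Instances test  = data.testCV(k, fold);       // this fold, held out once
    Classifier model = new J48();                 // a fresh model per fold
    model.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(model, test);
    sum += eval.pctCorrect();
}
System.out.println("mean accuracy over folds: " + sum / k);
// k models were built here -- so which one is "the" model Weka reports?
```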

Mr_and_Mrs_D
  • 32,208
  • 39
  • 178
  • 361
Titus Pullo
  • 3,751
  • 15
  • 45
  • 65

6 Answers

52

So, here is the scenario again: you have 100 labeled data points

Use training set

  • Weka takes the 100 labeled data points
  • it applies an algorithm to build a classifier from those 100 data points
  • it applies that classifier AGAIN to the same 100 data points
  • it reports the performance of the classifier (measured on the same 100 data points it was built from)

Use 10 fold CV

  • Weka takes the 100 labeled data points

  • it splits those 100 instances into 10 equal-sized folds of 10. In each round, 9 folds (90 labeled instances) are used for training and the remaining fold (10 labeled instances) for testing.

  • it builds a classifier with the algorithm from the 90 training instances and applies it to the 10 test instances of round 1.

  • it does the same thing for rounds 2 to 10 and produces 9 more classifiers

  • it averages the performance of the 10 classifiers produced from the 10 (90 training / 10 testing) splits; see the code sketch below
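
In Weka's Java API, the two test options look roughly like this (a minimal sketch; `labeled.arff` is a made-up file name standing in for your 100 labeled instances):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("labeled.arff");  // made-up file name
data.setClassIndex(data.numAttributes() - 1);

// "Use training set": build on all 100, then test on the same 100
J48 tree = new J48();
tree.buildClassifier(data);
Evaluation trainEval = new Evaluation(data);
trainEval.evaluateModel(tree, data);               // optimistic, sees its own training data
System.out.println(trainEval.toSummaryString());

// "Cross-validation, folds = 10": ten temporary 90/10 models, statistics combined
Evaluation cvEval = new Evaluation(data);
cvEval.crossValidateModel(new J48(), data, 10, new Random(1));
System.out.println(cvEval.toSummaryString());
```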

Let me know if that answers your question.

Rushdi Shams
  • 2,423
  • 19
  • 31
  • 1
    I've got 2 questions: 1) If it's like you said, why does the Weka guide say that in each case (training set and CV) the model is always built from all the data? As you wrote, in CV the final model is the average of the other 10 models, right? 2) If "the model you get at this point is the average of all the 10 models", how is it possible that using training set and CV as validation I got the same models? (Hope these questions don't appear too silly!) – Titus Pullo May 10 '12 at 18:42
  • 1. This means that for every fold the full dataset is considered. There are some variations of this standard CV where part of the dataset is held out for a separate test. 2. What exactly do you mean by "getting same models"? – Rushdi Shams May 11 '12 at 03:20
  • For "same models" I mean that in output I've got the exactly same tree – Titus Pullo May 11 '12 at 07:52
  • Did you consider a 29-fold CV? Can you please do a 29-fold CV and report whether or not the tree is the same as with "use training set"? – Rushdi Shams May 11 '12 at 22:08
  • 1
    Have a look at this post [link](https://list.scms.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html). So the model is exactly the same under each validation option! Let me know if you agree with me – Titus Pullo May 12 '12 at 08:04
  • Oh wow, great! This conversation really helped me a lot. Weka provides the averaged output, but saves the model built on the full set! Great, good to know! I edited the last point of my answer. Thanks. This was a good lesson for me; hope you got something out of it, too. – Rushdi Shams May 12 '12 at 19:46
  • 3
    So, for the community: I am sorry that I did not know that Weka gives you the same model no matter whether you choose training set or 10-fold CV. I made the necessary corrections to my answers and comments so that nobody picks up the misconception I previously had about Weka, even though it is usual practice in the ML community to report either the best model or the average model from 10-fold CV. I thought that Weka provides the average model, but I was completely wrong. Thanks @Lazza87. – Rushdi Shams May 12 '12 at 23:05
  • What is "the average model"? – Andreas Jan 10 '13 at 17:13
  • @Andreas, like for a regression problem, averaging the values of a particular feature from all the k models of a k-fold CV – Rushdi Shams Jan 14 '13 at 16:20
  • 4
    @Lazza87, your link is dead, could you please update it? thanks – Mohamed Taher Alrefaie Jan 13 '14 at 14:22
11

I would have answered in a comment but my reputation still doesn't allow me to:

In addition to Rushdi's accepted answer, I want to emphasize that the models which are created for the cross-validation fold sets are all discarded after the performance measurements have been carried out and averaged.

The resulting model is always based on the full training set, regardless of your test options. Since M-T-A was asking for an update to the quoted link, here it is: https://web.archive.org/web/20170519110106/http://list.waikato.ac.nz/pipermail/wekalist/2009-December/046633.html/. It's an answer from one of the WEKA maintainers, pointing out just what I wrote.
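
To see this in code (a minimal sketch against Weka's public API; `full.arff` is a made-up file name):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("full.arff");   // made-up file name
data.setClassIndex(data.numAttributes() - 1);

J48 tree = new J48();
Evaluation eval = new Evaluation(data);
// Trains an internal copy of `tree` for each of the 10 folds, collects the
// statistics, and discards the fold models; `tree` itself is left untrained.
eval.crossValidateModel(tree, data, 10, new Random(1));
System.out.println(eval.toSummaryString());

// The model Weka actually hands you is built on the full training set:
tree.buildClassifier(data);
```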

Hein Blöd
  • 1,553
  • 1
  • 18
  • 25
  • Do you know if there is a way to see the models created for the cross validation? – drevicko Aug 12 '15 at 05:04
  • yes: see posts on the weka mailing list [here](http://list.waikato.ac.nz/pipermail/wekalist/2015-July/064572.html) and [here](http://list.waikato.ac.nz/pipermail/wekalist/2011-November/053965.html) – drevicko Aug 12 '15 at 05:21
  • 2
    But then what's the purpose of cross-validation? If the final model given to the user is based on the full dataset, why do we need cross-validation? I think cross-validation is for finding the best model – lenhhoxung Aug 27 '15 at 13:18
  • 1
    @lenhhoxung As far as I understand, we want to optimize the parameters of the algorithm that we use to build our model. Like certain (hyper-)parameters of an SVM or ANN. That's why we assess how well these parameters perform (in case of CV) for different training sets and validation sets. For the resulting model that might be used for actual predictions, it's usually an advantage to use as much training data as possible, therefore it makes sense to build it on the whole dataset. – glaed Oct 11 '16 at 10:59
5

I think I figured it out. Take (for example) weka.classifiers.rules.OneR -x 10 -d outmodel.xxx. This does two things:

  1. It creates a model based on the full dataset. This is the model that is written to outmodel.xxx. This model is not used as part of cross-validation.
  2. Then cross-validation is run. Cross-validation involves creating (in this case) 10 new models, training and testing on segments of the data as has been described. The key is that the models used in cross-validation are temporary and only used to generate statistics. They are not equivalent to, nor used for, the model that is given to the user.
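
A rough Java equivalent of that command line (a sketch; `train.arff` is a made-up file name, and the order follows the two steps above):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("train.arff");      // made-up file name
data.setClassIndex(data.numAttributes() - 1);

// 1. model built on the full dataset -- this is what -d writes out
OneR full = new OneR();
full.buildClassifier(data);
SerializationHelper.write("outmodel.xxx", full);

// 2. cross-validation (-x 10) -- ten throwaway models, statistics only
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(new OneR(), data, 10, new Random(1));
System.out.println(eval.toSummaryString());
```
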
Sicco
  • 6,167
  • 5
  • 45
  • 61
cwins
  • 51
  • 1
  • 1
  • 1
    But then what's the purpose of cross-validation? If the final model given to the user is based on the full dataset, why do we need cross-validation? I think cross-validation is for finding the best model – lenhhoxung Aug 27 '15 at 13:18
  • Cross-validation is **not** used as a way of finding the best model, it is merely an approach to make the most out of limited data for calculating statistics (each row in your data will be used for testing). – fracpete Oct 19 '21 at 20:57
1

Weka follows the conventional k-fold cross-validation you mentioned here. You have the full dataset, which you divide into k equal-sized sets (k1, k2, ..., k10, for example, for 10-fold CV) without overlaps. In the first run, take k1 to k9 as the training set and build a model. Use that model on k10 to get the performance. Next, take k1 to k8 plus k10 as the training set, build a model from them, and apply it to k9 to get the performance. In this way, use all the folds, where each fold is used exactly once as the test set.

Then Weka averages the performances and presents the result in the output pane.
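
In code, the rotation looks roughly like this (a sketch using Weka's trainCV/testCV helpers; `folds.arff` is a made-up file name):

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("folds.arff");   // made-up file name
data.setClassIndex(data.numAttributes() - 1);

int k = 10;
double sum = 0;
for (int fold = 0; fold < k; fold++) {
    Instances train = data.trainCV(k, fold);      // the other k-1 folds
    Instances test  = data.testCV(k, fold);       // this fold: test set exactly once
    Classifier c = new J48();
    c.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(c, test);
    System.out.printf("fold %d as test set: %.2f%% correct%n", fold + 1, eval.pctCorrect());
    sum += eval.pctCorrect();
}
System.out.printf("average: %.2f%%%n", sum / k);
```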

Rushdi Shams
  • 2,423
  • 19
  • 31
  • 2
    Ok, but in this way how is the final model built? Is it an average of the 10 models built during CV? If yes, what does "is always built using ALL the data set" mean? – Titus Pullo May 04 '12 at 07:47
  • If you select 10-fold cross-validation on the Classify tab in the Weka Explorer, then the model you get is the one you get with 10 90/10 splits. You will not have 10 individual models but 1 single model. And yes, you get that from Weka (not particularly Weka; it is applicable to general 10-fold CV theory) as it runs through the entire dataset. – Rushdi Shams May 07 '12 at 17:50
  • I'm sorry but I can't understand at all... So what is the difference between choosing "Use training set" and "Cross-validation" in terms of how the model is built? The final model is the same! – Titus Pullo May 08 '12 at 14:46
  • When you are using "Use training set", then if you have 100 instances, Weka uses a classification algorithm defined by you to build a model from all 100 instances. Then, to test, it uses the same 100 instances. Therefore, "Use training set" normally gives good precision, recall and F-measure. But when you are using 10-fold CV, it builds 10 different models with 10 different folds and gives you the average precision, recall and F-measure. Sometimes it is necessary to use the training set, but in most cases 10-fold CV is preferable. THE FINAL MODELS WITH THESE TWO DIFFERENT SETUPS ARE NEVER THE SAME. – Rushdi Shams May 09 '12 at 19:54
  • So tell me if I understood correctly: using CV we build 10 models that are "similar" to the real one, i.e. built from "similar" data, which lets us evaluate the model on data that emulates what will be available in the future? – Titus Pullo May 10 '12 at 13:30
1

Once we've done the 10-fold cross-validation, dividing the data into 10 segments and creating and evaluating a decision tree for each split, Weka runs the algorithm an eleventh time on the whole dataset. That produces the classifier we might deploy in practice. We use 10-fold cross-validation in order to get an evaluation result and an estimate of the error, and then finally we run the learner one more time to get the actual classifier to use in practice. During the k rounds of cross-validation we get different decision trees, but the final one is created from the whole dataset. CV is used to see whether we have an overfitting or high-variance issue.
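
In terms of Weka's API, the "eleventh run" is just one extra buildClassifier call after the evaluation. A minimal sketch (`all.arff` is a made-up file name):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("all.arff");      // made-up file name
data.setClassIndex(data.numAttributes() - 1);

// runs 1-10: one temporary tree per fold, used only for the error estimate
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(new J48(), data, 10, new Random(1));
System.out.println(eval.toSummaryString());

// run 11: the tree you actually deploy, trained on the whole dataset
J48 deployed = new J48();
deployed.buildClassifier(data);
```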

0

According to "Data Mining with Weka" at The University of Waikato:

Cross-validation is a way of improving upon repeated holdout.
Cross-validation is a systematic way of doing repeated holdout that actually improves upon it by reducing the variance of the estimate.

  • We take a training set and we create a classifier
  • Then we’re looking to evaluate the performance of that classifier, and there’s a certain amount of variance in that evaluation, because it’s all statistical underneath.
  • We want to keep the variance in the estimate as low as possible.
    Cross-validation is a way of reducing the variance, and a variant on cross-validation called “stratified cross-validation” reduces it even further. (In contrast to the “repeated holdout” method, in which we hold out 10% for testing and repeat that 10 times.)

So how does cross-validation in Weka work?
With cross-validation, we divide our dataset just once, but we divide it into k pieces, for example, 10 pieces.
Then we take 9 of the pieces and use them for training, and the last piece we use for testing. Then, with the same division, we take another 9 pieces and use them for training and the held-out piece for testing. We do the whole thing 10 times, using a different segment for testing each time. In other words, we divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train on the rest, do the testing, and average the 10 results.


That would be 10-fold cross-validation. Divide the dataset into 10 parts (these are called “folds”); hold out each part in turn; and average the results. So each data point in the dataset is used once for testing and 9 times for training.
That’s 10-fold cross-validation.
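
For contrast, plain repeated holdout, which the lecture says cross-validation improves upon, might look like this (a rough sketch; `holdout.arff` is a made-up file name, and the 90/10 split mirrors the lecture's example):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// inside main(String[] args) throws Exception
Instances data = DataSource.read("holdout.arff");  // made-up file name
data.setClassIndex(data.numAttributes() - 1);

int repeats = 10;
double sum = 0;
for (int i = 0; i < repeats; i++) {
    Instances copy = new Instances(data);
    copy.randomize(new Random(i));                 // fresh random shuffle each repeat
    int trainSize = (int) Math.round(copy.numInstances() * 0.9);
    Instances train = new Instances(copy, 0, trainSize);
    Instances test  = new Instances(copy, trainSize, copy.numInstances() - trainSize);
    J48 tree = new J48();
    tree.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(tree, test);
    sum += eval.pctCorrect();
}
System.out.println("repeated holdout, mean accuracy: " + sum / repeats);
// unlike 10-fold CV, an instance may land in several test sets or in none,
// which is the source of the extra variance the lecture mentions
```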

Nov Joy
  • 11
  • 2