
I am using the cross_val_score function with LeaveOneOut as my data has only 60 samples.

I am confused about how cross_val_score computes the results for each estimate in Leave-One-Out cross-validation (LOOCV).

In LOOCV, for one instance, it fits a model, say a Decision Tree Classifier (DTC), using 59 samples for training and predicts the single remaining one.

Then the main question is this: Does it fit a new model at each instance (namely 60 different fits) inside cross_val_score?

If so, things get confusing.

Then I can get an average accuracy score (over the 60 fits) for performance evaluation. But I need to come up with a DTC model that is best in general, not just for my own data, even though it is based on my data.

If I use the entire dataset, then it fits perfectly, but that model simply overfits.

I want to have a single DTC model that works best in general based on my data.

Here is my code, if that makes sense:

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier(random_state=27, criterion='gini', max_depth=4, max_features='auto')
    loocv = LeaveOneOut()
    results = cross_val_score(model, X, y, cv=loocv)
– entropy

1 Answer


I do not fully understand what you want to find out.

Does it fit a new model at each instance (namely 60 different fits) inside cross_val_score?

Yes, it does in your case. What is the follow-up question that would help clarify the confusion you have in that case?
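
To make it concrete, here is a rough sketch of what cross_val_score does internally with LeaveOneOut (assuming X and y are the NumPy arrays from the question): a fresh clone of the estimator is fitted for every split, so with 60 samples you get 60 separate fits.

    from sklearn.base import clone
    from sklearn.model_selection import LeaveOneOut
    from sklearn.tree import DecisionTreeClassifier
    import numpy as np

    model = DecisionTreeClassifier(random_state=27, criterion='gini',
                                   max_depth=4, max_features='auto')

    scores = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        fold_model = clone(model)                    # fresh, unfitted copy for this split
        fold_model.fit(X[train_idx], y[train_idx])   # fit on the other 59 samples
        scores.append(fold_model.score(X[test_idx], y[test_idx]))  # 1 or 0 for the left-out sample

    print(np.mean(scores))  # same value as cross_val_score(model, X, y, cv=LeaveOneOut()).mean()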

The idea of CV is that you get a performance estimate of the model-building procedure you have chosen. The final model can (and, to benefit most from the data, should) be built on the full dataset. Then you can use it to predict on test data, and you can use your cross_val_score outcome as an estimate of the performance of this model. See a more elaborate answer as well as very useful links in my earlier answer.
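
A minimal sketch of that workflow, reusing the estimator and the X, y data from the question (assumed to be NumPy arrays):

    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier(random_state=27, criterion='gini',
                                   max_depth=4, max_features='auto')

    # 1. Estimate how well this model-building procedure generalises.
    cv_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    print('Estimated accuracy:', cv_scores.mean())

    # 2. Build the final model on the full dataset and use it on new data,
    #    reporting the LOOCV mean above as its expected performance.
    final_model = model.fit(X, y)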

My answer applies to a larger dataset. There might be nuances related to handling small datasets that I'm not aware of, but I do not see why the logic would not generalise to this case.

– Mischa Lisovyi
  • I need to get the model that works best in general. I see that with LOOCV, I can get an average performance of 60 different models on my data. Now, I need a single best model for general usage. If I use all the data, then the DTC model fits the data perfectly and the performance is 100%, i.e. it overfits. Should I use this DTC model, trained on the entire dataset, for general usage and say its average performance is the one I got from LOOCV? However, in that case, I realized that this overfitting model does not consider one of the features, which appears to be important for the CV performance – entropy Jun 11 '18 at 21:15
  • Yes, that's what you have to rely on. You either make the assumption that the model and its performance generalize, and thus the performance estimate from CV with LOO is representative, or you cannot make that assumption, and then the whole CV is pointless and you are in trouble, not being able to generalize at all. I encourage you to read the discussion in the links added to my earlier reply; it helps to get into the right mindset. A small clarification about `do not consider one of the features`: I suppose you mean `example` instead of `feature`. Or do you have 60 features? – Mischa Lisovyi Jun 11 '18 at 22:12
  • No, I have 60 samples and 5 features. If I plot the graph of the DTC (after fitting on the entire 60 samples), I do not see one of the 5 features used in the rules (it also shows as 0 in the feature importance plot). But I know it helps increase the average accuracy in the LOOCV analysis. – entropy Jun 11 '18 at 22:20
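
For reference, a quick sketch of how to check which features the full-data tree actually uses, via the fitted tree's feature_importances_ attribute (X and y as in the question):

    from sklearn.tree import DecisionTreeClassifier

    final_model = DecisionTreeClassifier(random_state=27, criterion='gini',
                                         max_depth=4, max_features='auto')
    final_model.fit(X, y)

    # One value per feature, summing to 1; a value of 0 means that feature
    # never appears in any split of the fitted tree.
    print(final_model.feature_importances_)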