
I'm studying some cross-validation scores on my dataset using cross_val_score and KFold. In particular, my code looks like this:

from sklearn.model_selection import KFold, cross_val_score
cross_val_score(estimator=model, X=X, y=y, scoring='r2', cv=KFold(shuffle=True))

My question is whether it is common practice to set shuffle=True inside KFold. If I do, the returned R² scores are:

[0.5934, 0.60432, 0.45689, 0.6875, 0.5678]

If I set shuffle=False, it returns:

[0.3987, 0.4576, 0.3234, 0.4567, 0.3233]

I would not want points that were used for training in one iteration to be reconsidered in the next iteration, which would give an optimistic cross-validation score. How should I explain that I get better scores using shuffle=True?


1 Answer


The general k-fold cross-validation procedure assumes the dataset has been shuffled randomly, so that each fold is representative of the whole dataset. Shuffling in KFold does not cause the leakage you are worried about: the folds are still disjoint, every sample appears in exactly one test fold, and shuffle=True only randomizes which fold each sample lands in.

If the data has no inherent ordering (i.e. it is not a time series), then shuffle=True is the right choice. With shuffle=False the folds are contiguous blocks of rows, so if your rows happen to be ordered (for example sorted by target, or grouped by class), each fold is systematically unrepresentative, which is the likely reason your scores are lower without shuffling.
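You can verify that shuffling does not leak training points into the test folds by printing the test indices KFold produces. A minimal sketch on a toy array (the fold count and random_state are arbitrary choices):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 toy samples

for shuffle in (False, True):
    # random_state only has an effect when shuffle=True
    kf = KFold(n_splits=5, shuffle=shuffle, random_state=0 if shuffle else None)
    test_folds = [test for _, test in kf.split(X)]
    print(f"shuffle={shuffle}:", [t.tolist() for t in test_folds])
    # Each index appears in exactly one test fold either way;
    # shuffle only changes which fold an index lands in.
    assert sorted(np.concatenate(test_folds).tolist()) == list(range(10))

Note also that KFold(shuffle=True) without a fixed random_state produces different folds, and hence slightly different scores, on every run.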

Note: train_test_split in sklearn has shuffle=True by default.
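As a quick illustration of that default (a minimal sketch with toy arrays):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=True by default: rows are randomly permuted before splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(y_test)  # two randomly chosen labels

# shuffle=False keeps the original row order (useful for time series)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
print(y_te)    # always the last 20% of rows: [8 9]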

Further reading:

https://scikit-learn.org/stable/modules/cross_validation.html#a-note-on-shuffling

https://www.kaggle.com/general/236904
