
I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a noticeably lower result than comparing the predicted labels with the correct ones by hand.

The first way, which gives

[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

kf = KFold(shuffle=True, n_splits=5)
scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    # fraction of correct predictions = accuracy
    scores.append(np.sum(y_pred == y_test) / len(y_test))

The second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775]):

from sklearn.model_selection import cross_val_score

model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv=5, scoring='accuracy')

What's my mistake?

Cœur

2 Answers


cross_val_score uses a StratifiedKFold cv iterator when not specified otherwise. StratifiedKFold keeps the ratio of classes the same in the train and test splits of each fold. For more explanation, see my other answer here.
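A small sketch of what stratification means in practice, using a hypothetical toy label array (80% class 0, 20% class 1, sorted by class): StratifiedKFold keeps the 80/20 ratio inside every test fold, while plain KFold just slices by position, so some folds end up with only one class:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: labels sorted by class, 80 zeros then 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# StratifiedKFold: every test fold keeps the 80/20 ratio.
skf = StratifiedKFold(n_splits=5)
for _, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # [16  4] in each fold

# Plain KFold (no shuffle): slices by position, so on sorted
# labels several folds contain only class 0.
kf = KFold(n_splits=5)
for _, test_idx in kf.split(X):
    print(np.bincount(y[test_idx], minlength=2))
```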

On the other hand, your first approach uses KFold, which does not preserve the class balance. In addition, you are shuffling the data there.

So in each fold, the two approaches see different data, and hence produce different results.
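One way to confirm this is to pass the same KFold iterator to cross_val_score, so both approaches see identical splits and the scores match. A minimal sketch, using synthetic data from make_classification as a stand-in for the question's X and y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; a fixed random_state makes the splits reproducible.
X, y = make_classification(n_samples=300, random_state=0)
kf = KFold(shuffle=True, n_splits=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)

# cross_val_score with an explicit cv iterator uses these exact folds.
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Manual loop over the same folds.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))

print(np.allclose(cv_scores, manual_scores))  # True
```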

Vivek Kumar

The low score from cross_val_score is probably because you are providing it the complete data instead of breaking it into training and test sets. This generally leads to leakage of information, which results in your model giving incorrect predictions. See this post for more explanation.


Gambit1614