
I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a noticeably lower result than comparing the predicted labels with the correct ones by hand.

The first way, which gives

[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

kf = KFold(shuffle=True, n_splits=5)
scores = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    # fraction of correct predictions = accuracy
    scores.append(np.sum(y_pred == y_test) / len(y_test))

The second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775]):

from sklearn.model_selection import cross_val_score

model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv=5, scoring='accuracy')

What's my mistake?

Cœur

2 Answers


cross_val_score uses a StratifiedKFold cv iterator when not specified otherwise. StratifiedKFold keeps the ratio of classes the same in the train and test splits of each fold. For more explanation, see my other answer here.
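A small sketch of what stratification means in practice, using a hypothetical toy label array (80% class 0, 20% class 1, sorted by class): StratifiedKFold keeps the 80/20 ratio inside every test fold, while plain KFold just slices by position, so some folds end up with only one class:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: labels sorted by class, 80 zeros then 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# StratifiedKFold: every test fold keeps the 80/20 ratio.
skf = StratifiedKFold(n_splits=5)
for _, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # [16  4] in each fold

# Plain KFold (no shuffle): slices by position, so on sorted
# labels several folds contain only class 0.
kf = KFold(n_splits=5)
for _, test_idx in kf.split(X):
    print(np.bincount(y[test_idx], minlength=2))
```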

On the other hand, your first approach uses KFold, which does not preserve the class balance. In addition, you are shuffling the data there.

So in each fold, the two approaches see different data, and hence produce different results.
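One way to confirm this is to pass the same KFold iterator to cross_val_score, so both approaches see identical splits and the scores match. A minimal sketch, using synthetic data from make_classification as a stand-in for the question's X and y:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; a fixed random_state makes the splits reproducible.
X, y = make_classification(n_samples=300, random_state=0)
kf = KFold(shuffle=True, n_splits=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=5)

# cross_val_score with an explicit cv iterator uses these exact folds.
cv_scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

# Manual loop over the same folds.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))

print(np.allclose(cv_scores, manual_scores))  # True
```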

Vivek Kumar

The low score from cross_val_score is probably because you are providing it the complete data instead of breaking it into training and test sets. This generally leads to leakage of information, which results in your model giving incorrect predictions. See this post for more explanation.


Gambit1614