
I have a strange Keras classification behaviour.

I got different accuracy when using cross-validation vs a holdout set.

Two identical models, but with different evaluation methods:

  • Model 1 uses 10-fold cross-validation (0.98 mean AUC, lowest fold AUC 0.89).
  • Model 2 uses a holdout set (accuracy 0.82).

I was expecting the worst accuracy of model 2 to be around the lowest fold value (0.89, not 0.82).

The data is small: ~10k rows x 13 columns.

K-fold: 10 folds

Model 1:

from keras import models, layers

def create_baseline():
    # define and compile a simple MLP for binary classification
    model = models.Sequential()
    model.add(layers.Dense(64, input_dim=set_1.iloc[:, 0:-1].shape[1], activation='relu'))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

This is the important part of my code (the rest is related to plotting the ROC):

Note: I have tried both with and without standardization

from itertools import cycle

import numpy as np
import matplotlib.pyplot as plt
from numpy import interp
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc
from keras.wrappers.scikit_learn import KerasClassifier

# scaling + Keras model wrapped in a single scikit-learn pipeline
estimators = []
estimators.append(('standardize', MinMaxScaler()))
# note: Keras 2 renamed nb_epoch to epochs
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=1000, batch_size=1000, verbose=0)))
pipeline = Pipeline(estimators)
cv = StratifiedKFold(n_splits=10)
classifier = pipeline
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue', 'darkorange'])
lw = 2
i = 0
for (train, test), color in zip(cv.split(X, y), colors):
    # fit on the training folds, then get probabilities for the held-out fold
    classifier.fit(X[train], y[train])
    probas_ = classifier.predict_proba(X[test])
    # per-fold ROC curve and AUC
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=lw, color=color,
             label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    i += 1

Output: [ROC plot with one curve per fold; mean AUC ≈ 0.98]

As you can see, I get a 0.98 average ROC AUC.
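
For reference, the omitted part computes the mean curve and AUC roughly like this (following the standard scikit-learn cross-validated ROC example; approximate, not the exact code):

# aggregate the per-fold curves into a mean ROC and its AUC
mean_tpr /= cv.get_n_splits(X, y)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)  # the ~0.98 figure comes from here
plt.plot(mean_fpr, mean_tpr, 'k--', lw=lw,
         label='Mean ROC (area = %0.2f)' % mean_auc)
plt.legend(loc='lower right')
plt.show()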

Issue:

Model 2:

from sklearn.model_selection import train_test_split

# scale the features, then split off a 10% holdout set
std = MinMaxScaler()
X_norm = std.fit_transform(X)
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(X_norm, y, test_size=0.1, random_state=5)

Keras model:

model_2 = models.Sequential()
model_2.add(layers.Dense(64, activation='relu', input_shape=(X_train_norm.shape[1],)))
model_2.add(layers.Dense(64, activation='relu'))
model_2.add(layers.Dense(1, activation='sigmoid'))
model_2.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])

Running the model:

history = model_2.fit(X_train_norm,
                      y_train_norm,
                      epochs=1000,
                      batch_size=1000,
                      validation_data=(X_test_norm, y_test_norm))

Results (last epochs):

8988/8988 [==============================] - 0s - loss: 0.3517 - acc: 0.8249 - val_loss: 0.3701 - val_acc: 0.7954
Epoch 997/1000
8988/8988 [==============================] - 0s - loss: 0.3516 - acc: 0.8238 - val_loss: 0.3699 - val_acc: 0.8059
Epoch 998/1000
8988/8988 [==============================] - 0s - loss: 0.3516 - acc: 0.8250 - val_loss: 0.3694 - val_acc: 0.8038
Epoch 999/1000
8988/8988 [==============================] - 0s - loss: 0.3512 - acc: 0.8241 - val_loss: 0.3692 - val_acc: 0.7975
Epoch 1000/1000
8988/8988 [==============================] - 0s - loss: 0.3504 - acc: 0.8247 - val_loss: 0.3696 - val_acc: 0.7975

Why is the performance of model 2 lower than model 1?

Notes:

  • Same data, same type of Keras model, and same seed, but different results!
  • I did multiple tests with and without standardization, and with the same and different seeds, and I still have the same issue.
  • I understand that I could use simpler models, but my issue is related to using the Keras classifier.

Please correct me if I am doing something wrong.

  • Well, my answer might sound dumb, but isn't that what K-fold validation is meant for? (i.e. a way to optimize the distribution between your training set and your dev set). The more you split your data into subsets to distribute between train and dev, the more likely you are to get a representative distribution of your population (hence a better accuracy). – zar3bski Jun 05 '18 at 22:37
  • I'm voting to close this question as off-topic because the question is fundamentally flawed at its premise. It is based on a comparison of two metrics that are incomparable. Mathematical synesthesia. It is like asking why the color red doesn't sound like clapping. – Engineero Jun 08 '18 at 14:20

1 Answer


You seem a little confused...

Why is the performance of model 2 lower than model 1?

It is not; to be precise, nothing in your results shows that it is either lower or higher.

Two identical models, but with different evaluation methods

You not only use different evaluation methods (CV vs. a validation set), but also different metrics: comparing the area under the ROC curve, i.e. the AUC (model 1), with accuracy (model 2) is exactly like comparing apples to oranges...

These metrics are not only different, they are fundamentally different and used for completely different purposes:

  • Accuracy implicitly involves a threshold applied to the computed probabilities; roughly speaking, and for binary classification, when the computed probability of a sample is higher than this threshold the sample is classified as 1, otherwise as 0. The accuracy is calculated after this threshold has been applied, i.e. on hard 0/1 predictions (this answer of mine explains the procedure in detail). Usually (and in your case here), this threshold is implicitly set to 0.5.

  • The ROC curve (and the AUC) does not involve final "hard" classifications (0/1); it works at the previous stage, i.e. with the probabilities computed by the model, and it gives an aggregate measure of a binary classifier's performance averaged over all possible thresholds. Consequently, ROC & AUC have little to say about a final, deployed model, which always includes the decision threshold mentioned above, and about that chosen threshold the ROC curve itself says nothing (see here for a more detailed exposition). The sketch right after this list illustrates the difference.
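
A toy sketch of the difference (hypothetical numbers, not from the question; only standard scikit-learn metric functions are used):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.6, 0.9, 0.45, 0.55])  # predicted probabilities from some model

print(roc_auc_score(y_true, y_prob))                         # AUC: uses the raw probabilities, no threshold
print(accuracy_score(y_true, (y_prob >= 0.5).astype(int)))   # accuracy at the implicit 0.5 threshold
print(accuracy_score(y_true, (y_prob >= 0.6).astype(int)))   # a different threshold -> a different accuracy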

UPDATE (after a lengthy discussion in the comments, which unfortunately didn't help to clarify things):

To convince yourself that this is indeed the case, try performing your model 1 CV but reporting the accuracy instead of the ROC AUC; this restores the "all other things being equal" condition necessary for such comparisons. You will see that the accuracy is indeed comparable to that of your model 2 (see the sketch below).
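
A minimal sketch of this check, assuming the same pipeline, X, and y used for model 1 above:

from sklearn.model_selection import cross_val_score, StratifiedKFold

# same 10-fold CV, but scored with accuracy instead of ROC AUC
acc_scores = cross_val_score(pipeline, X, y, cv=StratifiedKFold(n_splits=10), scoring='accuracy')
print(acc_scores.mean(), acc_scores.std())  # expect a mean in the same ballpark as model 2's ~0.82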

Please correct me if I am doing something wrong.

You can't say I didn't try...

  • @EuEgyEuEg You cannot say that "the ROC is higher than the accuracy". These are different things, and you cannot compare them! – desertnaut Jun 08 '18 at 13:46
  • Maybe it is wrong to use the word accuracy (as the ROC shows the "accuracy" at different thresholds over the probabilities), as shared in my link. –  Jun 08 '18 at 13:52