
I am using Keras to perform landmark detection, specifically locating parts of the body in a picture of a human. I have gathered around 2,000 training samples and am using rmsprop with an MSE loss function. After training my CNN, I am left with loss: 3.1597e-04 - acc: 1.0000 - val_loss: 0.0032 - val_acc: 1.0000.

I figured this meant my model would perform well on the test data; instead, the predicted points are way off from the labeled points. Any ideas or help would be greatly appreciated!

import pickle

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from keras import optimizers
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential

IMG_SIZE = 96
NUM_KEYPOINTS = 15
NUM_EPOCHS = 50
NUM_CHANNELS = 1

TESTING = True

def load(test=False):

    # load data from the CSV file (fname is the path to the training CSV, defined elsewhere)
    df = pd.read_csv(fname)

    # convert Image to numpy arrays
    df['Image'] = df['Image'].apply(lambda im: np.fromstring(im, sep=' '))
    df = df.dropna()    # drop rows with missing values

    X = np.vstack(df['Image'].values) / 255.    # scale pixel values to [0, 1]
    X = X.reshape(X.shape[0], IMG_SIZE, IMG_SIZE, NUM_CHANNELS)
    X = X.astype(np.float32)

    y = df[df.columns[:-1]].values
    y = (y - (IMG_SIZE / 2)) / (IMG_SIZE / 2)   # scale target coordinates to [-1, 1]
    X, y = shuffle(X, y, random_state=42)   # shuffle train data
    y = y.astype(np.float32)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

    return X_train, X_test, y_train, y_test

def build_model():

    # construct the neural network
    model = Sequential()

    model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, NUM_CHANNELS)))
    model.add(MaxPooling2D(2, 2))

    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(2, 2))

    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(2, 2))

    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(NUM_KEYPOINTS * 2))

    return model


if __name__ == '__main__':

    X_train, X_test, y_train, y_test = load(test=TESTING)

    model = build_model()

    sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd, loss='mse', metrics=['accuracy'])
    hist = model.fit(X_train, y_train, epochs=NUM_EPOCHS, verbose=1, validation_split=0.2)

    # save the model weights and the training history
    model.save_weights("/output/model_weights.h5")
    with open("/output/training_history", "wb") as histFile:
        pickle.dump(hist.history, histFile)
  • Just a guess: Do you scale your data before training? Are you applying the same scaling to your data during testing? – Fariborz Ghavamian Mar 16 '18 at 22:47
  • *Very* overfitting? – Prune Mar 16 '18 at 22:56
  • You could be overfitting to the validation set. Are your validation and test sets coming from the same distribution (do they look like one another)? It might not be a bad idea to select different validation and test sets and see what happens. – Fariborz Ghavamian Mar 17 '18 at 06:02
  • If you have labels for your test data, then make your test data your validation data and vice versa. Repeat the experiment and see if you get the same results, then update here. – Autonomous Mar 19 '18 at 20:33
  • If, after switching the val/test sets, you get the same results, then that probably means you are not loading your trained network correctly when testing on the new data. – Autonomous Mar 19 '18 at 23:46

2 Answers


According to this question, *How does keras define "accuracy" and "loss"?*, your "accuracy" is defined as categorical accuracy, which makes absolutely no sense for your problem.

After training you are left with a roughly 10x difference between your training loss and your validation loss, which would suggest overfitting (hard to say for sure without a graph and some examples).

To start fixing it:

  • Use a metric that makes sense in your context, one where you understand what it does and how it's computed.
  • Take random examples where the metric is very good and where it is very bad, and manually validate that this is really the case (otherwise you need a different metric).

In your case I would imagine a metric based on the distance between the desired locations and the predicted ones. This is not built into Keras, so you would have to implement it yourself, along the lines of the sketch below.
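Something like this (a rough sketch, not tested against your pipeline; it reuses the IMG_SIZE and NUM_KEYPOINTS constants from your code, and it assumes the targets are laid out as (x1, y1, x2, y2, ...) and scaled to [-1, 1] as in your load(); the name mean_keypoint_distance is just for illustration):

from keras import backend as K

def mean_keypoint_distance(y_true, y_pred):
    # y_true / y_pred: shape (batch, NUM_KEYPOINTS * 2), laid out as (x1, y1, x2, y2, ...)
    # undo the [-1, 1] scaling so the error is reported in pixel units
    true_px = y_true * (IMG_SIZE / 2) + (IMG_SIZE / 2)
    pred_px = y_pred * (IMG_SIZE / 2) + (IMG_SIZE / 2)

    # per-keypoint Euclidean distance, then mean over keypoints and batch
    true_px = K.reshape(true_px, (-1, NUM_KEYPOINTS, 2))
    pred_px = K.reshape(pred_px, (-1, NUM_KEYPOINTS, 2))
    dist = K.sqrt(K.sum(K.square(true_px - pred_px), axis=-1))
    return K.mean(dist)

# then, instead of metrics=['accuracy']:
# model.compile(optimizer=sgd, loss='mse', metrics=[mean_keypoint_distance])

A metric like this reports the average error in pixels, which is directly interpretable for keypoint predictions, unlike the default accuracy.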

Always be suspicious if the model says it's perfect.

Sorin
  • Thank you so much for the answer! This was dead on. I figured out that the built-in categorical accuracy was simply comparing the predicted and actual values. The problem is that in doing so it rounds the values, and with small, normalized values in the range of 0 to 1 that gets messed up: it would round both 0.76 and 0.99 to 1, compare the two, and say the prediction was 100% accurate. I was able to build a custom metric to measure the pixel error instead, and things are now much clearer. Thank you! – user3647894 Mar 24 '18 at 12:38

It is impossible to tell from your question alone, but I will venture a guess based on some implications of your data split.

Typically, when one splits one's data into more than two sets, one is using all but one of them to train on some parameter or another. For example, the first split is used to choose the model weights, the second split to choose the model architecture, and so on. Presumably you are tuning something with your 'validation' set, otherwise you wouldn't have it. Thus, the problem is almost certainly overfitting. The way you usually detect overfitting is by comparing the accuracy of your model on the data that was used to train it in any way (here, your 'training' and 'validation' splits) against its accuracy on a split the model has never touched (what you are calling your 'test' split).

So, per your question-comment, "I assume if the validation accuracy is that high then there is no overfitting, right?": no. If your model's accuracy on any data that was used to train anything at all is higher than its accuracy on data the model has never touched in any way, shape, or form, then you've overfit. That seems to be the case here.

OTOH, it may be the case that you've simply not shuffled your data. It's impossible to tell without having a look-see at the training/testing pipeline.
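One quick way to check (a rough sketch, reusing load(), build_model(), and the weights path from the question, and assuming load() applies the same preprocessing every time): reload the saved weights and compare the loss on the training split against the loss on the untouched test split. A large gap points at overfitting; a test loss wildly worse than the reported val_loss points at a loading or preprocessing mismatch.

# sketch: compare training vs. held-out test loss using the saved weights
X_train, X_test, y_train, y_test = load(test=True)

model = build_model()
model.compile(optimizer='rmsprop', loss='mse')   # optimizer choice is irrelevant for evaluation
model.load_weights("/output/model_weights.h5")

train_mse = model.evaluate(X_train, y_train, verbose=0)
test_mse = model.evaluate(X_test, y_test, verbose=0)
print("train MSE: %.6f   test MSE: %.6f" % (train_mse, test_mse))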

Him