I am currently trying to get a decent score (> 40% accuracy) with Keras on CIFAR-100. However, I'm seeing a weird behaviour of the CNN model: it tends to predict a few classes (2 to 5 of them) much more often than the others:

[Image: confusion matrix over the validation set, showing two bright vertical bars]

The pixel at position (i, j) shows how many validation samples of class i were predicted as class j. The diagonal therefore holds the correct classifications; everything off the diagonal is an error. The two vertical bars indicate that the model predicts those classes very often, even when the true class is different.
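
One way to quantify the effect (a minimal sketch; `cm` is the 100x100 confusion matrix as built by the visualization code below):

import numpy as np

# Column j counts how often class j was predicted across the validation set.
pred_counts = cm.sum(axis=0)
print(pred_counts.argsort()[::-1][:5])      # the most over-predicted classes
print(pred_counts.max(), pred_counts.mean())  # a balanced model stays near the mean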

CIFAR-100 is perfectly balanced: all 100 classes have 500 training samples each.
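
This is easy to verify with the same loader the training script uses:

import numpy as np
from keras.datasets import cifar100

(X, y), (X_test, y_test) = cifar100.load_data()
print(np.bincount(y.flatten()))  # prints 500 for each of the 100 classes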

Why does the model tend to predict some classes MUCH more often than other classes? How can this be fixed?

The code

Running this takes a while.

#!/usr/bin/env python

from __future__ import print_function
from keras.datasets import cifar100
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
import numpy as np

batch_size = 32
nb_classes = 100
nb_epoch = 50
data_augmentation = True

# input image dimensions
img_rows, img_cols = 32, 32
# The CIFAR-100 images are RGB.
img_channels = 3

# The data, shuffled and split between train and test sets:
(X, y), (X_test, y_test) = cifar100.load_data()
X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.20,
                                                  random_state=42)

# Shuffle training data (train_test_split already shuffles; this is just an extra safeguard)
perm = np.arange(len(X_train))
np.random.shuffle(perm)
X_train = X_train[perm]
y_train = y_train[perm]

print('X_train shape:', X_train.shape)
print(X_train.shape[0], 'train samples')
print(X_val.shape[0], 'validation samples')
print(X_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices.
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
Y_val = np_utils.to_categorical(y_val, nb_classes)

model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='same',
                        input_shape=X_train.shape[1:]))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Convolution2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Convolution2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(1024))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

X_train = X_train.astype('float32')
X_val = X_val.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_val /= 255
X_test /= 255

if not data_augmentation:
    print('Not using data augmentation.')
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              nb_epoch=nb_epoch,
              validation_data=(X_val, Y_val),  # one-hot labels, matching the loss
              shuffle=True)
else:
    print('Using real-time data augmentation.')
    # This will do preprocessing and realtime data augmentation:
    datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

    # Compute quantities required for featurewise normalization
    # (std, mean, and principal components if ZCA whitening is applied).
    datagen.fit(X_train)

    # Fit the model on the batches generated by datagen.flow().
    model.fit_generator(datagen.flow(X_train, Y_train,
                                     batch_size=batch_size),
                        samples_per_epoch=X_train.shape[0],
                        nb_epoch=nb_epoch,
                        validation_data=(X_val, Y_val))
    model.save('cifar100.h5')

Visualization code

#!/usr/bin/env python


"""Analyze a cifar100 keras model."""

from keras.models import load_model
from keras.datasets import cifar100
from sklearn.model_selection import train_test_split
import numpy as np
import json
import io
import matplotlib.pyplot as plt
try:
    to_unicode = unicode
except NameError:
    to_unicode = str

n_classes = 100


def plot_cm(cm, zero_diagonal=False):
    """Plot a confusion matrix."""
    n = len(cm)
    size = int(n / 4.)
    fig = plt.figure(figsize=(size, size), dpi=80, )
    plt.clf()
    ax = fig.add_subplot(111)
    ax.set_aspect(1)
    res = ax.imshow(np.array(cm), cmap=plt.cm.viridis,
                    interpolation='nearest')
    width, height = cm.shape
    fig.colorbar(res)
    plt.savefig('confusion_matrix.png', format='png')

# Load model
model = load_model('cifar100.h5')

# Load validation data
(X, y), (X_test, y_test) = cifar100.load_data()

X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.20,
                                                  random_state=42)

# Calculate confusion matrix
y_val_i = y_val.flatten()
y_val_pred = model.predict(X_val)
y_val_pred_i = y_val_pred.argmax(1)
cm = np.zeros((n_classes, n_classes), dtype=int)
for i, j in zip(y_val_i, y_val_pred_i):
    cm[i][j] += 1

acc = sum([cm[i][i] for i in range(n_classes)]) / float(cm.sum())
print("Validation accuracy: %0.4f" % acc)

# Create plot
plot_cm(cm)

# Serialize confusion matrix
with io.open('cm.json', 'w', encoding='utf8') as outfile:
    str_ = json.dumps(cm.tolist(),
                      indent=4, sort_keys=True,
                      separators=(',', ':'), ensure_ascii=False)
    outfile.write(to_unicode(str_))

Red herrings

tanh

I've replaced tanh with relu. The history CSV looks ok, but the visualization has the same problem:

[Image: confusion matrix with relu instead of tanh; the vertical bars remain]

Please also note that the validation accuracy here is only 3.44%.

Dropout + tanh + border mode

Removing dropout, replacing tanh with relu, and setting the border mode to same everywhere: history CSV

[Image: confusion matrix without dropout, with relu and border mode same; the problem persists]

The visualization code still gives a much lower accuracy (8.50% this time) than the Keras training code.
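
A useful cross-check at this point (a sketch, assuming the model and the scaled X_val/Y_val arrays from the training script) is to let Keras score the same split that the visualization script scores by hand; if the two numbers disagree, the preprocessing must differ:

# Keras applies no preprocessing of its own, so any gap to the
# hand-computed accuracy points at a preprocessing mismatch.
loss, acc = model.evaluate(X_val, Y_val, batch_size=32, verbose=0)
print('Keras evaluate accuracy: %.4f' % acc)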

Q & A

The following is a summary of the comments:

  • The data is evenly distributed over the classes, so the over-predicted classes are not over-represented in training.
  • Data augmentation is used, but the problem persists without it as well.
  • The visualization is not the problem.
– Martin Thoma
  • I would try to double the number of filters or reduce the `filter_size` to `(2, 2)`. The receptive field of each unit which gets into the `Dense` layer is bigger than half of the image size, which might cause problems (a quick receptive-field check follows after these comments). – Marcin Możejko Mar 10 '17 at 10:38
  • See also: [Neural network always predicts the same class](http://stackoverflow.com/a/41493375/562769) – Martin Thoma May 07 '17 at 11:10
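
To make the receptive-field comment concrete, here is a small computation for the conv stack above (a sketch; the kernel/stride pairs are read off the model definition):

# Receptive field after conv3-conv3-pool2-conv3-conv3-pool2.
rf, jump = 1, 1
for kernel, stride in [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]:
    rf += (kernel - 1) * jump
    jump *= stride
print(rf)  # 16 -> each unit entering Flatten sees half of the 32px image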

4 Answers


If you get good accuracy during training and validation but not when testing, make sure you apply exactly the same preprocessing to your dataset in both cases. In your training script you have:

X_train /= 255
X_val /= 255
X_test /= 255

But there is no such scaling in the code that produces your confusion matrix. Adding this to the visualization script:

X_val = X_val.astype('float32') / 255.

This gives the following nice-looking confusion matrix:

[Image: the corrected confusion matrix, with a clear diagonal]
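
For completeness, a sketch of where the fix belongs in the visualization script (same variable names as above):

# Apply exactly the same preprocessing as training before predicting.
X_val = X_val.astype('float32') / 255.
y_val_pred = model.predict(X_val)
y_val_pred_i = y_val_pred.argmax(1)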

– yhenon

I don't have a good feeling about this part of the code:

model.add(Dense(1024))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

The rest of the model is full of relus, but here there is a tanh.

tanh saturates at -1 and 1, so its gradients can vanish and its activations can get stuck at the extremes, which might lead to the over-importance of those two classes.
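
A minimal numeric illustration of the saturation (plain NumPy, independent of the model):

import numpy as np

# The derivative of tanh is 1 - tanh(x)^2; it collapses for large |x|.
x = np.array([0.0, 2.0, 5.0])
print(1 - np.tanh(x) ** 2)  # [1.0, 0.0707, 0.00018] -- gradients vanish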

The Keras CIFAR-10 example uses basically the same architecture (the dense-layer sizes might differ), but it also uses a relu there (no tanh at all). The same goes for this external Keras-based CIFAR-100 code.

– sascha

One important part of the problem was that my ~/.keras/keras.json was:

{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

Hence I had to change image_dim_ordering to tf. This leads to

[Image: confusion matrix after fixing the image dim ordering]

and an accuracy of 12.73%. Obviously, there is still a problem, as the training history reported a validation accuracy of 45.1%.
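
A quick way to check which ordering is active at runtime (Keras 1.x API; it has to match the ordering the model was trained with):

from keras import backend as K

# 'tf' expects (rows, cols, channels); 'th' expects (channels, rows, cols).
print(K.image_dim_ordering())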

– Martin Thoma
  1. I don't see you doing any mean-centering, even in datagen. I suspect this is the main cause. To do mean-centering with ImageDataGenerator, set featurewise_center=True. Another way is to subtract the ImageNet mean from each RGB pixel; the mean vector to subtract is [103.939, 116.779, 123.68] (a sketch of both options follows after this list).

  2. Make all activations relus, unless you have a specific reason to have a single tanh.

  3. Remove the two dropouts of 0.25 and see what happens. If you want to apply dropout to a convolution layer, it is better to use SpatialDropout2D. It has somehow been removed from the Keras online documentation, but you can find it in the source.

  4. You have two conv layers with border mode same and two with valid. There is nothing wrong with this, but it would be simpler to keep all conv layers at same and control the size with the max-poolings alone.
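
A sketch of the two mean-centering options from point 1 (using the X_train/X_val arrays from the question's script; the per-channel variant computes the mean from the training set itself rather than using the ImageNet values):

from keras.preprocessing.image import ImageDataGenerator

# Option A: let the generator subtract the dataset mean.
datagen = ImageDataGenerator(featurewise_center=True,
                             horizontal_flip=True)
datagen.fit(X_train)  # computes the mean that flow() will subtract

# Option B: subtract a per-channel mean by hand.
channel_mean = X_train.mean(axis=(0, 1, 2))
X_train -= channel_mean
X_val -= channel_mean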

– Autonomous
  • Except for the mean subtraction, I applied all the changes you suggested as possible causes (see the updated question). Still the same problem. I'm now running it with mean subtraction, but I'm pretty sure this is still not the problem. – Martin Thoma Mar 10 '17 at 11:15
  • As expected, this wasn't the problem either. – Martin Thoma Mar 10 '17 at 13:02