Keras BatchNorm Layer giving weird results during training and inference

Question

I am having an issue with Keras where the evaluate function reports a much higher loss and a much lower accuracy on the training data than the values logged during training. I am aware that this question has already been asked in several places (here, here), but I think my issue is different and is not answered in those threads.

Explanation of the Task

It is supposed to be a very simple task. All I am doing is overfitting my own dataset of 256 images (29x29x3) with 256 output classes (one for each image).

Dataset

Case 1
x_train = All the pixel values in the image = i where i goes from 0 to 255.
y_train = i

Case 2
x_train = Centre 5*5 patch of the pixel values in the image = i where i goes from 0 to 255. All the other pixel values are same for all the images.
y_train = i

This gives me 256 images in total for the training data in each case. (This should be clearer from the code.)

Here is my code to reproduce the issue:

from __future__ import print_function

import os

import keras
from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, Activation
from keras.layers.normalization import BatchNormalization
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, Callback
from keras import backend as K
from keras.regularizers import l2

import matplotlib.pyplot as plt
import PIL.Image
import numpy as np
from IPython.display import clear_output

# The GPU id to use, usually either "0" or "1"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="1"

# To suppress the warnings
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

## Hyperparameters
batch_size = 256
num_classes = 256
l2_reg=0.0
epochs = 500


## input image dimensions
img_rows, img_cols = 29, 29

## Train Image (I took a random image from ImageNet)
train_img_name = 'n01871265_279.JPEG'
ret = PIL.Image.open(train_img_name) #Opening the image
ret = ret.resize((img_rows, img_cols)) #Resizing the image
img = np.asarray(ret, dtype=np.uint8).astype(np.float32) #Converting it to numpy array
print(img.shape) # (29, 29, 3)

## Creating the training data
#############################
x_train = np.zeros((256, img_rows, img_cols, 3))
y_train = np.zeros((256,), dtype=int)
for i in range(len(y_train)):
    temp_img = np.copy(img)
    ## Case1 of dataset
    # temp_img[:, :, :] = i # changing all the pixel values
    ## Case2 of dataset
    temp_img[12:17, 12:17, :] = i # changing the centre block of 5*5 pixels (rows/cols 12..16)
    x_train[i, :, :, :] = temp_img
    y_train[i] = i
##############################

## Common stuff in Keras
if K.image_data_format() == 'channels_first':
    print('Channels First')
    # Move the channel axis to the front; a plain reshape would scramble the pixels
    x_train = np.transpose(x_train, (0, 3, 1, 2))
    input_shape = (3, img_rows, img_cols)
else:
    print('Channels Last')
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 3)
    input_shape = (img_rows, img_cols, 3)

## Normalizing the pixel values
x_train = x_train.astype('float32')
x_train /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')

## convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)

## Model definition
def model_toy(mom):
    model = Sequential()

    model.add( Conv2D(filters=64, kernel_size=(7, 7), strides=(1,1), input_shape=input_shape, kernel_regularizer=l2(l2_reg)) )
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))
    #Default parameters kept same as PyTorch
    #Meaning of PyTorch momentum is different from Keras momentum.
    # PyTorch mom = 0.1 is same as Keras mom = 0.9

    model.add( Conv2D(filters=128, kernel_size=(7, 7), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=256, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=512, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=1024, kernel_size=(5, 5), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add( Conv2D( filters=2048, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=l2(l2_reg) ) )
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    model.add(Conv2D(filters=4096, kernel_size=(3, 3), strides=(1, 1), kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))


    # Passing it to a dense layer
    model.add(Flatten())

    model.add(Dense(1024, kernel_regularizer=l2(l2_reg)))
    model.add(Activation('relu'))
    model.add(BatchNormalization(momentum=mom, epsilon=0.00001))

    # Output Layer
    model.add(Dense(num_classes, kernel_regularizer=l2(l2_reg)))
    model.add(Activation('softmax'))

    return model

mom = 0.9 #0
model = model_toy(mom)
model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(lr=0.001),
              #optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=0.0, nesterov=True),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    shuffle=True,
                   )

print('Training results')
print('-------------------------------------------')
score = model.evaluate(x_train, y_train, verbose=1)
print('Training loss:', score[0])
print('Training accuracy:', score[1])
print('-------------------------------------------')

A small note: I was able to do this task successfully in PyTorch; it is just that my actual task requires a Keras model. That is why I changed the default values of the BatchNorm layer (the root cause of the issue) to match the ones I used to train the PyTorch model.
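The comment in model_toy about the two momentum conventions can be checked with a small sketch. This is plain Python rather than either library's code, and the two function names are hypothetical; it assumes the PyTorch convention weights the new batch statistic by momentum, while the Keras convention weights the old running statistic by momentum.

```python
import math

def pytorch_style_update(running, batch, momentum=0.1):
    # PyTorch convention: momentum weights the new batch statistic.
    return (1.0 - momentum) * running + momentum * batch

def keras_style_update(running, batch, momentum=0.9):
    # Keras convention: momentum weights the old running statistic.
    return momentum * running + (1.0 - momentum) * batch

running, batch = 2.0, 5.0  # made-up values, for illustration only
pt = pytorch_style_update(running, batch, momentum=0.1)
ks = keras_style_update(running, batch, momentum=0.9)
print(pt, ks)  # the two agree up to floating-point rounding
```

Under these assumptions, Keras momentum = 1 - PyTorch momentum, which is why mom = 0.9 and epsilon = 0.00001 are passed explicitly above instead of relying on the Keras defaults (momentum=0.99, epsilon=0.001).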

Here is the image that I used in my code.

Here are the results of training.
Case1 of the dataset
Case2 of the dataset

If you look at these two files, you will notice the discrepancy between the loss reported during training and the loss reported at inference.
(I have set my batch size equal to the size of my training data so as to avoid some of the reasons BatchNorm generally creates problems, as mentioned here.)

Next, I looked at the Keras source code to see if there is any way to make the BatchNorm layer use the batch statistics instead of the running mean and variance.
Here is the update formula that Keras (backend: TF) uses to update the running mean and variance.

#running_stat -= (1 - momentum) * (running_stat - batch_stat)

So if I set the momentum value to 0, the value assigned to running_stat would always be equal to batch_stat during the training phase. Thus, the value used in inference mode would also be the same as (or close to) the batch/dataset statistics.
Here are the results of this little experiment; the same issue still occurs.
Case1 of the dataset
Case2 of the dataset
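To make the reasoning about momentum = 0 concrete, here is a minimal NumPy sketch of the update rule quoted above. It is not the Keras source: update_running_stat is a hypothetical helper, and the batch means are made-up numbers.

```python
import numpy as np

def update_running_stat(running_stat, batch_stat, momentum):
    # Hypothetical helper mirroring the quoted rule:
    # running_stat -= (1 - momentum) * (running_stat - batch_stat)
    return running_stat - (1.0 - momentum) * (running_stat - batch_stat)

batch_mean = np.array([0.5, -1.0, 2.0])  # made-up per-channel batch means
running_mean = np.zeros(3)               # Keras initialises the moving mean to zeros

# momentum = 0: the running statistic is overwritten by the batch statistic
after_mom0 = update_running_stat(running_mean, batch_mean, momentum=0.0)

# momentum = 0.9: the running statistic moves only 10% of the way per step
after_mom09 = update_running_stat(running_mean, batch_mean, momentum=0.9)

print(after_mom0)
print(after_mom09)
```

One caveat I am not sure about: even with momentum = 0, the stored statistics come from the forward pass made before that step's weight update, so at inference they may still lag the final weights by one step.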

Programming Environment - Python-3.5.2, tensorflow-1.10.0, keras-2.2.4

I tried the same thing with tensorflow-1.12.0, keras-2.2.2 as well but it still did not solve the issue.
