batch normalization, yes or no?

Question

I use TensorFlow 1.14.0 and Keras 2.2.4. The following code implements a simple neural network:

import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
tf.set_random_seed(3)

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Activation


x_train=np.random.normal(0,1,(100,12))

model = Sequential()
model.add(Dense(8, input_shape=(12,)))
# model.add(tf.keras.layers.BatchNormalization())
model.add(Activation('linear'))
model.add(Dense(12))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train, x_train,epochs=20, validation_split=0.1, shuffle=False,verbose=2)

The final val_loss after 20 epochs is 0.7751. When I uncomment the only comment line to add the batch normalization layer, the val_loss changes to 1.1230.

My main problem is way more complicated, but the same thing occurs. Since my activation is linear, it does not matter if I put the batch normalization after or before the activation.

Questions: Why doesn't batch normalization help here? Is there anything I can change so that batch normalization improves the result without changing the activation functions?

Update after getting a comment:

An NN with one hidden layer and linear activations is kind of like PCA. There are tons of papers on this. For me, this setting gives minimal MSE among all combinations of activation functions for the hidden layer and output.

Some resources that state linear activations mean PCA:

https://arxiv.org/pdf/1702.07800.pdf

https://link.springer.com/article/10.1007/BF00275687

https://www.quora.com/How-can-I-make-a-neural-network-to-work-as-a-PCA
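
For reference, the PCA side of this comparison can be sketched in plain NumPy. This is only a minimal sketch under assumptions, not code from the resources above: the 90/10 split mirrors validation_split=0.1, the helper name pca_val_mse is hypothetical, and PCA is computed through an SVD of the centered training rows.

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.normal(0, 1, (100, 12))
x_fit, x_val = x[:90], x[90:]  # assumed split: first 90 rows to fit, last 10 held out


def pca_val_mse(k):
    """Fit PCA with k components on x_fit; return reconstruction MSE on x_val."""
    mean = x_fit.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x_fit - mean, full_matrices=False)
    components = vt[:k]
    # Project onto the k directions, then map back to the 12-dim input space.
    recon = (x_val - mean) @ components.T @ components + mean
    return float(np.mean((x_val - recon) ** 2))


mse_8 = pca_val_mse(8)    # same bottleneck width as Dense(8)
mse_12 = pca_val_mse(12)  # no bottleneck at all
print("PCA val MSE, k=8: ", mse_8)
print("PCA val MSE, k=12:", mse_12)
```

With k=12 the reconstruction is exact up to floating-point error, since nothing is discarded; with k=8 some error remains, and that number is the baseline a linear 12-8-12 network would be compared against.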

I can't seem to reproduce the same behaviour with tensorflow 2.0. Are you using tensorflow 1.x? — kluu, Oct 29 '19 at 19:30
@kluu Yes, I use Tensorflow 1.14.0 and Keras 2.2.4. What results do you get? Is that better with BN for you? — Albert, Oct 29 '19 at 20:39
Yes, it's better with BN. Using the exact same code (with the exception of the seed setter of tensorflow), I get a `val_loss: 1.1346` without BN and `val_loss: 1.0833` with the BN layer. — kluu, Oct 29 '19 at 21:01
Irrelevant (?) to the issue, but a NN consisting only of linear activations does not make much sense (and I highly doubt anyone has ever tried BN with such a configuration). — desertnaut, Oct 29 '19 at 22:57
@desertnaut I strongly disagree. An NN with one hidden layer and linear activations is kind of like PCA. There are tons of papers on this. For me, this setting gives minimal MSE among all combinations of activation functions for the hidden layer and output. — Albert, Oct 29 '19 at 23:15
PCA is a *linear* transformation, so no big surprises here; and if you want low MSE, you could try PCA itself, and get a *perfect* reconstruction (i.e. zero MSE). Not sure what exactly you are trying to do here though, and I may be missing something in your purpose... — desertnaut, Oct 29 '19 at 23:59
@desertnaut Thanks. But, the MSE on the test set has to be minimal. — Albert, Oct 30 '19 at 00:07
Didn't disagree on that. But first, before trying fancy modeling, we start with a *baseline*, which in this case should be simple plain old PCA; if we can't do better than this, we don't have a case actually (BTW, kindly share a couple of those papers, if you don't mind) — desertnaut, Oct 30 '19 at 00:14
@desertnaut Thanks for the info. NN is the only tool I can use to solve this problem. I am updating the question to add references for the PCA. — Albert, Oct 30 '19 at 00:29
sounds like a contrived problem since a NN is "the only tool you can use". for small data and small size of input vectors, I rarely get best results using NNs. is this homework/lab exercise? — user1269942, Oct 30 '19 at 00:41
@user1269942 No, it's a real-world problem. The NN is also a real-world restriction. BN does not help with even 100k samples. — Albert, Oct 30 '19 at 00:43
fair enough. since it is a small net, have you tried varying batch size and learning rate a bunch? — user1269942, Oct 30 '19 at 00:45
Interesting paper links. P.S., didn't forget about the other question. — OverLordGoldDragon, Oct 30 '19 at 02:05
@user1269942 I have tried batch size, not much the learning rate. Can you please let me know how I should check different learning rates? The default is 0.001, how can I sweep? just trying 0.01, 0.1, 1, 10? — Albert, Oct 30 '19 at 18:57
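
On the sweep question in the last comment: a common approach is to try learning rates spaced evenly on a log scale around the default of 0.001. The sketch below is an assumption-laden illustration rather than a recipe. The names sweep and toy_train_and_eval are hypothetical, and the toy function swaps in plain NumPy gradient descent on a linear 12-8-12 autoencoder as a stand-in for "build, compile and fit the Keras model with this learning rate, then return the final val_loss".

```python
import numpy as np


def sweep(train_and_eval, rates):
    """Call train_and_eval(rate) for each rate; return (best_rate, all results)."""
    results = {float(r): float(train_and_eval(r)) for r in rates}
    best = min(results, key=results.get)
    return best, results


def toy_train_and_eval(lr, steps=200):
    """Stand-in for a Keras fit: full-batch gradient descent, returns val MSE."""
    rng = np.random.RandomState(1)
    x = rng.normal(0, 1, (100, 12))
    x_fit, x_val = x[:90], x[90:]
    w1 = rng.normal(0, 0.1, (12, 8))
    w2 = rng.normal(0, 0.1, (8, 12))
    with np.errstate(all="ignore"):  # a too-large rate may overflow; treated as inf below
        for _ in range(steps):
            h = x_fit @ w1
            err = h @ w2 - x_fit
            # Gradients of mean squared error with respect to both weight matrices.
            g2 = h.T @ err * (2 / err.size)
            g1 = x_fit.T @ (err @ w2.T) * (2 / err.size)
            w1 -= lr * g1
            w2 -= lr * g2
        val = np.mean((x_val @ w1 @ w2 - x_val) ** 2)
    return val if np.isfinite(val) else np.inf


rates = np.logspace(-4, -1, 4)  # 1e-4, 1e-3, 1e-2, 1e-1
best, results = sweep(toy_train_and_eval, rates)
for rate, loss in sorted(results.items()):
    print("lr=%g  val_loss=%.4f" % (rate, loss))
print("best lr:", best)
```

In the real setting, the callable would rebuild the model from scratch for every rate, so that runs do not share weights, and pass the rate to the optimizer instead of using the string 'adam'.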

OverLordGoldDragon · Accepted Answer · 2019-10-30T16:40:25.147

Yes.

The behavior you're observing is a bug - and you don't need BN to see it; plot to the left is for #V1, to the right is for #V2:

#V1
model = Sequential()
model.add(Dense(8, input_shape=(12,)))
#model.add(Activation('linear')) <-- uncomment == #V2
model.add(Dense(12))
model.compile(optimizer='adam', loss='mean_squared_error')

Clearly nonsensical, as Activation('linear') after a layer with activation=None (=='linear') is an identity: model.layers[1].output.name == 'activation/activation/Identity:0'. This can be confirmed further by fetching and plotting intermediate layer outputs, which are identical for 'dense' and 'activation' - will omit here.

So, the activation does literally nothing, except it doesn't - somewhere along the commit chain between 1.14.0 and 2.0.0, this was fixed, though I don't know where. Results w/ BN using TF 2.0.0 w/ Keras 2.3.1 below:

val_loss = 0.840 # without BN
val_loss = 0.819 # with BN

Solution: update to TensorFlow 2.0.0, Keras 2.3.1.

Tip: use Anaconda w/ virtual environment. If you don't have any virtual envs yet, run:

conda create --name tf2_env --clone base
conda activate tf2_env
conda uninstall tensorflow-gpu
conda uninstall keras
conda install -c anaconda tensorflow-gpu==2.0.0
conda install -c conda-forge keras==2.3.1

It may be a bit more involved than this, but that's the subject of another question.

UPDATE: importing from keras instead of tf.keras also solves the problem.

Disclaimer: BN remains a 'controversial' layer in Keras, yet to be fully fixed - see Relevant Git; I plan on investigating it myself eventually, but for your purposes, this answer's fix should suffice.

I also recommend familiarizing yourself with BN's underlying theory, in particular regarding its train vs. inference operation; in a nutshell, batch sizes under 32 are a pretty bad idea, and the dataset should be large enough for BN's learned gamma and beta, together with its moving mean and variance, to carry over accurately to the test set.
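
To make the train vs. inference distinction concrete, here is a minimal NumPy sketch. It is not Keras code: bn_train and bn_infer are hypothetical names, and the defaults (momentum=0.99, epsilon=1e-3, moving mean starting at 0, moving variance starting at 1) are assumed to match the tf.keras BatchNormalization defaults.

```python
import numpy as np


def bn_train(x, gamma, beta, moving_mean, moving_var, momentum=0.99, eps=1e-3):
    """Training mode: normalize with the current batch's own statistics."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    y = gamma * (x - mean) / np.sqrt(var + eps) + beta
    # The moving averages are only updated here; inference is what uses them.
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return y, moving_mean, moving_var


def bn_infer(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Inference mode: normalize with the stored moving statistics."""
    return gamma * (x - moving_mean) / np.sqrt(moving_var + eps) + beta


rng = np.random.RandomState(0)
batch = rng.normal(5.0, 2.0, (32, 8))  # features with mean near 5, std near 2
gamma, beta = np.ones(8), np.zeros(8)
moving_mean, moving_var = np.zeros(8), np.ones(8)

n_updates = 60  # the same batch is reused to keep the arithmetic easy to follow
for _ in range(n_updates):
    y_train, moving_mean, moving_var = bn_train(
        batch, gamma, beta, moving_mean, moving_var)

y_infer = bn_infer(batch, gamma, beta, moving_mean, moving_var)
print("train-mode output mean:    ", y_train.mean())
print("inference-mode output mean:", y_infer.mean())
```

With momentum 0.99, the moving mean after n updates still carries a 0.99**n share of its initial value, which is roughly 0.55 after 60 updates. On short runs the inference-time statistics can therefore lag well behind the batch statistics seen in training, so the same input is normalized differently in the two modes.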

Code used:

x_train=np.random.normal(0, 1, (100, 12))

model = Sequential()
model.add(Dense(8, input_shape=(12,)))
#model.add(Activation('linear'))
#model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(12))
model.compile(optimizer='adam', loss='mean_squared_error')

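W_sum_all = []  # fit rewritten to allow runtime weight collection
for _ in range(20):  # 20 epochs
    for i in range(9):  # 9 batches of 10 = first 90 samples (training split)
        x = x_train[i*10:(i+1)*10]
        model.train_on_batch(x, x)

        # after every batch, record the sum of each trainable layer's first weight array
        W_sum_all.append([])
        for layer in model.layers:
            if layer.trainable_weights:
                W_sum_all[-1].append(np.sum(layer.get_weights()[0]))
# evaluate on the held-out last 10 samples, which were never trained on
model.evaluate(x_train[-10:], x_train[-10:])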

plt.plot(W_sum_all)
plt.title("Sum of weights (#V1)", weight='bold', fontsize=14)
plt.legend(labels=["dense", "dense_1"], fontsize=14)
plt.gcf().set_size_inches(7, 4)

Imports/pre-executions:

import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
if tf.__version__[0] == '2':
    tf.random.set_seed(3)
else:
    tf.set_random_seed(3)

import matplotlib.pyplot as plt
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Activation

Many thanks for investigating this and I appreciate your time. I did not know about the bug. It is weird. I have upvoted your answer. I would like to see other people's answers as well. — Albert, Oct 30 '19 at 18:56
@Albert very low chance you'll see anything even *remotely* close to the depth & quality of this answer; be sure to also check OP's incredible investigation on [TensorFlow 2 vs TensorFlow 1 performance issues](https://stackoverflow.com/questions/58441514/why-is-tensorflow-2-much-slower-than-tensorflow-1). — desertnaut, Nov 01 '19 at 12:20
@desertnaut Mind if I use your comment in my resume? Something like -- [Why is TF2...] - "incredible investigation", C. I. Tsatsoulis, Lead Data Scientist at Nodalpoint Systems. -- Could score some points with this - doubt anyone will bug you over it — OverLordGoldDragon, Nov 03 '19 at 17:22
@OverLordGoldDragon by all means! Although SO comments are "second class citizens", and they may be removed without warning; so, I just posted a [tweet](https://twitter.com/desertnaut/status/1191045432996175872) for you ;) — desertnaut, Nov 03 '19 at 17:35
