I use Tensorflow 1.14.0 and Keras 2.2.4. The following code implements a simple neural network:
import numpy as np
np.random.seed(1)
import random
random.seed(2)
import tensorflow as tf
tf.set_random_seed(3)
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, Activation
x_train=np.random.normal(0,1,(100,12))
model = Sequential()
model.add(Dense(8, input_shape=(12,)))
# model.add(tf.keras.layers.BatchNormalization())
model.add(Activation('linear'))
model.add(Dense(12))
model.add(Activation('linear'))
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(x_train, x_train,epochs=20, validation_split=0.1, shuffle=False,verbose=2)
The final val_loss after 20 epochs is 0.7751. When I uncomment the only comment line to add the batch normalization layer, the val_loss changes to 1.1230.
My main problem is way more complicated, but the same thing occurs. Since my activation is linear, it does not matter if I put the batch normalization after or before the activation.
Questions: Why batch normalization cannot help? Is there anything I can change so that the batch normalization improves the result without changing the activation functions?
Update after getting a comment:
An NN with one hidden layer and linear activations is kind of like PCA. There are tons of papers on this. For me, this setting gives minimal MSE among all combinations of activation functions for the hidden layer and output.
Some resources that state linear activations mean PCA:
https://arxiv.org/pdf/1702.07800.pdf
https://link.springer.com/article/10.1007/BF00275687
https://www.quora.com/How-can-I-make-a-neural-network-to-work-as-a-PCA