
I have two inputs, qi_pos and qi_neg, with the same shape. They should both be processed by the same MLP layers, and finally produce two results that act as scores. Here is my code:

  self.mlp1_pos     = nn_layers.full_connect_(qi_pos,        256, activation='relu', use_bn=None, keep_prob=self.keep_prob, name='deep_mlp_1')
  self.mlp2_pos     = nn_layers.full_connect_(self.mlp1_pos, 128, activation='relu', use_bn=True, keep_prob=self.keep_prob, name='deep_mlp_2')
  self.pos_pair_sim = nn_layers.full_connect_(self.mlp2_pos,   1, activation=None,   use_bn=True, keep_prob=self.keep_prob, name='deep_mlp_3')
  # reuse the same weights and BN parameters for the negative branch
  tf.get_variable_scope().reuse_variables()
  self.mlp1_neg     = nn_layers.full_connect_(qi_neg,        256, activation='relu', use_bn=None, keep_prob=self.keep_prob, name='deep_mlp_1')
  self.mlp2_neg     = nn_layers.full_connect_(self.mlp1_neg, 128, activation='relu', use_bn=True, keep_prob=self.keep_prob, name='deep_mlp_2')
  self.neg_pair_sim = nn_layers.full_connect_(self.mlp2_neg,   1, activation=None,   use_bn=True, keep_prob=self.keep_prob, name='deep_mlp_3')
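
One way to double-check that the negative branch really reuses the positive branch's weights and BN parameters (beta/gamma) is to list the trainable variables after both branches are built; this is only a sanity check, not part of the model:

  # Each deep_mlp_*/weight, bias and bn beta/gamma variable should be listed
  # exactly once, confirming that the pos/neg branches share parameters
  # through reuse_variables().
  for v in tf.trainable_variables():
      print(v.name)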

I use a BN layer to normalize the hidden-layer activations:

def full_connect_(inputs, num_units, activation=None, use_bn=None, keep_prob=1.0, name='full_connect_'):
  with tf.variable_scope(name):
    # affine transform: inputs * W + b
    shape = [inputs.get_shape()[-1], num_units]
    weight = weight_variable(shape)
    bias = bias_variable(shape[-1])
    outputs_ = tf.matmul(inputs, weight) + bias
    # optional batch normalization, applied before the activation
    if use_bn:
      outputs_ = tf.contrib.layers.batch_norm(outputs_, center=True, scale=True, is_training=True,
                                              decay=0.9, epsilon=1e-5, scope='bn')
    if activation == "relu":
      outputs = tf.nn.relu(outputs_)
    elif activation == "tanh":
      outputs = tf.tanh(outputs_)
    elif activation == "sigmoid":
      outputs = tf.nn.sigmoid(outputs_)
    else:
      outputs = outputs_
    return outputs
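
For completeness, weight_variable and bias_variable are not shown above; a minimal TF1-style sketch of such helpers (an assumption, not necessarily the exact ones used) could look like this. They need to go through tf.get_variable so that tf.get_variable_scope().reuse_variables() can share them between the two branches:

# Assumed helpers (not shown in the original code).
def weight_variable(shape, name='weight'):
  return tf.get_variable(name, shape=shape,
                         initializer=tf.truncated_normal_initializer(stddev=0.1))

def bias_variable(size, name='bias'):
  return tf.get_variable(name, shape=[size],
                         initializer=tf.constant_initializer(0.0))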

The pair scores are then turned into predictions:

with tf.name_scope('predictions'):
  self.sim_diff = self.pos_pair_sim - self.neg_pair_sim  # shape = (batch_size, 1)
  self.preds = tf.sigmoid(self.sim_diff)                 # shape = (batch_size, 1)
  self.infers = self.pos_pair_sim

Below is the loss definition. It seems all right to me.

with tf.name_scope('predictions'):
  sim_diff = pos_pair_sim - neg_pair_sim
  predictions = tf.sigmoid(sim_diff)
  self.infers = pos_pair_sim
## loss and optim
with tf.name_scope('loss'):
  self.loss = nn_layers.cross_entropy_loss_with_reg(self.labels, self.preds)
  tf.summary.scalar('loss', self.loss)
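
The optimizer step behind the "## loss and optim" comment is not shown. With tf.contrib.layers.batch_norm's default updates_collections, the moving-average update ops land in tf.GraphKeys.UPDATE_OPS and have to be run together with the train op; a sketch (the AdamOptimizer and the learning rate here are assumptions) would be:

# Sketch only: run the BN moving-average updates together with the train op.
with tf.name_scope('optim'):
  update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
  with tf.control_dependencies(update_ops):
    self.train_op = tf.train.AdamOptimizer(1e-3).minimize(self.loss)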

I am not sure whether I have used the BN layers in the right way. My concern is that the BN parameters are fitted on hidden units coming from two separate branches, which take the qi_pos and qi_neg tensors as inputs. Could anyone help check it?


1 Answer


Your code seems fine to me; there's no problem in applying BN in different branches of the network. But I'd like to mention a few notes here:

  • BN hyperparameters are pretty standard, so I usually don't manually set decay, epsilon and renorm_decay. That doesn't mean you mustn't change them; it's simply unnecessary in most cases.

  • You're applying the BN before the activation function, but there is evidence that it works better if applied after the activation. See, for example, this discussion. Once again, it doesn't mean it's a bug, simply one more architecture to consider (see the sketch after this list).
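
For illustration, a minimal sketch of a hidden-layer block that follows both notes, keeping batch_norm's default decay/epsilon and placing BN after the ReLU (weight_variable/bias_variable are the question's helpers), could look like this:

def full_connect_post_bn(inputs, num_units, is_training, name='full_connect_post_bn'):
  with tf.variable_scope(name):
    shape = [inputs.get_shape()[-1], num_units]
    weight = weight_variable(shape)
    bias = bias_variable(shape[-1])
    outputs_ = tf.nn.relu(tf.matmul(inputs, weight) + bias)
    # default decay/epsilon; BN applied after the activation
    return tf.contrib.layers.batch_norm(outputs_, center=True, scale=True,
                                        is_training=is_training, scope='bn')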

  • Thanks. By the way, is it necessary to use BN in my network? Actually my network only has three layers and the inputs are embeddings of short sentences. I hear that BN is commonly used in computer vision networks, since those networks' parameters are very large and dense. – yanachen Nov 09 '17 at 02:23
  • That's a deep question; there's no universal answer to it. BN *can* make things worse in terms of speed of convergence (typically in RNN and RL tasks). It's also possible that the data is so well normalized that BN doesn't make a big difference, while still not being free in terms of computation. If there are only 3 layers and the data is well pre-processed, I'd generally consider one or no BN layer, then move on with examining my model. It might end up with several BN layers, but that's not the design I'd start with. – Maxim Nov 09 '17 at 08:35