TensorFlow - introducing both L2 regularization and dropout into the network. Does it make any sense?

Question

I am currently playing with an ANN as part of the Udacity Deep Learning course.

I successfully built and trained a network and introduced L2 regularization on all weights and biases. Right now I am trying out dropout on the hidden layer in order to improve generalization. Does it make sense to apply both L2 regularization and dropout to the same hidden layer? If so, how do I do this properly?

During dropout we literally switch off half of the activations of the hidden layer and double the output of the remaining neurons. With L2 we compute the L2 norm over all hidden weights. But I am not sure how to compute L2 when dropout is in use. Since we switch off some activations, shouldn't we remove the weights that are 'not used' from the L2 calculation? Any references on this would be useful; I haven't found any information.
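
One way to look at the interaction is a small NumPy sketch (not TensorFlow; `inverted_dropout` and `l2_penalty` are hypothetical helper names, and the shapes are made up). It assumes the scaled ("inverted") dropout described above. The L2 term depends only on the weights, so the dropout mask never enters it. On a given step, the outgoing weights of a dropped unit get no gradient from the data term, but they still get the L2 gradient.

```python
import numpy as np

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob; scale the rest by 1 / keep_prob."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask

def l2_penalty(weights):
    """Half the sum of squared weights (same convention as tf.nn.l2_loss)."""
    return 0.5 * np.sum(weights ** 2)

rng = np.random.default_rng(0)
beta = 0.001
hidden = np.ones((4, 6))               # toy hidden activations: 4 examples, 6 units
out_weights = rng.normal(size=(6, 3))  # toy output weights: 6 units, 3 labels

dropped, mask = inverted_dropout(hidden, keep_prob=0.5, rng=rng)

# The penalty is computed from the full weight matrix, mask or no mask.
penalty_before = l2_penalty(out_weights)
logits = dropped @ out_weights
penalty_after = l2_penalty(out_weights)

# Gradient of the data term w.r.t. out_weights for the first example,
# with a placeholder upstream gradient of ones.
x = dropped[:1]                 # shape (1, 6)
upstream = np.ones((1, 3))      # stand-in for d(loss)/d(logits)
data_grad = x.T @ upstream      # rows belonging to dropped units are all zero
l2_grad = beta * out_weights    # gradient of beta * l2_penalty(out_weights)
```

In this sketch nothing is removed from the L2 calculation: dropout changes which activations flow forward, while the penalty keeps shrinking every weight on every step.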

Just in case you are interested, my code for ANN with L2 regularization is below:

#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization. Default is 0.01
#but feel free to play with it
#notice, we introduce L2 for both biases and weights of all layers

beta = 0.01

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #that's how many hidden neurons we want
  num_hidden_neurons = 1024
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels  
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))  

  out_biases = tf.Variable(tf.zeros([num_labels]))  

  #compute output  
  out_layer = tf.matmul(hidden_layer,out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    out_layer, tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(  tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases) 

  test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)



#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after 
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset[offset:(offset + batch_size), :]
    batch_labels = train_labels[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))

Why are you regularizing the biases? — Tamim Addari, Aug 20 '16 at 07:30

score 18 · Accepted Answer · edited Aug 08 '16 at 10:12

OK, after some additional effort I managed to solve it and introduce both L2 and dropout into my network; the code is below. I got a slight improvement over the same network without dropout (with L2 in place). I am still not sure whether it is really worth the effort to introduce both L2 and dropout, but at least it works and slightly improves the results.

#ANN with introduced dropout
#This time we still use the L2 but restrict training dataset
#to be extremely small

#get just first 500 of examples, so that our ANN can memorize whole dataset
train_dataset_2 = train_dataset[:500, :]
train_labels_2 = train_labels[:500]

#batch size for SGD and beta parameter for L2 loss
batch_size = 128
beta = 0.001

#that's how many hidden neurons we want
num_hidden_neurons = 1024

#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
  # Input data. For the training data, we use a placeholder that will be fed
  # at run time with a training minibatch.
  tf_train_dataset = tf.placeholder(tf.float32,
                                    shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  #now let's build our new hidden layer
  #its weights
  hidden_weights = tf.Variable(
    tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
  hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))

  #now the layer itself. It multiplies data by weights, adds biases
  #and takes ReLU over result
  hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset, hidden_weights) + hidden_biases)

  #add dropout on hidden layer
  #keep_prob is the probability of keeping an activation (fed at run time);
  #tf.nn.dropout zeroes the rest and scales the kept ones by 1/keep_prob
  keep_prob = tf.placeholder("float")
  hidden_layer_drop = tf.nn.dropout(hidden_layer, keep_prob)  

  #time to go for output linear layer
  #out weights connect hidden neurons to output labels
  #biases are added to output labels  
  out_weights = tf.Variable(
    tf.truncated_normal([num_hidden_neurons, num_labels]))  

  out_biases = tf.Variable(tf.zeros([num_labels]))  

  #compute output
  #notice that during training we use the switched off activations
  #i.e. the variation of hidden_layer with the dropout active
  out_layer = tf.matmul(hidden_layer_drop,out_weights) + out_biases
  #our real output is a softmax of prior result
  #and we also compute its cross-entropy to get our loss
  #Notice - we introduce our L2 here
  loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
    out_layer, tf_train_labels) +
    beta*tf.nn.l2_loss(hidden_weights) +
    beta*tf.nn.l2_loss(hidden_biases) +
    beta*tf.nn.l2_loss(out_weights) +
    beta*tf.nn.l2_loss(out_biases)))

  #now we just minimize this loss to actually train the network
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  #nice, now let's calculate the predictions on each dataset for evaluating the
  #performance so far
  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(out_layer)
  valid_relu = tf.nn.relu(  tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases) 

  test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)



#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after 
#every 500 steps

#number of steps we will train our ANN
num_steps = 3001

#actual training
with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print("Initialized")
  for step in range(num_steps):
    # Pick an offset within the training data, which has been randomized.
    # Note: we could use better randomization across epochs.
    offset = (step * batch_size) % (train_labels_2.shape[0] - batch_size)
    # Generate a minibatch.
    batch_data = train_dataset_2[offset:(offset + batch_size), :]
    batch_labels = train_labels_2[offset:(offset + batch_size), :]
    # Prepare a dictionary telling the session where to feed the minibatch.
    # The key of the dictionary is the placeholder node of the graph to be fed,
    # and the value is the numpy array to feed to it.
    feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels, keep_prob : 0.5}
    _, l, predictions = session.run(
      [optimizer, loss, train_prediction], feed_dict=feed_dict)
    if (step % 500 == 0):
      print("Minibatch loss at step %d: %f" % (step, l))
      print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
      print("Validation accuracy: %.1f%%" % accuracy(
        valid_prediction.eval(), valid_labels))
      print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
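
The validation and test predictions above are built from `hidden_weights` directly and bypass `hidden_layer_drop`. A rough NumPy check of why that is reasonable, assuming (as `tf.nn.dropout` documents) that kept activations are scaled by 1/keep_prob: averaged over many random masks, the dropped activations come out close to the undropped ones. The data and trial count here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
keep_prob = 0.5
hidden = rng.random((1, 8))   # one toy example with 8 hidden activations

# Apply many independent dropout masks to the same activations.
num_trials = 20000
masks = rng.random((num_trials, 8)) < keep_prob
dropped = hidden * masks / keep_prob

# The average over masks should be close to the activations without dropout.
mean_dropped = dropped.mean(axis=0)
max_gap = np.max(np.abs(mean_dropped - hidden[0]))
print("largest gap between averaged and undropped activations:", max_gap)
```

An alternative to separate evaluation ops would be to reuse the dropout path and feed `keep_prob : 1.0` when evaluating, which turns dropout into a no-op.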

The original paper on dropout does specifically call out this kind of configuration, so you're probably in good shape trying this. Though I might note that I don't think it's normal to include L2 regularization on the biases, only on the weights. http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf http://stats.stackexchange.com/questions/153605/no-regularisation-term-for-bias-unit-in-neural-network — David Parks, Aug 31 '16 at 00:43
@DavidParks It seems like we need to include L2 on biases as well. Please take a look at the TensorFlow MNIST example here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/mnist/convolutional.py Search for the 'l2_loss' function calls. — Petr Shypila, Sep 02 '16 at 15:34
Take this as an example: we have a single feature x and its values y, and we perform a linear fit to the data, y=mx+b. If all data points cluster around y=1000 with little variance, we'll need a large bias to shift the line up to 1000. That isn't a problem to adjust for, it's just where the data lies. The problem is when we overweight a feature. The bias is just an offset. With that said, I've plotted a histogram of weights and biases on classification and regression problems recently and in neither case did I see biases that were large. So I doubt it's causing a noticeable problem. — David Parks, Nov 13 '16 at 21:55
@PetrShypila The example provided seems wrong. From what I understand, it doesn't make sense to apply regularization to the bias. The bias won't make your model overfit, so it shouldn't be penalized. Here's another course on regularization: https://www.youtube.com/watch?v=ef2OPmANLaM (BTW I get 93% accuracy in 3000 steps w/o regularising the bias) — Julien, Nov 26 '16 at 08:55
@Julien well, these examples are provided by the TensorFlow developers. Honestly, I am not a deep expert in TensorFlow and ML, so those guys could be mistaken as well — Petr Shypila, Nov 26 '16 at 13:24
Here's the reference regarding regularizing the bias that I wanted: http://www.deeplearningbook.org/contents/regularization.html, just search "bias" and you'll find a paragraph in there regarding not regularizing the bias unit. — David Parks, Dec 07 '16 at 20:56
yes. Bias should not be regularized. I removed that from your code in my own implementation and it worked slightly better. — marc, Dec 09 '16 at 23:34
It works just fine as you wrote it, but reduce_mean results in a scalar, as does l2_loss; you are, however, first adding a scalar to a tensor and then applying reduce_mean. This is not necessary and I'd expect it to be slower. Do this instead: `loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits( out_layer, tf_train_labels)) + beta*tf.nn.l2_loss(hidden_weights) + ...` — Bastiaan, Jun 28 '17 at 04:37
I have never seen the `beta` being multiplied in the loss function. Is this common? Should this be a parameter that is tuned? — O.rka, Jul 24 '17 at 17:39
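
The point about `reduce_mean` in the comments above can be checked with a few lines of NumPy (a sketch with made-up numbers, not the original graph). Adding a scalar penalty to every per-example loss and then averaging gives the same value as averaging first and adding the penalty once.

```python
import numpy as np

# Made-up per-example cross-entropy values for a batch of four.
per_example_loss = np.array([0.9, 1.4, 0.3, 2.0])
# Stand-in for beta * (sum of the l2_loss terms), which is a scalar.
penalty = 0.75

# Form used in the question: scalar broadcast onto the batch, then averaged.
inside = np.mean(per_example_loss + penalty)
# Form suggested in the comment: average the batch, then add the scalar once.
outside = np.mean(per_example_loss) + penalty

print(inside, outside)
```

Both forms describe the same loss, so the choice only affects how much work the graph does. As for `beta`, the question's own code comments describe it as the knob that controls the level of regularization, so it is natural to treat it as a hyperparameter to tune.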

score 9 · Answer 2 · answered May 28 '17 at 05:09

There is no downside to using multiple regularizations. In fact, there is a paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, where the authors checked how much it helps. Clearly, different datasets will give different results, but for your MNIST:

you can see that Dropout + Max-norm gives the lowest error. Apart from this, you have a big error in your code.

You use l2_loss on weights and biases:

beta*tf.nn.l2_loss(hidden_weights) +
beta*tf.nn.l2_loss(hidden_biases) +
beta*tf.nn.l2_loss(out_weights) +
beta*tf.nn.l2_loss(out_biases)))

You should not penalize large biases, so remove the l2_loss terms over the biases.
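
A minimal NumPy sketch of that change, with small made-up shapes rather than the question's real ones. `l2_loss` here mirrors `tf.nn.l2_loss` (sum of squares divided by 2), and `total_loss` is a hypothetical helper that takes only the weight matrices, so the biases can be as large as they like without affecting the penalty.

```python
import numpy as np

def l2_loss(t):
    """Same convention as tf.nn.l2_loss: sum(t ** 2) / 2."""
    return np.sum(t ** 2) / 2.0

def total_loss(per_example_cross_entropy, weights, beta):
    """Mean data loss plus an L2 penalty over the given weight matrices only."""
    penalty = beta * sum(l2_loss(w) for w in weights)
    return np.mean(per_example_cross_entropy) + penalty

rng = np.random.default_rng(0)
hidden_weights = rng.normal(size=(5, 4))
hidden_biases = np.full(4, 100.0)    # large on purpose; never passed to the penalty
out_weights = rng.normal(size=(4, 3))
out_biases = np.full(3, -100.0)      # large on purpose; never passed to the penalty
cross_entropy = np.array([0.5, 1.5]) # made-up per-example values

loss = total_loss(cross_entropy, [hidden_weights, out_weights], beta=0.001)
print(loss)
```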

score 4 · Answer 3 · answered Dec 15 '16 at 08:06

Actually, the original paper uses max-norm regularization, and not L2, in addition to dropout: "The neural network was optimized under the constraint ||w||_2 ≤ c. This constraint was imposed during optimization by projecting w onto the surface of a ball of radius c, whenever w went out of it. This is also called max-norm regularization since it implies that the maximum value that the norm of any weight can take is c" (http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
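
The projection described in that quote could look roughly like this in NumPy. This is a sketch under assumptions, not the paper's code: `max_norm_project` is a hypothetical helper, and it assumes each column holds the incoming weights of one unit, matching the `[inputs, units]` layout of the weight matrices in the question. It would be applied to the weights after each optimizer step.

```python
import numpy as np

def max_norm_project(weights, c):
    """Rescale any column whose L2 norm exceeds c so that its norm becomes c."""
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    # Columns already inside the ball get a scale of 1 and are left untouched.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return weights * scale

weights = np.array([[3.0, 0.3],
                    [4.0, 0.4]])   # column norms: 5.0 and 0.5
projected = max_norm_project(weights, c=2.0)
print(np.linalg.norm(projected, axis=0))
```

Unlike an L2 penalty, this leaves weights alone while they stay within the radius and only intervenes when a unit's weight vector leaves the ball.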

You can find a nice discussion about this regularization method here: https://plus.google.com/+IanGoodfellow/posts/QUaCJfvDpni
