Assume I have the following loss function:

loss_a = tf.reduce_mean(my_loss_fn(model_output, targets))
loss_b = tf.reduce_mean(my_other_loss_fn(model_output, targets))
loss_final = loss_a + tf.multiply(alpha, loss_b)

To visualize the norm of the gradients w.r.t. loss_final, one could do this:

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss_final)
grads, _ = list(zip(*grads_and_vars))
norms = tf.global_norm(grads)
gradnorm_s = tf.summary.scalar('gradient norm', norms)
train_op = optimizer.apply_gradients(grads_and_vars, name='train_op')

However, I would like to plot the norm of the gradients w.r.t. loss_a and loss_b separately. What is the most efficient way to do this? Do I have to call compute_gradients(..) on both loss_a and loss_b separately and then add those two gradients together before passing them to optimizer.apply_gradients(..)? I know this would be mathematically correct due to the sum rule of differentiation, but it seems a bit cumbersome, and I'm also not sure how to implement the summation of the gradients correctly (my rough attempt is sketched below). Also, loss_final is rather simple because it's just a sum. What if loss_final were more complicated, e.g. a division?
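
For illustration, the cumbersome variant I have in mind would look roughly like this. This is just a sketch and I'm not sure it merges the gradients correctly; it assumes neither compute_gradients(..) call returns None gradients, with alpha and optimizer as above:

grads_and_vars_a = optimizer.compute_gradients(loss_a)
grads_and_vars_b = optimizer.compute_gradients(loss_b)
# both lists iterate over the same trainable variables in the same order,
# so the per-variable gradients can be merged pair-wise
summed_grads_and_vars = [
    (ga + tf.multiply(alpha, gb), v)
    for (ga, v), (gb, _) in zip(grads_and_vars_a, grads_and_vars_b)
]
train_op = optimizer.apply_gradients(summed_grads_and_vars, name='train_op')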

I'm using TensorFlow 0.12.

kafman

1 Answer

You are right that combining the gradients manually could get messy. Instead, just compute the gradients of each of the losses as well as of the final loss. Because the losses share the same forward graph and TensorFlow optimizes the directed acyclic graph (DAG) before execution, this doesn't result in duplicated work.

For example:

import tensorflow as tf

with tf.name_scope('inputs'):
    W = tf.Variable(dtype=tf.float32, initial_value=tf.random_normal((4, 1), dtype=tf.float32), name='W')
    x = tf.random_uniform((6, 4), dtype=tf.float32, name='x')

with tf.name_scope('outputs'):
    y = tf.matmul(x, W, name='y')

def my_loss_fn(output, targets, name):
    return tf.reduce_mean(tf.abs(output - targets), name=name)

def my_other_loss_fn(output, targets, name):
    return tf.sqrt(tf.reduce_mean((output - targets) ** 2), name=name)

def get_tensors(loss_fn):
    # build the loss, its gradient w.r.t. W, and the gradient norm for one loss
    loss = loss_fn(y, targets, 'loss')
    grads = tf.gradients(loss, W, name='gradients')
    norm = tf.norm(grads, name='norm')

    return loss, grads, norm

targets = tf.random_uniform((6, 1))

with tf.name_scope('a'):
    loss_a, grads_a, norm_a = get_tensors(my_loss_fn)

with tf.name_scope('b'):
    loss_b, grads_b, norm_b = get_tensors(my_other_loss_fn)

with tf.name_scope('combined'):
    loss = tf.add(loss_a, loss_b, name='loss')
    grad = tf.gradients(loss, W, name='gradients')

with tf.Session() as sess:
    tf.global_variables_initializer().run(session=sess)

    writer = tf.summary.FileWriter('./tensorboard_results', sess.graph)
    res = sess.run([norm_a, norm_b, grad])

    print(*res, sep='\n')
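
If you want the two norms in TensorBoard rather than just printed (as in the snippet from your question), you can also attach scalar summaries to them. A sketch (the summary tags are arbitrary placeholders):

norm_a_summary = tf.summary.scalar('gradient_norm_a', norm_a)
norm_b_summary = tf.summary.scalar('gradient_norm_b', norm_b)
merged_summaries = tf.summary.merge([norm_a_summary, norm_b_summary])

# inside the session, evaluate and write them for a given training step:
# writer.add_summary(sess.run(merged_summaries), global_step=step)  # step: your iteration counter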

Edit: In response to your comment... You can check the DAG of a TensorFlow model using TensorBoard. I've updated the code above to write out the graph.

Run tensorboard --logdir $PWD/tensorboard_results in a terminal and navigate to the URL printed on the command line (typically http://localhost:6006/). Then click on the GRAPH tab to view the DAG. You can recursively expand the tensors, ops, and namespaces to drill down into the subgraphs and see individual operations and their inputs.

tensorboard DAG example
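
If you prefer to check programmatically instead of through the TensorBoard UI, one option (again just a sketch) is to list the operations in the default graph: the forward ops (e.g. the matmul under outputs/y) should show up only once, while the gradient ops live under their respective name scopes.

# print every op in the graph; the shared forward ops appear a single time
# even though gradients were requested for loss_a, loss_b and the combined loss
for op in tf.get_default_graph().get_operations():
    print(op.name)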

Alex
  • Thanks for your answer. How do you know that this is being optimized by TensorFlow and that really no duplication of work happens? Can we somehow verify what optimizations are applied to the DAG? I know I could run some tests, but is there some kind of guarantee that this possible optimization is indeed applied every time, or are we relying on "best effort" behavior? – kafman May 13 '17 at 08:54
  • @kaufmanu updated the answer to show how to capture the graph. – Alex May 17 '17 at 23:07