11

Is it possible to minimise a loss function by changing only some elements of a variable? In other words, if I have a variable X of length 2, how can I minimise my loss function by changing X[0] and keeping X[1] constant?

Hopefully my attempt at this illustrates the problem:

import tensorflow as tf
import tensorflow.contrib.opt as opt

X = tf.Variable([1.0, 2.0])
X0 = tf.Variable([3.0])

Y = tf.constant([2.0, -3.0])

scatter = tf.scatter_update(X, [0], X0)

with tf.control_dependencies([scatter]):
    loss = tf.reduce_sum(tf.squared_difference(X, Y))

opt = opt.ScipyOptimizerInterface(loss, [X0])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    opt.minimize(sess)

    print("X: {}".format(X.eval()))
    print("X0: {}".format(X0.eval()))

which outputs:

INFO:tensorflow:Optimization terminated with:
  Message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
  Objective function value: 26.000000
  Number of iterations: 0
  Number of functions evaluations: 1
X: [3. 2.]
X0: [3.]

whereas I would like to find the optimal value X0 = 2 and thus X = [2, 2].

edit

Motivation for doing this: I would like to import a trained graph/model and then tweak various elements of some of the variables depending on some new data I have.

Jeff

4 Answers

5

You can use this trick to restrict the gradient calculation to one index:

import tensorflow as tf
import tensorflow.contrib.opt as opt

X = tf.Variable([1.0, 2.0])

part_X = tf.scatter_nd([[0]], [X[0]], [2])

X_2 = part_X + tf.stop_gradient(-part_X + X)

Y = tf.constant([2.0, -3.0])

loss = tf.reduce_sum(tf.squared_difference(X_2, Y))

opt = opt.ScipyOptimizerInterface(loss, [X])

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    opt.minimize(sess)

    print("X: {}".format(X.eval()))

part_X is a vector with the same shape as X that holds X[0] at index 0 and zeros everywhere else. part_X + tf.stop_gradient(-part_X + X) has the same value as X in the forward pass, since part_X - part_X is 0. In the backward pass, however, tf.stop_gradient blocks the gradient through the second term, so the gradient reaches X only through part_X, that is, only at index 0.
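
To make the forward/backward behaviour concrete, here is a minimal plain-Python sketch of the same idea. It is not TensorFlow code: the gradient is written out by hand under the assumption that nothing flows through the stop_gradient term, and the names MASK, forward and loss_and_grad, the step size and the iteration count are all made up for illustration.

```python
# Plain-Python stand-in for the graph above (no TensorFlow involved).
X = [1.0, 2.0]
Y = [2.0, -3.0]
MASK = [1.0, 0.0]  # hypothetical helper: 1.0 where the element may change


def forward(x):
    # part_x keeps x[0] and zeros the rest, like the scatter_nd call.
    part_x = [m * v for m, v in zip(MASK, x)]
    # The term that would sit inside stop_gradient: x - part_x.
    rest = [v - p for v, p in zip(x, part_x)]
    # Numerically this is just x again.
    return [p + r for p, r in zip(part_x, rest)]


def loss_and_grad(x):
    x2 = forward(x)
    loss = sum((a - b) ** 2 for a, b in zip(x2, Y))
    # Assumption: the gradient flows only through part_x, so the
    # derivative of x2 with respect to x is the mask itself.
    grad = [2.0 * (a - b) * m for a, b, m in zip(x2, Y, MASK)]
    return loss, grad


x = list(X)
for _ in range(100):  # plain gradient descent, step size 0.1
    _, g = loss_and_grad(x)
    x = [v - 0.1 * gv for v, gv in zip(x, g)]

final_loss, final_grad = loss_and_grad(x)
print("x:", x)
print("loss:", final_loss)
```

In this sketch x[0] moves towards 2 while x[1] stays at exactly 2.0, and the loss settles near 25, which is the contribution of the frozen element: (2 - (-3))^2.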

BlueSun
2

I'm not sure whether this is possible with the SciPy optimizer interface, but with one of the regular tf.train.Optimizer subclasses you can call compute_gradients first, mask the gradients, and then call apply_gradients, instead of calling minimize (which, as the docs say, basically calls those two in turn).

import tensorflow as tf

X = tf.Variable([3.0, 2.0])
# Select updatable parameters
X_mask = tf.constant([True, False], dtype=tf.bool)
Y = tf.constant([2.0, -3.0])
loss = tf.reduce_sum(tf.squared_difference(X, Y))
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Get gradients and mask them
((X_grad, _),) = opt.compute_gradients(loss, var_list=[X])
X_grad_masked = X_grad * tf.cast(X_mask, dtype=X_grad.dtype)
# Apply masked gradients
train_step = opt.apply_gradients([(X_grad_masked, X)])

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(10):
        _, X_val = sess.run([train_step, X])
        print("Step {}: X = {}".format(i, X_val))
    print("Final X = {}".format(X.eval()))

Output:

Step 0: X = [ 2.79999995  2.        ]
Step 1: X = [ 2.63999987  2.        ]
Step 2: X = [ 2.51199985  2.        ]
Step 3: X = [ 2.40959978  2.        ]
Step 4: X = [ 2.32767987  2.        ]
Step 5: X = [ 2.26214385  2.        ]
Step 6: X = [ 2.20971513  2.        ]
Step 7: X = [ 2.16777205  2.        ]
Step 8: X = [ 2.13421774  2.        ]
Step 9: X = [ 2.10737419  2.        ]
Final X = [ 2.10737419  2.        ]
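
The arithmetic behind these numbers can be checked without TensorFlow. The following is a rough plain-Python sketch of the same compute, mask, apply sequence; the two functions are hypothetical stand-ins that only borrow the names of the optimizer methods, and the gradient of the squared-difference loss is written out by hand.

```python
X = [3.0, 2.0]
X_MASK = [True, False]  # only the first element may be updated
Y = [2.0, -3.0]
LEARNING_RATE = 0.1


def compute_gradients(x):
    # d/dx of sum((x - Y)^2), derived by hand.
    return [2.0 * (a - b) for a, b in zip(x, Y)]


def apply_gradients(x, grads):
    # One plain gradient descent step.
    return [v - LEARNING_RATE * g for v, g in zip(x, grads)]


x = list(X)
history = []
for step in range(10):
    grads = compute_gradients(x)
    # Zero out the gradient wherever the mask is False.
    masked = [g if keep else 0.0 for g, keep in zip(grads, X_MASK)]
    x = apply_gradients(x, masked)
    history.append(list(x))
    print("Step {}: X = {}".format(step, x))
```

Each step shrinks the distance between x[0] and 2 by a factor of 0.8, which is why ten steps leave it at about 2 + 0.8^10 ≈ 2.107, while the masked element never moves.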
jdehesa
  • This looks good - thanks! IIUC, unfortunately it requires calculating the gradients with respect to all the elements in `X` (in my real problem my `X` is quite big and I only want to poke at a few elements of it, my bad for not specifying that in the question) and it is not possible with the SciPy optimizer, but this is the best I have so far - so I will accept it before the bounty ends unless a better solution is found. – Jeff Mar 07 '18 at 11:48
  • @Jeff Yes, that is correct, at least about needing to compute all the gradients (I don't know much about SciPy optimizers so I can't say for sure about that). As discussed in the comments of the other answer, if the set of parameters that you need to update is fixed (e.g. always the first element of the array) and not dynamic (e.g. deciding which weights are updated depending on some value, or changing it mid-training), you could define as variables only the elements that you know you need updated and concatenate the rest of weights to make the network, I don't know if that'd work for you. – jdehesa Mar 07 '18 at 12:39
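
The concatenation idea from the comment above can be sketched as follows. This is only a plain-Python illustration, not TensorFlow: x_trainable, x_fixed, build_x and grad_wrt_trainable are invented names, and the gradient is again derived by hand. The point is that a gradient is computed only for the elements that were declared trainable.

```python
Y = [2.0, -3.0]
x_trainable = [3.0]  # the only part treated as a variable
x_fixed = [2.0]      # held constant, never receives a gradient


def build_x(trainable, fixed):
    # Stand-in for concatenating the two parts into the full vector.
    return trainable + fixed


def grad_wrt_trainable(trainable, fixed):
    x = build_x(trainable, fixed)
    # Only the leading len(trainable) entries need a gradient.
    return [2.0 * (x[i] - Y[i]) for i in range(len(trainable))]


for _ in range(100):  # plain gradient descent, step size 0.1
    g = grad_wrt_trainable(x_trainable, x_fixed)
    x_trainable = [v - 0.1 * gv for v, gv in zip(x_trainable, g)]

x = build_x(x_trainable, x_fixed)
print("x:", x)
```

This only fits the case described in the comment, where the set of elements to update is known up front and does not change during optimisation.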
1

This should be pretty easy to do by using the var_list parameter of the minimize function.

trainable_var = X[0]
train_op = tf.train.GradientDescentOptimizer(learning_rate=1e-3).minimize(loss, var_list=[trainable_var])

Note that, by convention, all trainable variables are added to the default TensorFlow collection GraphKeys.TRAINABLE_VARIABLES, so you can get a list of all trainable variables using:

all_trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

This is just a list of variables which you can manipulate as you see fit and use as the var_list parameter.

As a tangent to your question, if you ever want to take customizing the optimization process a step further, you can also compute the gradients manually using grads = tf.gradients(loss, var_list), manipulate the gradients as you see fit, and then call tf.train.GradientDescentOptimizer(...).apply_gradients(grads_and_vars_as_list_of_tuples). Under the hood, minimize is just doing these two steps for you.

Also note that you are perfectly free to create different optimizers for different collections of variables. You could create an SGD optimizer with learning rate 1e-4 for some variables, and another Adam optimizer with learning rate 1e-2 for another set of variables. Not that there's any specific use case for this, I'm just pointing out the flexibility you now have.

David Parks
  • The `var_list` parameter expects a list of [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) objects, but `X[0]` is the tensor that results from applying a slice operation to `X`. You cannot optimize with respect to that, since regular tensors cannot be updated (and in any case it wouldn't be updating the original variable, since it's a different tensor). – jdehesa Mar 06 '18 at 17:32
  • I admit that I haven't tried it with a slice, though I would still expect it to work. If it doesn't, try `tf.identity(X[0])` as a way to create a separate variable name for it. I could still be wrong, but I have coded an optimizer in TensorFlow and the interface to do so is quite simple. The optimizer is simply handed a list of variables and their gradients to produce updates for. If assigning `trainable_var = X[0]` doesn't work, I'll be a little surprised, and I'll be even more surprised if `tf.identity` doesn't do it. I expect it to work because you can take the gradient wrt the slice. – David Parks Mar 06 '18 at 17:35
  • Well, give it a try, but I'm afraid neither slicing nor `tf.identity` create variables, they are ops that produce tensors as outputs, so I would actually be really surprised if it worked. – jdehesa Mar 06 '18 at 17:37
  • An alternative, if that all fails, would then be to start with 2 variables and concatenate them for your computations. – David Parks Mar 06 '18 at 17:38
  • Hmm, you have a good point there. I may be wrong about the slicing then. – David Parks Mar 06 '18 at 17:38
  • Yes, concatenating is another good alternative, although I don't know if that's enough for the OP or they need to be able to dynamically select the updated parameters... – jdehesa Mar 06 '18 at 17:39
  • The optimizers receive a list of `(gradient, variable)` pairs. So I know there's no way in the optimizer interface to do that, short of re-coding each optimizer for specific support. – David Parks Mar 06 '18 at 17:40
  • After a few more minutes of thought, I think the `apply_gradients` approach by @jdehesa is the best answer. – David Parks Mar 06 '18 at 17:43
  • Many thanks for the answer (and discussion, hence +1), but unfortunately for the reasons stated by @DavidParks it doesn't work :-( I will update my question with some motivation for doing this which hopefully will help with an answer (if this is even possible!) – Jeff Mar 07 '18 at 11:37
0

The answer by Oren in the second link below calls a function (defined in the first link) that takes a Boolean hot matrix of the parameters to optimize and the tensor of parameters. It uses stop_gradient and works like a charm for a neural network I developed.

Update only part of the word embedding matrix in Tensorflow

https://github.com/tensorflow/tensorflow/issues/9162

Begbi