6

I am using Keras with the TensorFlow backend, and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. Here is what I mean:

Lambda(lambda x: a(x))

I want to apply a to x in the forward pass, but I do not want a to be included in the derivative computation when backpropagation takes place.

I was trying to find a solution but I could not find anything. Can somebody help me out here?

DalekSupreme
  • Do you want to freeze it (= not update the weights for that specific layer)? – Nassim Ben Apr 07 '17 at 16:03
  • No. Let's say a(x) = 1/(1+e^-x). Then in the forward pass I want to push x through the sigmoid function, but in the backpropagation I do not want to include the derivative of the sigmoid – DalekSupreme Apr 07 '17 at 16:09
  • Sorry, can't help you there... I don't really see the purpose of differentiating a function other than the one you want to minimize; the backprop loses its purpose – Nassim Ben Apr 07 '17 at 16:12
  • Did you work it out? I need the same feature. Can you please show your solution? – Juan Wang Oct 05 '18 at 00:13

2 Answers

4

UPDATE 2

In addition to tf.py_func, there is now an official guide on how to add a custom op.


UPDATE

See this question for an example of writing a custom op with a gradient purely in Python, without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
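For reference, the pattern from that question looks roughly like the sketch below (TensorFlow 1.x API). The helper name py_func_with_grad and the choice of a sigmoid for a are illustrative, not taken from the question:

    import numpy as np
    import tensorflow as tf

    # Wrap a NumPy function with tf.py_func and attach a custom gradient to it
    # via gradient_override_map (the pattern described in the linked question).
    def py_func_with_grad(func, inp, Tout, grad, name=None):
        rnd_name = 'PyFuncGrad' + str(np.random.randint(0, 1 << 30))
        tf.RegisterGradient(rnd_name)(grad)  # register the Python gradient function
        g = tf.get_default_graph()
        with g.gradient_override_map({'PyFunc': rnd_name}):
            return tf.py_func(func, inp, Tout, stateful=True, name=name)

    def a(x):  # forward pass: e.g. a sigmoid computed in NumPy
        return 1.0 / (1.0 + np.exp(-x))

    def a_grad(op, grad):  # backward pass: pass the incoming gradient through unchanged
        return grad

    x = tf.placeholder(tf.float32, [None])
    y = py_func_with_grad(a, [x], [tf.float32], grad=a_grad)[0]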


Not exactly a solution to the problem, but still kind of an answer, and too long for a comment.

That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation, which is used during backpropagation. If you really wanted to do something like that, you would need to implement the op in TensorFlow yourself (no easy feat) and define the gradient that you want, because you can't have "no gradient"; if anything, it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to be / can be used outside of TensorFlow's own internals.

UPDATE

Okay, so a bit more context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be, for example, both a CPU and a GPU kernel for an op, hence the distinction. The set of ops supported by TensorFlow is essentially static; it can change with newer versions, but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops you would not be able to share your graph. Ops are then defined at the C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).

Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at the C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with names ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at that level. Instead, the associations between ops and their gradients are defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here; Python modules starting with gen_ are autogenerated with the help of the C++ macros). These registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
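To make the registration step concrete, this is roughly what it looks like from Python (TensorFlow 1.x API); the op names below are hypothetical placeholders, not real TensorFlow ops:

    import tensorflow as tf

    # Associate an op type with a Python gradient function; TensorFlow does this
    # for its own ops, e.g. in tensorflow/python/ops/math_grad.py.
    @tf.RegisterGradient("MyForwardOp")  # hypothetical op name
    def _my_forward_op_grad(op, grad):
        # Given the op and the incoming gradient, return the gradient with
        # respect to each input; here it is simply passed straight through.
        return grad

    # Declare that an op contributes nothing to backpropagation:
    tf.NoGradient("MySkippedOp")  # newer alias: tf.NotDifferentiable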

So, how do you actually work this out? Well, you need to create at least one op in C++, with the corresponding kernel(s) implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is that it's possible, and there's even an example for it (although I think they kind of forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (by the way, I'm not sure whether any of this works on Windows) that is then loaded from Python (obviously this involves going through the painful process of compiling TensorFlow manually with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that registers (as of now) one custom operation here, through a macro defined here that calls REGISTER_OP; then, in Python, it loads the library and registers its gradient here through its own registration function defined here, which simply calls tf.NotDifferentiable (another name for tf.NoGradient).
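Once such an op has been compiled into a shared library, the Python side of that workflow would look something like this sketch; the library path, op name and wrapper name are all made up for illustration:

    import tensorflow as tf

    # Load the compiled custom op (hypothetical path and names).
    my_module = tf.load_op_library('./my_forward_op.so')

    # Register its gradient as in the previous snippet, or declare that it has none:
    tf.NotDifferentiable("MyForwardOp")

    x = tf.constant([1.0, 2.0, 3.0])
    y = my_module.my_forward_op(x)  # auto-generated Python wrapper for the C++ op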

tl;dr: It is rather hard, but it can be done, and there are even a couple of examples out there.

jdehesa
  • Thanks for your answer. It makes sense if you want to calculate the forward pass with a non-differentiable function and then do the backpropagation with a very similar but differentiable function; we will see. Can you give me a link where they describe how to implement the op and embed it into a Keras layer? – DalekSupreme Apr 07 '17 at 17:52
  • 1
    @DalekSupreme Ohh I see, so it's not "deleting" the gradient but replacing it with an "alternative" computation, okay yes that makes sense. I'll see if I can find an example, but unless I'm wrong (and I could be) I think it would involve changes on the C++ side of things and recompilation... But it should at least be possible. – jdehesa Apr 07 '17 at 18:02
  • Well, I was kind of hoping for a simpler solution, but if that is the case then I at least know that it is very hard to achieve. Let's hope for someone with an easier solution. If there is no such thing I will accept your answer. Thanks for your help :) – DalekSupreme Apr 07 '17 at 19:05
  • @DalekSupreme I've updated the answer with some more info. However, I keep talking about the "hard" way to do it; maybe there is some other trick or workaround to do what you want that I'm not aware of. No need to accept the answer if you don't think it answers your question, I'd be glad to see someone showing a better/easier solution. – jdehesa Apr 07 '17 at 21:27
  • Thank you for the extensive description. Sure thing it is useful for me. – DalekSupreme Apr 08 '17 at 06:47
  • @DalekSupreme I've updated the answer with a link to [this question](https://stackoverflow.com/questions/39048984/tensorflow-how-to-write-op-with-gradient-in-python) that explains a better way to achieve what you want from Python (without recompilation). It uses [`tf.py_func`](https://www.tensorflow.org/api_docs/python/tf/py_func), which has some limitations, but if you are not serializing the graph or doing distributed training it should be alright. – jdehesa May 23 '17 at 10:29
  • @DalekSupreme Updated again with a link to an official guide on adding custom ops. – jdehesa Aug 17 '17 at 14:54
  • Is it true that having a skip connection (e.g. connecting the output of neurons from the 2nd layer to the inputs of neurons in the 4th layer) will not allow using the Adam optimizer? @jdehesa – MUK Mar 29 '22 at 10:06
  • @MUK I have not tried that specifically, but I have never heard of such a limitation, and I cannot think why that would be the case. – jdehesa Mar 29 '22 at 11:13
0

As mentioned in @jdehesa's comments, you can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning 1 is the right way to have no effect on the backpropagation while still letting the learning signal pass through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a Python function. So in place of the spiky function, substitute your function a, and in place of its derivative d_spiky use

def constant(x):
    # Derivative of 1 everywhere: the backward pass passes gradients through unchanged.
    return 1

So on the forward pass, a is applied in the layer, and on the backward pass a gradient of 1 is applied, which should simply pass the weight adjustments through.

You can then just create an Activation layer in Keras using this function.
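As a rough end-to-end sketch (not code from the answer): if a can be expressed with TensorFlow ops, one simple way to get the "value of a(x) on the forward pass, gradient of 1 on the backward pass" behaviour inside a Keras layer is the tf.stop_gradient trick below; the layer sizes and names are arbitrary:

    import tensorflow as tf
    from keras.layers import Input, Dense, Lambda
    from keras.models import Model

    def a(x):
        return tf.sigmoid(x)  # forward-pass function (the question's example)

    def forward_only_a(x):
        # Value is x + (a(x) - x) = a(x), but tf.stop_gradient blocks backprop
        # through (a(x) - x), so the gradient with respect to x is just 1.
        return x + tf.stop_gradient(a(x) - x)

    inputs = Input(shape=(16,))
    hidden = Dense(8)(inputs)
    outputs = Lambda(forward_only_a)(hidden)  # a applied forward, skipped backward
    model = Model(inputs, outputs)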