
I need an activation function that rounds my tensors.

The derivative (gradient) of the function round() is 0 (or None in TensorFlow), which makes it unusable as an activation function.

I am looking for a function that enforces rounding-like behaviour, so that the results of my model don't just approximate a number (my labels are integers).

I know that the composition tanh ∘ sigmoid has been used to enforce that only the values {-1, 0, 1} flow through a model, so is there some combination of differentiable functions that simulates rounding behaviour?
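
For illustration, a minimal sketch of the problem (assuming TensorFlow 1.x graph mode; the variable names are just for this example):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None])
y = tf.round(x)
print(tf.gradients(y, x))  # prints [None]: no usable gradient flows through round()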

Tissuebox
  • Do you have a finite number of labels known ahead of time, or is it open-domain integers? – modesitt Aug 07 '18 at 18:32
  • no it is not a finite number of labels – Tissuebox Aug 07 '18 at 18:34
  • If you're writing your own activation function, you can specify your own gradient. Working out the gradient of the rounding function isn't trivial, but careful thought can get you there. – PMende Aug 07 '18 at 18:34
  • @PMende rounding is inherently **[non-differentiable](http://www.wolframalpha.com/input/?i=derivative+round(x))**. Completely disagree that OP could somehow write a meaningful function to do this. His/Her best chance is to round the result afterward and go with some sort of `MSE/Huber Loss` function – modesitt Aug 07 '18 at 18:38
  • The rounding function is not differentiable. A differentiable function is in particular continuous, and if you plot the rounding function you will see it is discontinuous. – antonioACR1 Aug 07 '18 at 18:42
  • @modesitt By that logic, ReLU is also not differentiable. You're wrong. – PMende Aug 07 '18 at 18:50
  • @PMende Huh? ReLU **specifies** the derivative at 0 (its only non-differentiable point) as **0**. This is not mathematical - it is custom. There is no such way to do this with [round](https://www.desmos.com/calculator/8e0bkvxacf) (zero everywhere is the best you can do), and if you gave it some thought / knew introductory calculus it would be clear. – modesitt Aug 07 '18 at 18:53
  • @PMende The derivative of ReLU exists for `x<0` and `x>0`; there is only one exception at `x=0`, and the derivative at this point is defined to be 0 just by convention, not because it's differentiable at `x=0`. – antonioACR1 Aug 07 '18 at 18:59
  • @modesitt If you gave it some thought, and understood advanced calculus and the concept of distributions, it would be clear that there are derivatives. Using distributions as functions doesn't work in a straightforward application, but one can create approximations if this is really the behavior that's desired. – PMende Aug 07 '18 at 19:12
  • @PMende Huh, part 2? Please explain how you believe this relates to stochastic calculus. If you are so convinced there is a derivative... provide one ;). Hint... you are wrong. – modesitt Aug 07 '18 at 19:13
  • @modesitt The derivative of round is a Dirac comb with frequency 1, and phase 0.5. Each Dirac delta function can be viewed as the limit of a Gaussian becoming infinitely thin and infinitely tall. Should you want rounding behavior, and the gradient of such a function, you can pick some appropriate variance for your normalized Gaussian (a sketch of this idea follows after these comments). I'm not saying this would be computationally efficient, mind, but if it's really what someone wants, it's doable. – PMende Aug 07 '18 at 19:19
  • @PMende Actually, I am thinking of approximating rounding behaviour with a continuous function I will make myself. I will check out the variance of a normalized Gaussian, thanks for that. – Tissuebox Aug 07 '18 at 19:26
  • All this results in is having zeros everywhere except for numbers that are odd multiples of 1/2, where your fancy 'Dirac comb' with some 'selected variance' is just *a somewhat large number* picked arbitrarily. Furthermore, the decision to do this is entirely arbitrary and completely useless for training a neural network. – modesitt Aug 07 '18 at 19:29
  • @Tissuebox If you'd really like this type of behavior on the open domain of reals, you can approximate rounding as an infinite series of step functions, then use the logistic approximation (https://en.wikipedia.org/wiki/Heaviside_step_function#Analytic_approximations), similar to user322778's answer. In this case, to define the derivative, you would take your inputs, `x`, and modulo 1 them (i.e. `excess = x %1`). The particular logistic function you would pick would be centered at 0.5, and you need only choose an appropriate parameter to specify the steepness (which would be a hyperparameter). – PMende Aug 07 '18 at 19:46
  • This is exactly what I want to do, approximating rounding as an infinite series of step functions. Questions, if you have time: how do I translate that logistic (or sigmoid) function to be centered at 0.5? How do I make it infinite? – Tissuebox Aug 07 '18 at 19:57
  • @Tissuebox What if you define the derivative of your round function at `x=0.5` to be either `0` or `1` just as a convention (similarly for `x=1.5`, `x=2.5`, etc.)? I think you are making things too complicated by trying to approximate a function which is NOT differentiable per se... – antonioACR1 Aug 07 '18 at 20:05
  • I think within Keras and TensorFlow the differentiation is automatic and cannot be set manually. – Tissuebox Aug 07 '18 at 20:15
  • I'm pretty sure it's possible to set it (somehow) manually, however it would be nice if you include some initial code and a reproducible example before – antonioACR1 Aug 07 '18 at 20:27
  • Of a simple Keras model? I wanted a differentiable rounding-like function; otherwise I don't know what code you are looking for. – Tissuebox Aug 07 '18 at 20:30
  • @Tissuebox neither do I :), that's why I said "it would be nice if you include some initial code and a reproducible example before". – antonioACR1 Aug 07 '18 at 21:12
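
Following up on the Dirac-comb comment above, here is a minimal, hedged sketch of that idea: each spike of the comb is replaced by a normalized Gaussian whose standard deviation `sigma` is a freely chosen hyperparameter. The function name and the default `sigma` are illustrative, not from the thread.

import numpy as np
import tensorflow as tf

def gaussian_comb_round_grad(x, sigma=0.1):
    # The exact derivative of round(x) is a Dirac comb with spikes at the
    # half-integers; each spike is smoothed here into a normalized Gaussian
    # of standard deviation `sigma` (an arbitrarily chosen hyperparameter).
    remainder = tf.mod(x, 1)  # position inside the current unit interval
    return tf.exp(-tf.square(remainder - 0.5) / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))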

2 Answers


Maybe the cross-entropy loss with the softmax function, tf.nn.softmax_cross_entropy_with_logits_v2, is what you're looking for; see

https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2

Also have a look at

https://deepnotes.io/softmax-crossentropy
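
For reference, a minimal sketch of how this loss is typically wired up, assuming the integer labels form a known, finite set of classes (which, per the comments below, is not the asker's situation); the shapes and names are hypothetical:

import tensorflow as tf

num_classes = 10  # hypothetical: requires the set of possible integers to be known
logits = tf.placeholder(tf.float32, [None, num_classes])  # raw model outputs
labels = tf.placeholder(tf.int32, [None])                 # integer class labels

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=tf.one_hot(labels, depth=num_classes),
        logits=logits))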

antonioACR1
  • Unless I misunderstood what is said in the deepnotes link, this loss function is not the best for open-domain integers (which is what my labels are). – Tissuebox Aug 07 '18 at 19:24
  • What do you mean by open domain integers? If you mean you could have literally any integer as a label then I don't think you will find any appropriate loss function. When you fit a model to a dataset, you should already have all your labels available. Otherwise the fitting part doesn't make sense in the first place – antonioACR1 Aug 07 '18 at 19:29
  • Well, I wasn't looking for a loss function but an activation function. – Tissuebox Aug 07 '18 at 19:32

If you'd like to approximate round on the real line, you can do something like the following:

import numpy as np
import tensorflow as tf

def approx_round(x, steepness=1):
    # Keep the integer part exact; the hard jump at the x.5 boundary becomes a sigmoid whose sharpness is set by `steepness`.
    floor_part = tf.floor(x)
    remainder = tf.mod(x, 1)
    return floor_part + tf.sigmoid(steepness*(remainder - 0.5))

There are, in fact, ways to register your own gradients in TensorFlow (see, for example, this question); one way of wiring the approximation and its gradient together is sketched after the gradient function below. However, I am not as familiar with this part, as I don't use Keras/TensorFlow that often.

A function that would give you the gradient of this approximation is the following:

def approx_round_grad(x, steepness=1):
    # Derivative of approx_round away from the integers: by the chain rule,
    # d/dx sigmoid(steepness*(x % 1 - 0.5)) = steepness * sig * (1 - sig).
    remainder = tf.mod(x, 1)
    sig = tf.sigmoid(steepness*(remainder - 0.5))
    return steepness*sig*(1 - sig)

To be clear, this approximation assumes you're using a "steep enough" steepness parameter, since the sigmoid function doesn't go to exactly 0 or 1, except in the limit of large arguments.
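
As a hedged sketch of the wiring mentioned above (assuming TF >= 1.7, where tf.custom_gradient is available; pairing an exact forward round with the smooth backward gradient is one possible choice, not something this answer prescribes):

STEEPNESS = 20.0  # assumed hyperparameter, shared by the forward and backward pass

@tf.custom_gradient
def hard_round_with_smooth_grad(x):
    # Forward pass: exact rounding. Backward pass: the smooth surrogate
    # gradient of approx_round above, including the chain-rule steepness factor.
    def grad(dy):
        sig = tf.sigmoid(STEEPNESS * (tf.mod(x, 1) - 0.5))
        return dy * STEEPNESS * sig * (1 - sig)
    return tf.round(x), grad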

To do something like the half-sine approximation, you could use the following:

def approx_round_sin(x, width=0.1):
    # Same idea as approx_round, but the transition at the x.5 boundary is
    # half a sine cycle of total width `width` instead of a sigmoid.
    if width > 1 or width <= 0:
        raise ValueError('Width must be between zero (exclusive) and one (inclusive)')
    floor_part = tf.floor(x)
    remainder = tf.mod(x, 1)
    return floor_part + clipped_sin(remainder, width)

def clipped_sin(x, width):
    # 0 below the transition window, 1 above it, and half a sine cycle
    # (rescaled to [0, 1]) inside the window centered at 0.5.
    half_width = width/2
    sin_part = (1 + tf.sin(np.pi*((x - 0.5)/width)))/2
    whole = sin_part*tf.cast(tf.abs(x - 0.5) < half_width, tf.float32)
    whole += tf.cast(x > 0.5 + half_width, tf.float32)
    return whole

def approx_round_grad_sin(x, width=0.1):
    # Gradient of approx_round_sin with respect to x (away from the points
    # where tf.floor/tf.mod and the clipping are not differentiable).
    if width > 1 or width <= 0:
        raise ValueError('Width must be between zero (exclusive) and one (inclusive)')
    remainder = tf.mod(x, 1)
    return clipped_cos(remainder, width)

def clipped_cos(x, width):
    # Derivative of clipped_sin: a half cosine bump inside the transition
    # window, zero everywhere else.
    half_width = width/2
    cos_part = np.pi*tf.cos(np.pi*((x - 0.5)/width))/(2*width)
    return cos_part*tf.cast(tf.abs(x - 0.5) < half_width, dtype=tf.float32)
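
A quick smoke test of both approximations against tf.round (assuming TF 1.x graph mode, since the code above uses tf.mod; the sample values are arbitrary):

x = tf.constant([0.2, 0.49, 1.7, -1.3], dtype=tf.float32)

with tf.Session() as sess:
    print(sess.run(tf.round(x)))                     # [ 0.  0.  2. -1.]
    print(sess.run(approx_round(x, steepness=20)))   # close to round(x), except
    print(sess.run(approx_round_sin(x, width=0.1)))  # near the 0.5 boundary (0.49)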
PMende
  • As I'm not very familiar with the Keras/Tensorflow API, it's difficult for me to provide recommendations on encapsulating this functionality. In principle, you'd probably want to define the function in some way such that you don't have to make sure you're passing around your `steepness` hyperparameter to both the base function and its gradient consistently. You could possibly do this by defining a 3rd function that returns both the base function and its gradient as a tuple, for example. – PMende Aug 07 '18 at 21:18
  • Be careful. Your function is not differentiable at integer values either. – antonioACR1 Aug 07 '18 at 21:44
  • I tested it and it works; the higher the steepness value, the lower the difference from the actual round() function (I tried up to 100). Although when I tried a value too high during the actual training (like 100) there were no gradients; only when it is lower does it work. I have yet to benchmark the result. I tried it on an LSTM network trying to predict the number after the one I give (for example, if I input 23, it should output 24). – Tissuebox Aug 07 '18 at 21:46
  • @Tissuebox It works for approximating the round function, yes, but it is not differentiable mathematically speaking. Please read the formal definition of being differentiable. – antonioACR1 Aug 07 '18 at 21:50
  • It should be noted that no other activation function gives better results in this particular case, and that a lower steepness value also gives better results, even though the difference between the actual round() function and the approximation increases as the steepness decreases. – Tissuebox Aug 07 '18 at 21:51
  • @user322778 Maybe my understanding of derivatives is really low, as I have yet to go to college, but it shouldn't matter, right? Furthermore, there are no integers flowing in my model. – Tissuebox Aug 07 '18 at 21:53
  • @user322778 Yes, this is a very rough approximation. In terms of the definition of differentiable, it really depends on your definition. I recommend reading up on the theory of distributions. – PMende Aug 07 '18 at 21:55
  • @PMende Thanks. I'll take that into account. However, whenever you try to apply "theory of distributions" or whatever tool you prefer, you need to establish what exactly you understand by "differentiable". As a mathematician, it's important to make definitions and statements precise, otherwise it will lead to long discussions like above due to misunderstandings. – antonioACR1 Aug 07 '18 at 21:58
  • @PMende I don't know if you are interested in this, but this activation function has given me results I have never seen before and can't recreate with any others: it learns linear relationships, like math. I only tested addition and multiplication though, but I only give it 100 examples of data and it can add correctly (with like a 0.0006 difference every time) in the thousands. I even tried multiplying by two with 50 examples in the negatives and 50 in the positives for the training, and it freaking learned to multiply real numbers by two. I tried ALL the activation functions in Keras and none were able to. – Tissuebox Aug 08 '18 at 00:30
  • @PMende To be precise, I know it has been done before, but I haven't found a source yet that is even close to my result outside the range of training. I only give it 100 examples of small numbers (from -50 to 50 in my case) and it generalises into the thousands without any error. – Tissuebox Aug 08 '18 at 02:01
  • @Tissuebox That's super cool! Glad to hear that it's been successful for you. Indeed, not generalizing outside the support of the training data is a common problem in neural networks. It's interesting to hear that your architecture works outside your training data range! Though given the nature of the activation function and its "derivative", it makes a bit of intuitive sense to me. If you publish a paper, please feel free to mention me in your acknowledgements. :P – PMende Aug 08 '18 at 04:04
  • I should say specifically that neural networks often have difficulty generalizing outside of the range of the training data if you use bounded activation functions. – PMende Aug 08 '18 at 04:21
  • Yes, this is true. I didn't use any sigmoid or tanh, only ReLU and the rounding-like function for the last layer. I couldn't manage to make it learn decimal points, because the best results were only when the rounding was in the last layer. I will try other things though, like using half a cycle of a sin or cos function instead of the sigmoid. I am still looking for the state of the art in neural nets doing math; there is little literature on this, so I might very well publish my first paper! I wanted to ask you if I could, because you actually made the function; I will give you the credit and say you helped! – Tissuebox Aug 08 '18 at 17:02