
This question branches off from another question/answer.

I want a function equivalent to this:

import numpy as np

def softmax(x, tau):
    """ Returns softmax probabilities with temperature tau
        Input:  x -- 1-dimensional array
        Output: s -- 1-dimensional array
    """
    e_x = np.exp(x / tau)
    return e_x / e_x.sum()

which is stable and robust, i.e. it doesn't overflow for small values of tau, nor for large x. Since this will be used to compute probabilities, the output should sum to 1.

In other words, I am passing in some values (and a temperature) and I want as output an array of probabilities "scaled" with the input and tau.

Examples:

In [3]: softmax(np.array([2,1,1,3]), 1)
Out[3]: array([ 0.22451524,  0.08259454,  0.08259454, 0.61029569])

In [5]: softmax(np.array([2,1,1,3]), 0.1)
Out[5]: array([  4.53978685e-05,   2.06106004e-09,   2.06106004e-09,   9.99954598e-01])

In [7]: softmax(np.array([2,1,1,3]), 5)
Out[7]: array([ 0.25914361,  0.21216884,  0.21216884,  0.31651871])

So as tau goes towards 0, almost all the probability mass concentrates at the position of the largest element. As tau grows larger, the probabilities become closer to one another (approaching uniform).
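For context, the only stabilisation I know of is the usual max-subtraction trick. A sketch of what I mean (the name `softmax_shifted` is mine); I'm not sure this is fully robust, since `x / tau` can already overflow for tiny `tau` before the shift happens:

```python
import numpy as np

def softmax_shifted(x, tau):
    """Softmax with temperature tau, using the max-subtraction trick."""
    z = x / tau            # may itself overflow for very small tau -- my concern
    z = z - np.max(z)      # largest entry becomes 0, so np.exp cannot overflow
    e_z = np.exp(z)
    return e_z / e_z.sum()
```

This reproduces the outputs above for moderate `tau`, but I'd like to know whether it (or something better) is safe in the extreme cases.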


Optionally, some questions about the linked answer. There, Neil gives the following alternative:

import numpy as np

def nat_to_exp(q):
    max_q = max(0.0, np.max(q))
    rebased_q = q - max_q
    return np.exp(rebased_q - np.logaddexp(-max_q, np.logaddexp.reduce(rebased_q)))

However, this output does not sum to 1, and the explanation is that the function returns a categorical distribution which only has N-1 free parameters, the last one being 1 - sum(others). But upon running it, I notice that for a vector of length 3 it returns a vector of length 3. So where is the missing one? Can I make it equivalent to the example above?
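To illustrate, here is the check I ran with Neil's function as given. If I understand the explanation correctly, the shortfall from 1 should be the probability of an implicit extra component whose natural parameter is 0 (the `p_missing` computation is my guess at it):

```python
import numpy as np

def nat_to_exp(q):
    max_q = max(0.0, np.max(q))
    rebased_q = q - max_q
    return np.exp(rebased_q - np.logaddexp(-max_q, np.logaddexp.reduce(rebased_q)))

q = np.array([2.0, 1.0, 1.0, 3.0])
p = nat_to_exp(q)
print(p.sum())  # ~0.97 for me, not 1

# My guess at the "missing" probability: an implicit component with
# natural parameter 0, normalised by the same denominator.
max_q = max(0.0, np.max(q))
log_z = np.logaddexp(-max_q, np.logaddexp.reduce(q - max_q))
p_missing = np.exp(-max_q - log_z)
print(p.sum() + p_missing)  # this does come to 1
```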

Why is that answer stable? How does one get from the simple softmax formula to it?


Possibly related question: General softmax but without temperature

  • I already explained to you that I am omitting the component from the input and the output. The reason you do that is because the softmax converts the natural parameters of the categorical distribution to the expectation parameters. The natural parameters are additive. If you keep the extra component in the input, you lose this important property. – Neil G Jan 27 '17 at 20:00
  • Also, this question should be on stats.stackexchange. – Neil G Jan 27 '17 at 20:00
  • And, since you're interested in adding the temperature, you'll find this is a lot easier to implement without the extra component since in this case, temperature is implemented by dividing the natural parameters. – Neil G Jan 27 '17 at 20:01
  • Neil, I'm sorry I didn't phrase it correctly. I understood _why_ you removed a component, but I don't see how. I don't see `q[:-1]` or anything similar. If you're omitting it, why are there the same number of outputs as inputs? Is component not the same as array element? – Ciprian Tomoiagă Jan 27 '17 at 20:14
  • I'm not passing it in to begin with. Where is your `q` coming from that you have that component? Normally, `q` comes from something like multinomial logistic regression, or from signals in a Boltzmann machine – Neil G Jan 27 '17 at 20:15
  • I oscillated between stats and SO but because I'm more interested in the implementation I chose SO. – Ciprian Tomoiagă Jan 27 '17 at 20:15

0 Answers