2

I am new to machine learning and am learning how to implement softmax in Python. I was following the thread below:

Softmax function - python

I was doing some analysis, and say we have an array

import numpy as np

batch = np.asarray([[1000, 2000, 3000, 6000], [2000, 4000, 5000, 6000], [1000, 2000, 3000, 6000]])
batch1 = np.asarray([[1, 2, 2, 6000], [2, 5, 5, 3], [3, 5, 2, 1]])

and try to implement softmax (as mentioned in the link above) via:

1) Shared by Pab Torre:

np.exp(z) / np.sum(np.exp(z), axis=1, keepdims=True)

2) Asked in initial question:

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

With both of these I am getting errors (value out of bounds), so I kind of used normalization and tried to run it:

x= np.mean(batch1)
y = np.std(batch1)
e_x = np.exp((batch1 - x)/y)
j = e_x / e_x.sum(axis = 0)

So my question to all: is this the way I should implement it? If not, how can I handle the above cases?

Thanks in advance

user6658936
  • What's the error with 2)? With the described normalization you're making a *big change*, which in many cases ruins the probabilities – Maxim Oct 24 '17 at 17:24
  • @Maxim The problem is a math range error: *e*^710 overflows the `float` limit. The given values range to 6000. – Prune Oct 24 '17 at 17:38

3 Answers

2

The method in 2) is quite stable numerically. Most likely, the error is produced by some other line. See these examples (they all run without error):

import numpy as np

def softmax(x):
  e_x = np.exp(x - np.max(x))   # shift by the max so np.exp never overflows
  return e_x / e_x.sum()

print(softmax(np.array([0, 0, 0, 0])))
print(softmax(np.array([1000, 2000, 3000, 6000])))
print(softmax(np.array([2000, 4000, 5000, 6000])))
print(softmax(np.array([1000, 2000, 3000, 6000])))
print(softmax(np.array([2000, 2000, 2001, 2000])))
print(softmax(np.array([1, 2, 2, 600000])))
print(softmax(np.array([1, 2, 2, 60000000])))
print(softmax(np.array([1, 2, 2, -60000000])))
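
For a 2-D batch like the arrays in the question, a row-wise variant is a small extension. Just a sketch: the name softmax_rows is mine, and axis=1 with keepdims=True assumes each row is one sample.

def softmax_rows(x):
  # subtract each row's max before exponentiating; the shift cancels in the ratio
  e_x = np.exp(x - np.max(x, axis=1, keepdims=True))
  return e_x / e_x.sum(axis=1, keepdims=True)

print(softmax_rows(np.asarray([[1000, 2000, 3000, 6000],
                               [2000, 4000, 5000, 6000],
                               [1000, 2000, 3000, 6000]])))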

Your alternative implementation makes all values closer to 0, which squashes the probabilities. For example:

def alternative_softmax(x):
  mean = np.mean(x)
  std = np.std(x)
  norm = (x - mean) / std       # standardize using the mean/std of the whole batch
  e_x = np.exp(norm)
  return e_x / e_x.sum(axis=0)  # axis=0 normalizes each column, not each row


print(softmax(np.array([1, 2, 2, 6000])))
print(softmax(np.array([2, 5, 5, 3])))
print(softmax(np.array([3, 5, 2, 1])))
print()

batch = np.asarray([[1, 2, 2, 6000],
                    [2, 5, 5, 3],
                    [3, 5, 2, 1]])
print(alternative_softmax(batch))

The output is:

[ 0.  0.  0.  1.]
[ 0.02278457  0.45764028  0.45764028  0.06193488]
[ 0.11245721  0.83095266  0.0413707   0.01521943]

[[ 0.33313225  0.33293125  0.33313217  0.94909178]
 [ 0.33333329  0.33353437  0.33373566  0.02546947]
 [ 0.33353446  0.33353437  0.33313217  0.02543875]]

As you can see, the outputs are very different, and the rows don't even sum up to one.

Maxim
  • Thanks Maxim. I am running the mean/std version with the input np.array([1, 2, 2, 6000]) and getting [ 0.07650087 0.07653033 0.07653033 0.77043848] (which sums to 1). – user6658936 Oct 24 '17 at 19:21
  • I think your implementation is slightly different, but that's not the point. Signal normalization can be useful, but it's a question of neural network design. My point is: there's nothing wrong if the NN gives some class exactly 1.0 probability. The softmax implementation doesn't change because of this. – Maxim Oct 24 '17 at 19:45
0

np.exp(1000) is just way too big of a number. Try using the Decimal library instead.
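
For what it's worth, a minimal sketch of that idea with the standard-library decimal module (the 30-digit precision is an arbitrary choice; Decimal.exp() is what sidesteps the float overflow):

from decimal import Decimal, getcontext

getcontext().prec = 30                    # arbitrary working precision
vals = [Decimal(v) for v in (1000, 2000, 3000, 6000)]
exps = [v.exp() for v in vals]            # e**6000 fits in a Decimal, unlike a float
total = sum(exps)
print([e / total for e in exps])          # effectively [0, 0, 0, 1]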

Mohammad Athar
  • If the values are somewhat close, `x - np.max(x)` is going to fix this. Otherwise, the result will be a one-hot vector anyway. – Maxim Oct 24 '17 at 17:47
0

Here's a simple example: two small integers, 10 and 20.

>>> import math
>>> a = 10
>>> b = 20
>>> denom = math.exp(a) + math.exp(b)
>>> math.exp(a) / denom
4.5397868702434395e-05
>>> math.exp(b) / denom
0.9999546021312976
>>> # Now, let's perform batch-norm on this ...
>>> a = -1
>>> b = 1
>>> denom = math.exp(a) + math.exp(b)
>>> math.exp(a) / denom
0.11920292202211756
>>> math.exp(b) / denom
0.8807970779778824

The results are quite different, unacceptably so. Applying batch-norm doesn't work. Look at your equation again:

j = e_x / e_x.sum(axis = 0)

... and apply it to these simple values:

j = math.exp(10) / (math.exp(10) + math.exp(20))

ANALYSIS AND PROPOSED SOLUTION

What transformation can you apply that preserves the value of j?

The problem your actual data set hits is that you're trying to represent a value range of e^5000, no matter what shift you make in the exponent values. Are you willing to drive all very-very-small values to 0? If so, you can build an effective algorithm by subtracting a constant from each exponent, until all are, say, 300 or less. This will leave you with results mathematically similar to the original.
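
As a quick check of that claim on the two-value example above, shifting both exponents by the same constant (here the max, 20) leaves j unchanged up to floating-point rounding:

>>> j_original = math.exp(10) / (math.exp(10) + math.exp(20))
>>> j_shifted = math.exp(10 - 20) / (math.exp(10 - 20) + math.exp(20 - 20))
>>> abs(j_original - j_shifted) < 1e-15
True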

Can you handle that code yourself? Find the max of the array; if it's more than 300, find the difference, diff. Subtract diff from every array element. Then do your customary softmax.
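
One sketch of that recipe (the function name shifted_softmax and the 300 cap are illustrative, following the paragraph above; anything far below the max simply underflows to 0):

import numpy as np

def shifted_softmax(x, cap=300):
    # shift every exponent down so the largest is at most `cap`;
    # the shift cancels in the final ratio, and tiny terms underflow to 0
    x = np.asarray(x, dtype=float)
    diff = max(x.max() - cap, 0.0)
    e_x = np.exp(x - diff)
    return e_x / e_x.sum()

print(shifted_softmax([1000, 2000, 3000, 6000]))   # ~[0. 0. 0. 1.]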

Prune
  • `0.9999546021312976` is an absolutely reasonable probability for `softmax([10, 20])`. This is what it **should** return! – Maxim Oct 24 '17 at 18:07
  • That's part of my point -- but showing that you get *different* values after batch-norm. – Prune Oct 24 '17 at 18:10
  • Agreed. Though this question is about how to code the softmax correctly. Whether or not to apply BN is a design question; it's entirely possible to simply make things worse. – Maxim Oct 24 '17 at 18:14
  • That's why the last paragraphs describe what to do *instead* of BN. – Prune Oct 24 '17 at 18:17
  • Prune, in your example the probabilities change with -1, 1 compared to 10, 20, so does it matter or not for softmax? – user6658936 Oct 24 '17 at 19:31
  • @user6658936 -- That's exactly the *point*; I've clarified that in the description. – Prune Oct 24 '17 at 20:27