6

I have an example of a histogram with:

mu1 = 10, sigma1 = 10
s1 = np.random.normal(mu1, sigma1, 100000)

and calculated

hist1 = np.histogram(s1, bins=50, range=(-10,10), density=True)
for i in hist1[0]:
    ent = -sum(i * log(abs(i)))
print (ent)

Now I want to compute the entropy from the histogram, but since np.histogram returns two arrays, I'm having trouble calculating it. How can I use just the first array returned by np.histogram to calculate the entropy? Even if the code above were correct, I'd still get a math domain error for the entropy. :(

**Edit:** How do I find the entropy when mu = 0, given that log(0) yields a math domain error?


So the actual code I'm trying to write is:

mu1, sigma1 = 0, 1
mu2, sigma2 = 10, 1
s1 = np.random.normal(mu1, sigma1, 100000)
s2 = np.random.normal(mu2, sigma2, 100000)

hist1 = np.histogram(s1, bins=100, range=(-20,20), density=True)
data1 = hist1[0]
ent1 = -(data1*np.log(np.abs(data1))).sum() 

hist2 = np.histogram(s2, bins=100, range=(-20,20), density=True)
data2 = hist2[0]
ent2 = -(data2*np.log(np.abs(data2))).sum() 

So far, the first example, ent1, yields nan, and the second, ent2, yields a math domain error :(

Vinci
  • Obviously, the problem is that `log(0)` is undefined. But why do you use `range=(-20,20)`? I don't think `np.random.normal(mu2, sigma2, 100000)` produces any negative numbers! In any case, as long as some of the bins contain no points, you'll get this error! – Mahdi Sep 21 '16 at 23:05
  • Hey! Thanks! I got it sorted out. I used to trim out all the 0 data! – Vinci Sep 21 '16 at 23:33
  • You're welcome! That's a good solution! If you think my answer helped, please accept the answer so the question will be marked! – Mahdi Sep 21 '16 at 23:35
  • Why do you use `abs`? The result of histogram should always be positive – Eric Mar 15 '18 at 14:17
  • "How do I find entropy when Mu = 0? and log(0) yields math domain error?" ... by **knowing** that the result should be 0, i.e. the mathematical limit as `i` goes to 0 of `i * abs(log(i))`, and special-casing it. This is effectively a math problem, not a programming problem. – Karl Knechtel Jan 11 '23 at 02:59

3 Answers

10

You can calculate the entropy using vectorized code:

import numpy as np

mu1 = 10
sigma1 = 10

s1 = np.random.normal(mu1, sigma1, 100000)
hist1 = np.histogram(s1, bins=50, range=(-10,10), density=True)
data = hist1[0]
ent = -(data*np.log(np.abs(data))).sum()
# output: 7.1802159512213191

But if you like to use a for loop, you may write:

import numpy as np
import math

mu1 = 10
sigma1 = 10

s1 = np.random.normal(mu1, sigma1, 100000)
hist1 = np.histogram(s1, bins=50, range=(-10,10), density=True)
ent = 0
for i in hist1[0]:
    ent -= i * math.log(abs(i))
print (ent)
# output: 7.1802159512213191
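Either version breaks as soon as a bin is empty, since log(0) is undefined. A zero-safe sketch (a variation on the code above, not part of the original answer) simply drops the empty bins before taking the log:

```python
import numpy as np

mu1, sigma1 = 10, 10
s1 = np.random.normal(mu1, sigma1, 100000)
hist1 = np.histogram(s1, bins=50, range=(-10, 10), density=True)

data = hist1[0]
data = data[data > 0]               # drop empty bins so log(0) never occurs
ent = -(data * np.log(data)).sum()  # density values are non-negative, so abs() is unnecessary
print(ent)
```

Dropping zero bins is mathematically justified because the limit of p*log(p) as p goes to 0 is 0, i.e. empty bins contribute nothing to the entropy.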
Mahdi
  • Thank you Mahdi for the answer! However, for me it is returning RuntimeWarning: divide by zero encountered in log ent = -(i*np.log(abs(i))).sum() RuntimeWarning: invalid value encountered in double_scalars ent = -(i*np.log(abs(i))).sum() nan – Vinci Sep 21 '16 at 21:30
  • and for the for loop, I get math domain error? What would be the problem? – Vinci Sep 21 '16 at 21:31
  • @JinJeon: Could you produce a new set of `s1` values and repeat the above code? It seems some numbers are too close to zero. – Mahdi Sep 21 '16 at 21:35
  • hmm I'm still trying, and tried your version but still returns the same nan problem and math domain error. – Vinci Sep 21 '16 at 21:47
  • Oh hmm weird. So if I try it on terminal and write it all manually again, it works. but whenever I import my .py file it shows the same error hmm :S – Vinci Sep 21 '16 at 21:52
  • Maybe showing the whole code would help. Could you please update your question!? – Mahdi Sep 21 '16 at 22:40
  • Hey yep. I just attached my actual code. Didn't try to post the actual one, but I kept having issues so. Would really appreciate if you could help me out! – Vinci Sep 21 '16 at 22:47
  • Isn't this solution only correct when the bin width is exactly 1? The histogram returns densities (not probabilities), while the entropy formula for densities requires multiplication by dx. Hence I would argue the formula in the for-loop should be multiplied by the bin width. – Jonathan Sep 29 '21 at 07:39
2

Use np.ma.log to avoid inf and nan errors. np.ma is numpy's masked-array module.
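For example (a small illustrative sketch; the histogram values here are made up), np.ma.log masks the entries where the log is undefined, and the masked entries are then skipped by sum:

```python
import numpy as np

# an already-normalized toy histogram containing an empty bin
data = np.array([0.0, 0.25, 0.5, 0.25])

# np.ma.log masks the 0.0 entry instead of producing -inf,
# so it contributes nothing to the sum
ent = -(data * np.ma.log(data)).sum()
print(ent)  # ≈ 1.0397 nats
```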

adam.hendry
2

For the ultimate copy-paste experience, I merged both existing answers (thank you all) into a more comprehensive, numpy-native approach. Hope it helps!

import numpy as np

def entropy(hist, bit_instead_of_nat=False):
    """
    Given a list of positive values as a histogram drawn from any information
    source, returns the entropy of its probability mass function. Usage example:
      hist = [513, 487]  # we tossed a coin 1000 times and this is our histogram
      print(entropy(hist, True))  # the result is approximately 1 bit
      hist = [-1, 10, 10]; hist = [0]  # these kinds of inputs trigger the warning
    """
    h = np.asarray(hist, dtype=np.float64)
    if h.sum() <= 0 or (h < 0).any():
        print("[entropy] WARNING, malformed/empty input %s. Returning None." % str(hist))
        return None
    h = h / h.sum()  # normalize counts into a probability mass function
    log_fn = np.ma.log2 if bit_instead_of_nat else np.ma.log  # masked log skips empty bins
    return -(h * log_fn(h)).sum()

Note: probability density functions and probability mass functions behave differently on discrete histograms, depending on the bin size. See the np.histogram docstring:

density : bool, optional

If False, the result will contain the number of samples in each bin. If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

Overrides the normed keyword if given.
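Following that note, when `density=True` the returned values are densities rather than probabilities, so a differential-entropy estimate needs an extra factor of the bin width (as Jonathan's comment on the accepted answer also points out). A sketch under that assumption (the seeded generator and variable names are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(0.0, 1.0, 100000)  # samples from a unit Gaussian

counts, edges = np.histogram(s, bins=100, density=True)
dx = edges[1] - edges[0]   # uniform bin width
p = counts[counts > 0]     # skip empty bins, since log(0) is undefined

# differential entropy estimate: -sum p(x) * log(p(x)) * dx
ent = -(p * np.log(p)).sum() * dx
print(ent)
```

For a unit Gaussian the true differential entropy is 0.5*log(2*pi*e) ≈ 1.4189 nats, so the printed estimate should land close to that.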

fr_andres