Matplotlib: How to convert a histogram to a discrete probability mass function?

Question

I have a question regarding the hist() function in matplotlib.

I am writing code to plot a histogram of data whose values vary from 0 to 1. For example:

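import numpy as np
import matplotlib.pyplot as plt

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]

bins = np.arange(0, 1.1, 0.1)
# density=False is the current spelling of normed=0, which newer Matplotlib releases no longer accept
a, b, c = plt.hist(values, bins=bins, density=False)
plt.show()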

The code above generates a correct histogram (I could not post an image since I do not have enough reputation). In terms of frequencies, it looks like:

[0 0 2 0 1 1 0 0 1 1]

I would like to convert this output to a discrete probability mass function, i.e. for the above example, I would like to get the following values:

[ 0.  0.  0.333333333  0.  0.166666667  0.166666667  0.  0.  0.166666667  0.166666667 ] # each item in the previous array divided by 6

I thought I simply needed to change the parameter in the hist() function to 'normed=1'. However, I get the following histogram frequencies:

[ 0.  0.  3.33333333  0.  1.66666667  1.66666667  0.  0.  1.66666667  1.66666667 ]

This is not what I expected, and I don't know how to get the discrete probability mass function, whose sum should be 1.0. A similar question was asked in the following link (link to the question), but I do not think the question was resolved.

Thank you in advance for your help.

Are you sure you didn't just miss an "e-2" at the end of that output? — Drew Hall, Jul 31 '12 at 23:16
Actually, the answer (and the comments) given in your link are correct: it is the *integral* over the histogram that equals 1. In your example, take the value of each bar, multiply it by the bar width, and add them all up. You'll find it is 1 (leaving out the bars with 0: 10/3*0.1 + 5/3*0.1 + 5/3*0.1 + 5/3*0.1 + 5/3*0.1 = 30/3*0.1 = 1). The underlying numpy routine works that way. You may have to play around with numpy.histogram and a bar plot to get what you want. — , Jul 31 '12 at 23:27
Hi Drew, No, I just copied the output of plt.hist(...). So it was supposed to be like that. But thank you for your comment! — Kotaro, Aug 01 '12 at 00:13
Hi Evert, Hmmm, I see. I wonder why numpy does it though. And given that the bin width is specified (0.1), I wish it did the calculation that you mentioned automatically :| Thank you very much for your comment! — Kotaro, Aug 01 '12 at 00:15
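
The arithmetic in the comment above about the integral can be checked numerically. The following is a minimal sketch, assuming a NumPy version that accepts `density=True` (the current name for what the question calls `normed=1`); the names `widths`, `mass`, and `total` are chosen here for illustration.

```python
import numpy as np

values = [0.21, 0.51, 0.41, 0.21, 0.81, 0.99]
bins = np.arange(0, 1.1, 0.1)

# density=True normalises so that the area under the bars is 1
# (assumed equivalent to the normed=1 call in the question)
density, edges = np.histogram(values, bins=bins, density=True)
widths = np.diff(edges)

# Bar height times bar width is the probability mass in each bin
mass = density * widths
total = mass.sum()

print(density)
print(mass)
print(total)
```

Here `mass` should match the per-bin fractions the question asks for, and `total` should equal 1 up to floating-point rounding.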

score 7 · Accepted Answer · answered Aug 01 '12 at 08:15

The reason is that normed=True gives the probability density function. In probability theory, a probability density function, or density, of a continuous random variable describes the relative likelihood for this random variable to take on a given value.

Let us consider a very simple example.

import numpy as np

x = np.arange(0.1, 1.1, 0.1)
# array([ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

# Bin size 0.1
bins = np.arange(0.05, 1.15, 0.1)
np.histogram(x, bins=bins, density=True)[0]   # density=True was normed=1 in older NumPy
# [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]
np.histogram(x, bins=bins)[0] / float(len(x))
# [ 0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1,  0.1]

# Change the bin size to 0.2
bins = np.arange(0.05, 1.15, 0.2)
np.histogram(x, bins=bins, density=True)[0]
# [ 1.,  1.,  1.,  1.,  1.]
np.histogram(x, bins=bins)[0] / float(len(x))
# [ 0.2,  0.2,  0.2,  0.2,  0.2]

As you can see above, the probability that x lies in [0.05, 0.15] or [0.15, 0.25] is 1/10, whereas if you change the bin size to 0.2, the probability that it lies in [0.05, 0.25] or [0.25, 0.45] is 1/5. These probability values depend on the bin size, but the probability density does not. This is why normalising to a density is the proper way to do it; otherwise one would need to state the bin width on each plot.
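
The relationship between the two normalisations can be written out explicitly. The symbols are introduced here for illustration and do not appear in the original answer: $n_i$ is the count in bin $i$, $N$ the total number of samples, $\Delta_i$ the width of bin $i$, $p_i$ the per-bin probability, and $d_i$ the density.

```latex
p_i = \frac{n_i}{N}, \qquad
d_i = \frac{n_i}{N\,\Delta_i} = \frac{p_i}{\Delta_i}, \qquad
\sum_i d_i\,\Delta_i = \sum_i p_i = 1 .
```

For the example above, $N = 10$. With $\Delta_i = 0.1$ each bin holds $n_i = 1$ sample, so $p_i = 0.1$ and $d_i = 1$. With $\Delta_i = 0.2$ each bin holds $n_i = 2$ samples, so $p_i = 0.2$ while $d_i$ is still $1$.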

So in your case, if you really want to plot the probability value for each bin (and not the probability density), you can simply divide the count in each bin by the total number of elements. However, I would suggest not doing this unless you are working with discrete variables and each of your bins represents a single possible value of the variable.
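
As a minimal sketch of that division, using the data from the question: the first option divides the raw counts by the number of samples, and the second reaches the same result by giving each sample a weight of 1/N. The names `pmf`, `weights`, and `pmf_weighted` are illustrative, and the sketch assumes every value falls inside the bin range.

```python
import numpy as np

values = np.array([0.21, 0.51, 0.41, 0.21, 0.81, 0.99])
bins = np.arange(0, 1.1, 0.1)

# Option 1: count per bin, then divide by the number of samples
counts, edges = np.histogram(values, bins=bins)
pmf = counts / len(values)

# Option 2: weight every sample by 1/N so the bin totals are already fractions
weights = np.full(len(values), 1.0 / len(values))
pmf_weighted, _ = np.histogram(values, bins=bins, weights=weights)

print(pmf)
print(pmf.sum())
```

The same `weights` array can be passed to Matplotlib as `plt.hist(values, bins=bins, weights=weights)`, which should draw bars with these heights directly.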

Hi imsc, I realized I somehow did not submit my comments. Thank you, your answer helped! — Kotaro, Aug 26 '12 at 05:45

score 0 · Answer 2 · answered Sep 13 '17 at 16:58

This blog post explains in detail how to plot a continuous probability density function (PDF) from a histogram in Python: http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html. Alternatively, you can use the code below.

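import numpy as np
import matplotlib.pyplot as plt

# A is the 1-D data array; placeholder sample data here, replace with your own
A = np.random.normal(size=1000)

counts, bins, patches = plt.hist(A, 40, histtype='bar')
plt.show()

width = bins[1] - bins[0]
n = counts / len(A)    # fraction of samples in each bin
mu = np.mean(A)        # fit the normal curve to the data, not to the bin heights
sigma = np.std(A)
plt.bar(bins[:-1], n, width=width, align='edge')
# normal density times bin width approximates the expected fraction per bin
y1 = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)) * width
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()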
