7

I want to draw a histogram and a line plot on the same graph. To do that, I need the histogram to be a probability mass function, so the y-axis should show probability values. I don't know how to do that, because using the `normed` option didn't help. Below are my source code and a sample of the data. I would be very grateful for any suggestions.

data = [12565, 1342, 5913, 303, 3464, 4504, 5000, 840, 1247, 831, 2771, 4005, 1000, 1580, 7163, 866, 1732, 3361, 2599, 4006, 3583, 1222, 2676, 1401, 2598, 697, 4078, 5016, 1250, 7083, 3378, 600, 1221, 2511, 9244, 1732, 2295, 469, 4583, 1733, 1364, 2430, 540, 2599, 12254, 2500, 6056, 833, 1600, 5317, 8333, 2598, 950, 6086, 4000, 2840, 4851, 6150, 8917, 1108, 2234, 1383, 2174, 2376, 1729, 714, 3800, 1020, 3457, 1246, 7200, 4001, 1211, 1076, 1320, 2078, 4504, 600, 1905, 2765, 2635, 1426, 1430, 1387, 540, 800, 6500, 931, 3792, 2598, 5033, 1040, 1300, 1648, 2200, 2025, 2201, 2074, 8737, 324]
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.xlabel("Incomes")
plt.hist(data, bins=50, color="blue", alpha=0.5, normed=True)
plt.show()
mmdanziger
Ziva
  • What do you mean by *the `normed` option didn't help*? And what exactly is your question? How to normalize the distribution? Or how to plot a line over a histogram? – hitzg Jun 17 '15 at 11:16
  • @hitzg My question is exactly what I wrote: "I want to have on the y-axis a probability values." According to the documentation, the `normed` option doesn't guarantee that the values on the y-axis are probabilities (they don't add up to 1). – Ziva Jun 17 '15 at 11:36
  • `normed` is deprecated for `hist()`. Use the `density` keyword argument instead. – Funny Geeks Nov 03 '20 at 20:21

2 Answers

9

As far as I know, matplotlib does not have this function built in. However, it is easy enough to replicate:

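    import numpy as np
    import matplotlib.pyplot as plt

    # Bin the data, then rescale the counts so the bar heights sum to 1
    heights, bins = np.histogram(data, bins=50)
    heights = heights / sum(heights)
    plt.bar(bins[:-1], heights, width=(max(bins) - min(bins)) / len(bins), color="blue", alpha=0.5)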

Edit: Here is another approach from a similar question:

    # Each observation gets weight 1/N, so the bar heights sum to 1
    weights = np.ones_like(data, dtype=float) / len(data)
    plt.hist(data, bins=50, weights=weights, color="blue", alpha=0.5)
mmdanziger
  • When you pass `normed=True` it does exactly that: `values = values / sum(values)` – Primer Jun 17 '15 at 14:11
  • 7
    No, it doesn't. It produces a probability density function, so that the bin width multiplied by the height sums to one. See, e.g., http://stackoverflow.com/questions/3866520/plotting-histograms-whose-bar-heights-sum-to-1-in-matplotlib – mmdanziger Jun 17 '15 at 14:28
  • Looking at the [source](https://github.com/matplotlib/matplotlib/blob/2c4aa6d3609e6ab8f35821f0162c96d04febc65a/lib/matplotlib/axes/_axes.py#L5803) it sure looks like it takes values per bin and divides it by sum of all values, doesn't it? – Primer Jun 17 '15 at 16:58
  • 3
    `m = (m.astype(float) / db) / m.sum()` is the relevant line. That `db` makes all the difference: it makes the integral of f(x)dx equal one, approximating a continuous distribution. The OP wants the values f(x) to sum to one, approximating a discrete distribution. If the bin widths are equal to 1, the two definitions coincide. Otherwise, you need to do something like my answer. Look up probability mass function vs. density function for more details. – mmdanziger Jun 17 '15 at 17:22
  • @mmdanziger Thank you for your answer! The first solution works very well and is very helpful. Of course, I will also check the second suggestion. I just added an extra `float` to the division, because I got zeros instead of float values. – Ziva Jun 17 '15 at 20:13
  • With Python 2, you should make it a habit to include `from __future__ import division` at the beginning of your files or interactive sessions, to avoid unexpected behavior such as this. – mmdanziger Jun 17 '15 at 20:27
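The mass-versus-density distinction discussed in the comments above can be checked numerically. The following is a minimal sketch using only `np.histogram`; the ten-value `sample` is made up from a few of the question's numbers, and the choice of 5 bins is arbitrary.

```python
import numpy as np

# Illustrative sample (a handful of values taken from the question's data)
sample = np.array([303, 840, 831, 1247, 1342, 3464, 4504, 5000, 5913, 12565], dtype=float)

counts, edges = np.histogram(sample, bins=5)
widths = np.diff(edges)

# Probability mass: the bar heights themselves sum to 1
mass = counts / counts.sum()

# Probability density: the bar areas (height * width) sum to 1
density, _ = np.histogram(sample, bins=edges, density=True)

print(mass.sum())                           # total of the heights
print((density * widths).sum())             # total of the areas
print(np.allclose(density, mass / widths))  # density is mass divided by bin width
```

Because the density is the mass divided by the bin width, the two only coincide when every bin has width 1.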
1

This is old, but I found it and was about to use it when I noticed a couple of mistakes, so I figured I'd add some fixes. In the example, @mmdanziger passes the bin edges to `plt.bar`, but you actually need to use the bin centers. The example also assumes that the bins are of equal width, which is fine "most" of the time. You can instead pass an array of widths, which keeps you from inadvertently making a mistake. So here's a more complete example:

    import numpy as np
    import matplotlib.pyplot as plt

    heights, bins = np.histogram(data, bins=50)
    heights = heights / sum(heights)
    # Place each bar at the center of its bin and give it that bin's own width
    bin_centers = 0.5 * (bins[1:] + bins[:-1])
    bin_widths = np.diff(bins)
    plt.bar(bin_centers, heights, width=bin_widths, color="blue", alpha=0.5)

@mmdanziger's other option of passing `weights = np.ones_like(data)/len(data)` to `plt.hist()` does the same thing, and for many it is the easier approach.
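For the original goal of a histogram and a line on the same axes, the weights approach might be wrapped up as below. This is only a sketch: the helper name `plot_pmf_with_line` is hypothetical, and the line simply redraws the bar heights at the bin centers as a placeholder for whatever curve you want to compare against.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_pmf_with_line(data, bins=50):
    """Draw a histogram whose bar heights sum to 1, plus a line over it."""
    data = np.asarray(data, dtype=float)
    # Each observation contributes 1/N, so heights are relative frequencies
    weights = np.full(data.shape, 1.0 / data.size)

    fig, ax = plt.subplots()
    heights, edges, _ = ax.hist(data, bins=bins, weights=weights,
                                color="blue", alpha=0.5)

    # Placeholder line: the same probabilities drawn at the bin centers.
    # Swap `heights` for any other per-bin probabilities you want to overlay.
    centers = 0.5 * (edges[1:] + edges[:-1])
    ax.plot(centers, heights, color="black", linewidth=1)

    ax.set_xlabel("Incomes")
    ax.set_ylabel("Probability")
    return fig, ax, heights


# First ten values from the question's data, with an arbitrary 5 bins
sample = [12565, 1342, 5913, 303, 3464, 4504, 5000, 840, 1247, 831]
fig, ax, heights = plot_pmf_with_line(sample, bins=5)
```

Call `plt.show()` afterwards to display the figure. Since both the bars and the line are on a probability scale, they share one y-axis without rescaling.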