Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.

Here is an MWE that displays a histogram of normally distributed sample data:

import matplotlib.pyplot as plt
import numpy as np

x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()

Here bin_pos is a list giving the positions of the bin boundaries (see the related question here).

wrkyle
  • I assume you will specify the number of bins as input? – farenorth Jun 06 '16 at 04:00
  • Hmm, that's a good point. I suppose I would need to specify that. So, yes, the number of bins would be set. – wrkyle Jun 06 '16 at 04:09
  • The thing is the area will depend on the width and height. The height is given by the number of elements inside the bin, and the width will determine the interval for those elements. So if you want to change the area, the numbers on your X axis would also be changed (leaving the normal Cartesian space). Is this what you want? A non-linear X axis? Or am I interpreting this wrongly...? – armatita Jun 06 '16 at 20:04

1 Answer

I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.

Density Functions


This will do what you want for a density function:

def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)

Note the use of density=True (spelled normed=True in older Matplotlib releases; that keyword was deprecated and has since been removed), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution involves finding bins that have the same number of points.
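The equal-area claim can also be checked numerically without drawing anything. This is a minimal sketch that assumes np.histogram with density=True produces the same bar heights as plt.hist with density=True; the fixed seed and the use of np.histogram instead of plt.hist are choices made here for illustration only.

```python
import numpy as np

def histedges_equalN(x, nbin):
    # Bin edges chosen so that every bin holds the same number of points.
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
x = rng.standard_normal(1000)

edges = histedges_equalN(x, 10)
density, _ = np.histogram(x, bins=edges, density=True)

# For a density plot, height * width is the fraction of points in the bin.
areas = density * np.diff(edges)
print(areas)
```

With 1000 points and 10 bins, each bin should hold 100 points, so every entry of areas should come out at 0.1.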

[Figure: equal area density function]

Histograms


Here is a solution that gives approximately equal area boxes for a histogram:

def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))

n, bins, patches = plt.hist(x, histedges_equalA(x, 10), density=False)

These boxes, however, are not all equal in area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.

Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
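To see how far the bars are from equal area, and to compare exponents, the bar areas (count times width) can be computed directly. This is only a sketch: the power keyword and the bar_areas helper are hypothetical names introduced here, and the relative RMS deviation is one reasonable way to score the result, not necessarily the measure used for the 0.56 figure above.

```python
import numpy as np

def histedges_equalA(x, nbin, power=0.5):
    # Same approach as histedges_equalA above, with the exponent exposed
    # as a (hypothetical) keyword argument so different values can be tried.
    xs = np.sort(x)
    tmp = np.cumsum(np.diff(xs) ** power)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1), tmp, xs)

def bar_areas(x, edges):
    # Area of each histogram bar: number of points in the bin times its width.
    counts, _ = np.histogram(x, bins=edges)
    return counts * np.diff(edges)

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
x = rng.standard_normal(1000)

for power in (0.5, 0.56):
    areas = bar_areas(x, histedges_equalA(x, 10, power))
    # Relative RMS deviation of the bar areas from their mean.
    rel_rms = np.sqrt(np.mean((areas / areas.mean() - 1) ** 2))
    print(power, np.round(areas, 1), round(float(rel_rms), 3))
```

The printed areas make it easy to see which bars are the outliers; the exact numbers depend on the sample.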

[Figure: nearly equal area histogram]

The issue with equal-area histograms

As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:

[Figure: diagram demonstrating histogram bin areas]

Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
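The jump can be reproduced with a few lines of arithmetic. In this sketch the data values after the first two points and the first_bar_area helper are made up for illustration; only the outlier at -13 and the neighbour at -3 come from the example above.

```python
import numpy as np

def first_bar_area(x, right_edge):
    # Area of a bar spanning [min(x), right_edge): point count times width.
    left = x.min()
    count = np.sum((x >= left) & (x < right_edge))
    return count * (right_edge - left)

# Hypothetical data: an outlier at -13, the next point at -3, then a few more.
x = np.array([-13.0, -3.0, -2.5, -2.0, -1.0, 0.0, 1.0])

just_before = first_bar_area(x, -3.0)         # bar stops right at the second point
just_after = first_bar_area(x, -3.0 + 1e-9)   # bar barely includes the second point
print(just_before, just_after)
```

Moving the right edge by a negligible amount takes the bar from one point to two, so its area goes from 10 to just over 20 with nothing in between.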

farenorth
  • You are the personification of everything that's right with Stack Overflow. Thank you for your detailed, well-documented, and elegant solution. – wrkyle Jun 06 '16 at 22:31
  • I just wish SO was filled with Qs that were this interesting. Tho, if that were the case I'd never get anything else done... – farenorth Jun 06 '16 at 23:29
  • Amazing answer. I am trying to expand from this to know the center and upper and lower boundary of each bin. Is it possible to pull these values from matplotlib or do I need to derive a custom function? – thejahcoop Jun 16 '22 at 14:18