14

I have a sorted vector points with 100 points. I now want to create two histograms. The first should have 10 bins of equal width. The second should also have 10 bins, but not necessarily of equal width: I just want it to have the same number of points in each bin. So, for example, the first bar might be very short and wide, while the second bar might be very tall and narrow. I have code that creates the first histogram using matplotlib, but I'm not sure how to go about creating the second one.

import matplotlib.pyplot as plt
points = [1,2,3,4,5,6, ..., 99]
n, bins, patches = plt.hist(points, 10)

Edit:

Trying the solution below, I'm a bit puzzled as to why the heights of all of the bars in my histogram are the same.

[Image: the resulting histogram, in which all bars have the same height]

Apollo
  • 4
    Of course the height of all the bars is the same if each bin contains the same number of points, because the height of a bar is the number of points in that bin (by definition of a histogram). See the edit to the accepted answer, which says the same thing. – Nikana Reklawyks Feb 10 '17 at 20:51

4 Answers

22

This question is similar to one that I wrote an answer to a while back, but sufficiently different to warrant its own question. The solution, it turns out, uses basically the same code as my other answer.

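import numpy as np
import matplotlib.pyplot as plt

def histedges_equalN(x, nbin):
    # Bin edges chosen so that each bin holds (roughly) the same number of points
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

x = np.random.randn(100)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10))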

This solution gives a histogram with bins of equal height because, by definition, a histogram is a count of the number of points in each bin.

To get a pdf (i.e. a density function), pass density=True to plt.hist (this kwarg was called normed=True in older Matplotlib releases; that name has since been removed), as described in my other answer.
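The difference between counts and density can be checked without plotting. The following is a minimal sketch, assuming NumPy's np.histogram in place of plt.hist (both accept a density flag) and a fixed random seed chosen only for reproducibility; histedges_equalN is repeated from above so the block runs on its own.

```python
import numpy as np

def histedges_equalN(x, nbin):
    # Same function as in the answer, repeated so this block is self-contained.
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1), np.arange(npt), np.sort(x))

rng = np.random.default_rng(0)   # seed is an arbitrary choice for this sketch
x = rng.standard_normal(100)

edges = histedges_equalN(x, 10)
counts, _ = np.histogram(x, edges)
density, _ = np.histogram(x, edges, density=True)
widths = np.diff(edges)

print(counts)                    # the same count in every bin
print(density)                   # count / (total * width), so it varies with bin width
print(np.sum(density * widths))  # total area under the bars, approximately 1
```

With density enabled, narrow bins become tall and wide bins become short, which is the picture described in the question.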

farenorth
  • I've tried using your solution, but for some reason the height of all of my bins is the same. Do you know why this is? I'd expect them to definitely be different, no? I've edited my question to include a picture. – Apollo Sep 10 '16 at 20:50
  • @Apollo, the method effectively changes the range each bin covers to achieve exactly that, so that every bin has the same count. With a fixed bin width, some bins would contain more points than others, but this way each bin has the same total number. What can appear counter-intuitive is the result if you attempt to integrate over that domain. – Vass Aug 13 '18 at 03:17
  • @farenorth, great answer! Vectorized and elegant. A quick REPL of `np.histogram(x, histedges_equalN(x, 10))` clearly showed me the results achieving what OP (and I) was looking for. – benjaminmgross Jan 06 '19 at 15:35
  • 4
    feels like this is really a [quantile](https://en.wikipedia.org/wiki/Quantile) question, and solution is reproducing np.quantile functionality: `np.quantile(x, np.linspace(0,1,nbin+1))` does the trick – keith gould Mar 06 '19 at 17:56
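A quick way to compare the two approaches is to compute both sets of edges on the same sample. This is only a sketch, assuming np.quantile's default linear interpolation and an arbitrary fixed seed; histedges_equalN is copied from the answer.

```python
import numpy as np

def histedges_equalN(x, nbin):
    # Copied from the answer so that this block runs on its own.
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1), np.arange(npt), np.sort(x))

rng = np.random.default_rng(1)   # seed is an arbitrary choice for this sketch
x = rng.standard_normal(100)

edges_interp = histedges_equalN(x, 10)
edges_quantile = np.quantile(x, np.linspace(0, 1, 11))

counts_interp, _ = np.histogram(x, edges_interp)
counts_quantile, _ = np.histogram(x, edges_quantile)

print(counts_interp)             # counts per bin using the np.interp edges
print(counts_quantile)           # counts per bin using the np.quantile edges
# The outer edges agree (min and max of the data); the interior edges are
# placed at slightly different positions by the two methods.
print(np.abs(edges_interp - edges_quantile).max())
```

For this sample both sets of edges put the same number of points in each bin, even though the interior edges are not identical.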
1

Provide the bins to the histogram (note the integer division, which is required for slicing in Python 3):

bins = points[0::len(points) // 10]

and then

n, bins, patches = plt.hist(points, bins=bins)

(provided points is sorted)

jlarsch
  • This almost works, but if the stride isn't exact, the last bin will be missed. In that case you can't just append the last point either, because the number of elements left over won't necessarily be close to the stride. For instance, if you have 100 elements and want 12 bins, you will have a stride of 8 and will end up accounting for only 97 of the 100 elements. If you add the last point to "bins", that extra bin will contain only 3 elements. – Shawn Mar 15 '17 at 20:42
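The 100-element, 12-bin case from this comment can be reproduced numerically. The sketch below assumes points is simply the integers 0..99 (the question does not give the real data) and uses np.quantile, as suggested in a comment on the accepted answer, as one possible alternative to stride-based edges.

```python
import numpy as np

points = np.arange(100)          # assumed stand-in for 100 sorted points
nbins = 12

# Stride-based edges, as in the answer: every 8th point.
stride = len(points) // nbins
stride_edges = points[0::stride]
stride_counts, _ = np.histogram(points, stride_edges)
print(stride_counts.sum())       # 97: points above the last edge are not counted

# Quantile-based edges always span min..max, so every point is counted.
quantile_edges = np.quantile(points, np.linspace(0, 1, nbins + 1))
quantile_counts, _ = np.histogram(points, quantile_edges)
print(quantile_counts.sum())     # 100
print(quantile_counts)           # bin sizes differ by at most one here
```

Because 12 does not divide 100, no choice of edges gives exactly equal counts; the quantile edges spread the remainder across the bins instead of dropping it.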
0

Here is an example of how you could get the result. My approach takes the bin edges from the data points themselves and passes them to np.histogram to construct the histogram, hence the need to sort the data using np.argsort(x). The number of points per bin is controlled with npoints. As an example, I construct two histograms using this method. In the first, every point has the same weight, so the height of the histogram is constant (and equal to npoints). In the second, the "weight" of each point is drawn from a uniform random distribution (see the mass array). As expected, the bars of that histogram are no longer equal. However, the Poisson error per bin is the same.

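import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(1000)
mass = np.random.rand(1000)
npoints = 200
ksort = np.argsort(x)

# Take the bin edges from the data set: every npoints-th point in sorted order.
# Note that the data need to be sorted.
bins = x[ksort[0::npoints]]
# Close the last bin at the largest data value
bins = np.append(bins, x[ksort[-1]])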


fig = plt.figure(1,figsize=(10,5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

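# Histogram where each data point has the same weight
yhist, xhist = np.histogram(x, bins, weights=None)
# drawstyle (not linestyle) selects step plotting in current Matplotlib
ax1.plot(0.5*(xhist[1:]+xhist[:-1]), yhist, drawstyle='steps-mid', lw=2, color='k')

# Histogram where each data point is weighted by its mass
yhist, xhist = np.histogram(x, bins, weights=mass)
ax2.plot(0.5*(xhist[1:]+xhist[:-1]), yhist, drawstyle='steps-mid', lw=2, color='k')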

ax1.set_xlabel('x', size=15)
ax1.set_ylabel('Number of points per bin', size=15)

ax2.set_xlabel('x', size=15)
ax2.set_ylabel('Mass per bin', size=15)

[Image: the two resulting histograms, number of points per bin (left) and mass per bin (right)]

Alejandro
0

This solution is not as elegant, but it works for me. Hope it helps.

import numpy as np
import pandas as pd

def pyAC(x, npoints=10, RetType='abs'):
    # npoints is the number of bins; RetType is unused and kept only for compatibility
    x = np.sort(x)
    binCount = len(x) // npoints  # number of data points in each bin
    bins = np.zeros(npoints)      # largest value in each bin (its upper edge)
    binsX = np.zeros(npoints)     # mean of the values in each bin
    for i in range(npoints):
        # Slicing avoids indexing past the end of x and includes every point of the bin
        chunk = x[i * binCount:(i + 1) * binCount]
        bins[i] = chunk[-1]
        binsX[i] = chunk.mean()
    return pd.DataFrame({'bins': bins, 'binsX': binsX})
Kiann
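When the number of points is not a multiple of the number of bins, a loop based on a fixed binCount leaves the remainder out. One possible variant, sketched here with np.array_split, distributes the remainder across the bins. The helper name equal_count_summary and its return values are hypothetical, chosen only for this illustration, and the seed is arbitrary.

```python
import numpy as np

def equal_count_summary(x, nbins=10):
    """Hypothetical helper: per-bin upper edge, mean, and size for equal-count bins."""
    chunks = np.array_split(np.sort(x), nbins)   # chunk sizes differ by at most one
    upper = np.array([c[-1] for c in chunks])    # largest value in each bin
    mean = np.array([c.mean() for c in chunks])  # mean of the values in each bin
    size = np.array([len(c) for c in chunks])    # number of points in each bin
    return upper, mean, size

rng = np.random.default_rng(2)   # seed is an arbitrary choice for this sketch
x = rng.random(103)              # a length that 10 does not divide evenly
upper, mean, size = equal_count_summary(x)
print(size)                      # sums to 103, so every point is assigned to a bin
```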