I am trying to make a very simple histogram with matplotlib.pyplot.hist, and it does not seem to count the values in each bin correctly. Here is my code:

    import numpy as np
    import matplotlib.pyplot as plt
    plt.hist([.2,.3,.5,.6],bins=np.arange(0,1.1,.1))

I am dividing the interval [0,1] into bins of width .1, so I should get four bars of height 1. But the output figure has only two bars of height 2: the .3 value is counted in the [.2,.3) bin and, similarly, the .6 value is counted in the [.5,.6) bin. I have tried it in both Spyder and Google Colab. Does anyone know what's going on? Thanks!

  • The problem is that the values fall just on the boundaries of the bins. Floating point rounding can put them in either the previous or the next bin. You need `plt.hist([.2,.3,.5,.6],bins=np.arange(-0.05,1.1,.1))` for nicely separated bins. Note that matplotlib's histogram is primarily meant for continuous distributions where floating point rounding doesn't have such large effects. – JohanC Aug 31 '20 at 23:03

3 Answers

The problem is that the values fall exactly on the boundaries of the bins, so floating point rounding can put them in either the previous or the next bin. For example, `np.arange(0, 1.1, .1)` produces the edge 0.30000000000000004 rather than 0.3, so the value .3 lands in the bin that starts at .2. You need bin boundaries that lie clearly between the data points. Note that matplotlib's histogram is primarily meant for continuous distributions, where floating point rounding doesn't have such large effects.

Here is some code to illustrate what's happening in both situations:

    import numpy as np
    import matplotlib.pyplot as plt

    data = [.2, .3, .5, .6]

    fig, axes = plt.subplots(ncols=2, figsize=(12, 4))

    for ax in axes:
        if ax is axes[0]:
            # left panel: bin edges coincide with the data values
            bins = np.arange(0, 1.1, .1)
            ax.set_title('data on bin boundaries')
        else:
            # right panel: bin edges shifted by half a bin width
            bins = np.arange(-0.05, 1.1, .1)
            ax.set_title('data between bin boundaries')
        values, bin_bounds, bars = ax.hist(data, bins=bins, alpha=0.3)

        # dotted lines mark the bin edges, dots mark the data points
        ax.vlines(bin_bounds, 0, max(values), color='crimson', ls=':')
        ax.scatter(data, np.full_like(data, 0.5), color='lime', s=30)
        ax.set_ylim(0, 2.2)
        ax.set_yticks(range(3))
    plt.show()

[Figure: illustrating plot with two panels, 'data on bin boundaries' and 'data between bin boundaries']
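
To see the rounding without a plot, one option is to print the relevant edges at full precision and count with `np.histogram`, which matplotlib's `hist` uses for binning. This is only a sketch: the variable names (`edges`, `shifted_edges`, and so on) are chosen here for illustration and are not part of the original code.

```python
import numpy as np

data = [0.2, 0.3, 0.5, 0.6]

# Edges as in the question; each edge is the result of a floating point computation.
edges = np.arange(0, 1.1, 0.1)
for i in (2, 3, 5, 6):
    # repr shows the full value instead of NumPy's shortened display
    print(i, repr(float(edges[i])))

# Same half-open rule as plt.hist: [left, right) for every bin but the last.
counts, _ = np.histogram(data, bins=edges)
print("edges on the data:", counts)

# Edges shifted by half a bin width, so no data point sits on an edge.
shifted_edges = np.arange(-0.05, 1.1, 0.1)
shifted_counts, _ = np.histogram(data, bins=shifted_edges)
print("edges between the data:", shifted_counts)
```

The edges at indices 3 and 6 come out slightly larger than 0.3 and 0.6, which is why those two values are counted one bin lower than expected.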

JohanC

Another way to work around this issue seems to be to use the same floating point precision for your input data as the histogram uses internally to assign the numbers to bins.

Normally Python uses 64-bit floats, but this histogram implementation seems to assign the bins after converting them to 32-bit precision.

Therefore, it seems to be possible to achieve the expected result by explicitly inserting 32-bit floats:

    import numpy as np
    import matplotlib.pyplot as plt
    data = np.array([.2,.3,.5,.6], dtype=np.float32)
    plt.hist(data, bins=np.arange(0.0, 1.1, 0.1))

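
Whatever conversion happens internally, one thing that can be checked directly is where the 32-bit values sit relative to the 64-bit edges. The sketch below widens the float32 data back to float64 and compares each value with the edge it is supposed to start from; `data32`, `widened` and `counts32` are names made up for this illustration, and `np.histogram` stands in for `plt.hist`.

```python
import numpy as np

edges = np.arange(0.0, 1.1, 0.1)
data32 = np.array([0.2, 0.3, 0.5, 0.6], dtype=np.float32)

# Widening to float64 exposes the exact values stored in the float32 array.
widened = data32.astype(np.float64)
for value, i in zip(widened, (2, 3, 5, 6)):
    # Each value should be at or above the left edge of its intended bin.
    print(repr(float(value)), ">=", repr(float(edges[i])), bool(value >= edges[i]))

counts32, _ = np.histogram(data32, bins=edges)
print(counts32)
```

For these four numbers the float32 rounding happens to land at or above the matching edge, so each value gets its own bin. That looks like a coincidence of rounding direction rather than a general fix: float32 0.7, for instance, is roughly 0.69999999, which is below the corresponding edge.
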
moooeeeep

From the docs:

If bins is a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins may be unequally spaced. All but the last (righthand-most) bin is half-open. In other words, if bins is:

[1, 2, 3, 4]

then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.

Because the intervals are half-open (closed on the left, open on the right), both .2 and .3 fall in the same bin, and .5 and .6 in another bin.

You should fix the bins by moving the boundaries a little to avoid the numbers falling on the edges.

89f3a1c
  • Thanks, but what you're saying is not correct: precisely because the intervals are half-open, in my setting .2 and .3 do not fall in the same bin. As JohanC pointed out, the origin of my problem is floating-point rounding. – Guillem Pérez-Nadal Aug 31 '20 at 23:59