197

I'm using matplotlib to make a histogram.

Is there any way to manually set the size of the bins as opposed to the number of bins?

Mark Amery
Sam Creamer

9 Answers

343

Actually, it's quite easy: instead of the number of bins you can give a list with the bin boundaries. They can be unequally distributed, too:

plt.hist(data, bins=[0, 10, 20, 30, 40, 50, 100])

If you just want them equally distributed, you can simply use range:

plt.hist(data, bins=range(min(data), max(data) + binwidth, binwidth))

Added to original answer

The above line works for data filled with integers only. As macrocosme points out, for floats you can use:

import numpy as np
plt.hist(data, bins=np.arange(min(data), max(data) + binwidth, binwidth))
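
Since a few comments ask where `binwidth` comes from: it is simply the bin size you pick yourself. Here is a minimal sketch with made-up data and an assumed `binwidth` of 2.0. It uses `np.histogram`, which accepts the same `bins` argument, so the edges and counts can be inspected without drawing anything.

```python
import numpy as np

# Made-up sample data; binwidth is the bin size you want (an assumed value).
data = np.array([1.2, 2.5, 3.1, 4.8, 7.3, 9.9])
binwidth = 2.0

# Edges start at min(data) and step by binwidth until they pass max(data).
edges = np.arange(data.min(), data.max() + binwidth, binwidth)
counts, _ = np.histogram(data, bins=edges)

print(edges)   # evenly spaced edges, binwidth apart
print(counts)  # number of samples falling in each bin
```

The same `edges` array can be passed straight to `plt.hist(data, bins=edges)`.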
Community
CodingCat
  • 29
    replace range(...) with np.arange(...) to get it to work with floats. – macrocosme Aug 25 '14 at 08:42
  • Additional question, how can I draw the axis to see the value of each bin? Now I can only see `10..20..30..` – ZK Zhao Aug 10 '15 at 06:08
  • 9
    What is the binwidth here? Have you set that value before? – UserYmY Sep 29 '15 at 13:25
  • 4
    I believe binwidth in this example could be found by: `(data.max() - data.min()) / number_of_bins_you_want`. The `+ binwidth` could be changed to just `1` to make this a more easily understood example. – Jarad Jan 22 '18 at 17:31
  • 4
    Further to CodingCat's excellent solution above, for float data, if you want the histogram bars centred around integer x-ticks instead of having the bar boundaries at the x-ticks, try the following tweak: bins = np.arange(dmin - 0.5, dmax + 0.5 + binwidth, binwidth) – DaveW Aug 13 '18 at 13:59
  • 4
    option ``lw = 5, color = "white"`` or similar inserts white gaps between bars – PatrickT Nov 07 '18 at 12:19
  • I'm using a plot with only 3 values.. because of the way range works (ignoring the last number, i.e. range(1,3) generates [1,2] ) I had to add 2 bandwidth instead of 1 in the second argument. so it should be `plt.hist(data, bins=np.arange(min(data), max(data) + 2*binwidth, binwidth))` – Lucas Azevedo Feb 19 '20 at 19:35
  • @LucasAzevedo How so? If your max value lies exactly on a bin edge, it will be counted in the last bin (expected behaviour). If you do not want that (exclude that value from the previous bin like all the bins before exclude the upper edge), add something between 1 and 2 binwidths (1.1*binwidth for example). If you add 2 binwidths you run the very real risk of having another empty bin, if your max value does not exactly fall onto a bin edge. Or you simply use np.arange(1,3.1) like I do. ;) – BUFU Jul 07 '20 at 10:19
16

For N bins, the bin edges are specified by a list of N+1 values: the first N give the lower bin edges and the last one gives the upper edge of the last bin.

Code:

import numpy as np
from pylab import *

bin_size = 0.1; min_edge = 0; max_edge = 2.5
# linspace needs an integer number of points, so convert N explicitly
N = int(round((max_edge - min_edge) / bin_size)); Nplus1 = N + 1
bin_list = np.linspace(min_edge, max_edge, Nplus1)

Note that linspace produces an array of N+1 values running from min_edge to max_edge, which defines N bins.
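
To make the edge convention concrete, here is a small sketch using `np.histogram` with made-up values that sit exactly on the edges. Every bin is half-open except the last one, which also includes its upper edge.

```python
import numpy as np

edges = [0, 1, 2, 3]    # N + 1 = 4 edges define N = 3 bins
values = [0, 1, 2, 3]   # one value sitting exactly on each edge

counts, returned_edges = np.histogram(values, bins=edges)
print(counts)  # [1 1 2]: the value 3 lands in the last bin, [2, 3]
```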

aloha
Alef
  • 3
    Note that bins are inclusive of their lower bound and exclusive of their upper bound, with the exception of the N+1 (last) bin which is inclusive of both bounds. – lukewitmer Mar 01 '16 at 17:59
  • 1
    @lukewitmer this should have been highlighted somewhere. I spent literally hours debugging my huge histogram because the graph didn't correspond to reality. I was assuming that both 0, and N+1 are either exclusive or inclusive. – kukis Jun 14 '23 at 11:41
11

I use quantiles to build bins that each hold an equal share of the sample:

bins=df['Generosity'].quantile([0,.05,0.1,0.15,0.20,0.25,0.3,0.35,0.40,0.45,0.5,0.55,0.6,0.65,0.70,0.75,0.80,0.85,0.90,0.95,1]).to_list()

plt.hist(df['Generosity'], bins=bins, density=True, alpha=0.5, histtype='stepfilled', color='steelblue', edgecolor='none')
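
For anyone without the DataFrame at hand, the same idea can be sketched with plain NumPy. The random `sample` below is only a stand-in for `df['Generosity']`, and `np.quantile` with `np.linspace(0, 1, 21)` replaces the hand-written list of quantile levels.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)   # stand-in for df['Generosity']

# 21 quantile levels from 0 to 1 in steps of 0.05 give 20 bins.
edges = np.quantile(sample, np.linspace(0, 1, 21))
counts, _ = np.histogram(sample, bins=edges)

print(counts)          # roughly the same count in every bin
print(np.diff(edges))  # widths differ: narrow where the data is dense
```

The bins have equal population rather than equal width, so with density scaling the bars have equal area.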

Wojciech Moszczyński
  • 2
    Great idea. You could replace the list of quantiles by `np.arange(0, 1.01, 0.05)` or `np.linspace(0, 1, 21)`. There are no edges, but I understand the boxes have equal area, but different width in X axis? – Tomasz Gandor Jun 13 '20 at 20:18
  • 1
    note: normed : bool, optional Deprecated; use the density keyword argument instead. – Daniel Böckenhoff Feb 03 '23 at 14:42
6

I guess the easy way would be to calculate the minimum and maximum of the data you have, then calculate L = max - min. Then you divide L by the desired bin width (I'm assuming this is what you mean by bin size) and use the ceiling of this value as the number of bins.
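
A rough sketch of that calculation, with made-up data and an assumed desired width of 3.0. Note that once you pass a bin count, the bins are spread evenly over the data range, so the actual width ends up at most the desired one rather than exactly equal to it.

```python
import math
import numpy as np

data = np.array([0.0, 1.5, 4.0, 7.2, 10.0])   # made-up data
desired_width = 3.0                            # assumed bin width

span = data.max() - data.min()                 # L = max - min
n_bins = math.ceil(span / desired_width)       # ceil(10 / 3) = 4

# Same edges that a histogram call with bins=n_bins would use.
edges = np.histogram_bin_edges(data, bins=n_bins)
print(n_bins)   # 4
print(edges)    # edges at 0, 2.5, 5, 7.5, 10 (actual width 2.5)
```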

Il-Bhima
4

I had the same issue as OP (I think!), but I couldn't get it to work in the way that Lastalda specified. I don't know if I have interpreted the question properly, but I have found another solution (it probably is a really bad way of doing it though).

This was the way that I did it:

plt.hist([1,11,21,31,41], bins=[0,10,20,30,40,50], weights=[10,1,40,33,6]);

Which creates this:

image showing histogram graph created in matplotlib

So the first parameter basically 'initialises' the bin - I'm specifically creating a number that is in between the range I set in the bins parameter.

To demonstrate this, look at the array in the first parameter ([1,11,21,31,41]) and the 'bins' array in the second parameter ([0,10,20,30,40,50]):

  • The number 1 (from the first array) falls between 0 and 10 (in the 'bins' array)
  • The number 11 (from the first array) falls between 10 and 20 (in the 'bins' array)
  • The number 21 (from the first array) falls between 20 and 30 (in the 'bins' array), etc.

Then I'm using the 'weights' parameter to define the height of each bin's bar. This is the array used for the weights parameter: [10,1,40,33,6].

So the 0 to 10 bin is given the value 10, the 10 to 20 bin is given the value of 1, the 20 to 30 bin is given the value of 40, etc.
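
The effect of the weights can be checked without plotting. This sketch feeds the same three arrays to `np.histogram`, which takes the same `bins` and `weights` arguments; the variable names are only for illustration.

```python
import numpy as np

bin_edges = [0, 10, 20, 30, 40, 50]
placeholders = [1, 11, 21, 31, 41]   # one dummy value inside each bin
totals = [10, 1, 40, 33, 6]          # already-summed counts per bin

counts, _ = np.histogram(placeholders, bins=bin_edges, weights=totals)
print(counts)  # each bin's height equals its weight: 10, 1, 40, 33, 6
```

Each placeholder contributes its weight instead of 1, which is why the bars come out at the pre-computed totals.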

bluguy
  • 4
    I think you have a basic misunderstanding how the histogram function works. It expects raw data. So, in your example, your data array should contain 10 values between 0 and 10, 1 value between 10 and 20, and so on. Then the function does the summing-up AND the drawing. What you're doing above is a workaround because you already have the sums (which you then insert into the graph by misusing the "weights" option). Hope this clears up some confusion. – CodingCat Dec 01 '17 at 15:29
4

I like things to happen automatically and for bins to fall on "nice" values. The following seems to work quite well.

import numpy as np
import numpy.random as random
import matplotlib.pyplot as plt
def compute_histogram_bins(data, desired_bin_size):
    min_val = np.min(data)
    max_val = np.max(data)
    min_boundary = -1.0 * (min_val % desired_bin_size - min_val)
    max_boundary = max_val - max_val % desired_bin_size + desired_bin_size
    n_bins = int(round((max_boundary - min_boundary) / desired_bin_size)) + 1  # round first: int() alone can truncate 23.999... to 23
    bins = np.linspace(min_boundary, max_boundary, n_bins)
    return bins

if __name__ == '__main__':
    data = np.random.random_sample(100) * 123.34 - 67.23
    bins = compute_histogram_bins(data, 10.0)
    print(bins)
    plt.hist(data, bins=bins)
    plt.xlabel('Value')
    plt.ylabel('Counts')
    plt.title('Compute Bins Example')
    plt.grid(True)
    plt.show()

The result has bins on nice intervals of bin size.

[-70. -60. -50. -40. -30. -20. -10.   0.  10.  20.  30.  40.  50.  60.]

computed bins histogram

scopchanov
  • Exactly what I was looking for! However, in some cases n_bins is rounded down due to floating point precision. E.g. for `desired_bin_size=0.05`, `min_boundary=0.850`, `max_boundary=2.05` the calculation of `n_bins` becomes `int(23.999999999999993)` which results in 23 instead of 24 and therefore one bin too few. A rounding before integer conversion worked for me: `n_bins = int(round((max_boundary - min_boundary) / desired_bin_size, 0)) + 1` – M. Schlenker Oct 23 '19 at 11:39
2

This answer supports @macrocosme's suggestion.

I am drawing a heat map with hist2d. I also set cmin=0.5 so that cells with no counts are left blank, and cmap for the colours; the _r suffix selects the reversed version of the given colormap.

Describe statistics. enter image description here

# np.arange(data.min(), data.max()+binwidth, binwidth)
bin_x = np.arange(0.6, 7 + 0.3, 0.3)
bin_y = np.arange(12, 58 + 3, 3)
plt.hist2d(data=fuel_econ, x='displ', y='comb', cmin=0.5, cmap='viridis_r', bins=[bin_x, bin_y]);
plt.xlabel('Displacement (l)');
plt.ylabel('Combined fuel efficiency (mpg)');

plt.colorbar();
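
If you want to check the 2D binning itself, `np.histogram2d` accepts the same pair of edge arrays. This is only a sketch: the random `displ` and `comb` arrays are synthetic stand-ins for the `fuel_econ` columns, which are not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)
displ = rng.uniform(0.6, 7.0, size=500)   # synthetic stand-in for fuel_econ['displ']
comb = rng.uniform(12, 58, size=500)      # synthetic stand-in for fuel_econ['comb']

bin_x = np.arange(0.6, 7 + 0.3, 0.3)
bin_y = np.arange(12, 58 + 3, 3)

counts, x_edges, y_edges = np.histogram2d(displ, comb, bins=[bin_x, bin_y])
print(counts.shape)   # (len(bin_x) - 1, len(bin_y) - 1)
```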

code_conundrum
2

If you also care about the visual side, you can add edgecolor='white', linewidth=2 to get visibly separated bins:

date_binned = new_df[(new_df['k']>0)&(new_df['k']<360)]['k']
plt.hist(date_binned, bins=range(min(date_binned), max(date_binned) + binwidth, binwidth), edgecolor='white', linewidth=2)

DataYoda
0

For a histogram with integer x-values I ended up using

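plt.hist(data, np.arange(min(data)-0.5, max(data)+1.5))  # arange excludes its stop, so +1.5 keeps the edge at max+0.5
plt.xticks(range(min(data), max(data)+1))  # +1 so the tick at max(data) is included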

The offset of 0.5 centers the bins on the x-axis values. The plt.xticks call adds a tick for every integer.
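
A quick check of the centred edges with made-up integer data, using `np.histogram` instead of a plot. Because `np.arange` excludes its stop value, this sketch uses `max + 1.5` as the stop so that the final edge at `max + 0.5` is kept and the largest value is not dropped.

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3])   # made-up integer data

# Unit-width bins from min - 0.5 up to max + 0.5, one per integer value.
edges = np.arange(data.min() - 0.5, data.max() + 1.5)
counts, _ = np.histogram(data, bins=edges)

print(edges)    # [0.5 1.5 2.5 3.5]
print(counts)   # [2 1 3]
```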

Adversus