91

I would like to compare two histograms by having the Y axis show the percentage of each column from the overall dataset size instead of an absolute value. Is that possible? I am using Pandas and matplotlib. Thanks

Yushan ZHANG
d1337
  • 10
    Add `normed=True` to your `plt.hist()`. – Rutger Kassies Jul 26 '13 at 06:26
  • 1
    Thanks! For some reason that option is not documented at http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.hist.html#pandas.DataFrame.hist . However, I am getting values on the Y axis that are equal to or greater than 1 (e.g., 1.4). Any idea how that's possible? My intuition was that once normalized, the values should be between 0 and 1. – d1337 Jul 26 '13 at 07:38
  • 5
    The 'normed' kwarg is deprecated and has been replaced by 'density'. – InLaw Aug 18 '18 at 15:34

7 Answers

102

With density=True (normed=True for matplotlib < 2.2.0), the histogram is normalized so that its area is 1: np.sum(pdf * np.diff(bins)) equals 1, where pdf holds the bar heights. If you want the bar heights themselves to sum to 1, you can use NumPy's histogram() and normalize the results yourself.

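import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(30)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Left: density normalization (the area under the bars is 1)
ax[0].hist(x, density=True, color='grey')

# Right: divide the counts by their total so the bar heights sum to 1
hist, bins = np.histogram(x)
ax[1].bar(bins[:-1], hist / hist.sum(), width=np.diff(bins), align='edge', color='grey')

ax[0].set_title('density=True')
ax[1].set_title('hist = hist / hist.sum()')

plt.show()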

[Image: two histograms of x side by side; the left one normalized by density, the right one normalized by hist.sum()]

Btw: Strange plotting glitch at the first bin of the left plot.
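
To make the difference between the two normalizations concrete, here is a minimal NumPy-only sketch. The seed, sample size, and bin count are arbitrary choices for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
x = rng.standard_normal(1000)

# Raw counts and the density-normalized heights on the same bin edges
counts, bins = np.histogram(x, bins=10)
density, _ = np.histogram(x, bins=bins, density=True)

widths = np.diff(bins)
fractions = counts / counts.sum()    # manual normalization: heights sum to 1

area = np.sum(density * widths)      # density normalization: area sums to 1
total = fractions.sum()

print(area, total)
```

Multiplying each density height by its bin width gives back the per-bin fraction, which is why the two plots have the same shape but different y-axis scales.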

Yossarian
Rutger Kassies
  • 1
    awesome (and such a good example of how to use subfigures, too) – grasshopper Nov 12 '13 at 11:51
  • 4
    Could you please explain why pandas behaves this way? I'm a little confused. I think most people would go for the sum = 1 way. – ZK Zhao Oct 06 '16 at 13:57
  • 10
    **normed** is deprecated in matplotlib version 2.2.0; use the **density** keyword argument instead. https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html – Sherlock Mar 19 '18 at 08:28
  • 3
    The argument `density=True` does not normalize the histogram by the total count. That is, the heights of bars will not sum to 1 (it's rather the height*width that sums to 1 when `density=True`, and this is not what people think when they say normalize the histogram). To normalize histogram, see this https://github.com/matplotlib/matplotlib/issues/10398/#issuecomment-366021979 or this https://stackoverflow.com/a/16399202 – Princy Apr 10 '21 at 00:16
  • It's interesting that `normed=True` has been deprecated. It was an intuitive and useful param; I used to use it frequently. – lmart999 Feb 21 '22 at 23:51
33

Pandas plotting can accept any extra keyword arguments from the respective matplotlib function. So for completeness from the comments of others here, this is how one would do it:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100,2), columns=list('AB'))

df.hist(density=1)

Also, for direct comparison this may be a good way as well:

df.plot(kind='hist', density=1, bins=20, stacked=False, alpha=.5)
InLaw
ryanskeith
23

Looks like @CarstenKönig found the right way:

df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))
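
The weights trick works because each sample then contributes 100/N to its bin instead of 1. matplotlib's hist hands the weights on to numpy.histogram, so the effect can be checked without plotting. This is only a sketch; the sample values and the bin count below are made up.

```python
import numpy as np

# Made-up sample data: four values fall in each half of the range [1, 10]
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5, 10.0])

# Every sample carries a weight of 100 / N, so bin totals are percentages
weights = np.ones_like(values) * 100.0 / len(values)

percent, edges = np.histogram(values, bins=2, weights=weights)
print(percent)        # per-bin percentage of the dataset
print(percent.sum())  # all bins together account for the whole dataset
```

Dropping the factor of 100 gives fractions between 0 and 1 instead of percentages.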
Community
hobs
  • 2
    I think the `100` has been misplaced. The correct version is `df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))`, in case you were thinking of something from 0 to 100. – fcpenha Apr 07 '17 at 18:54
  • Indeed. Good catch! Corrected. – hobs Apr 08 '17 at 03:28
19

I know this answer comes 6 years late, but to anyone using density=True (the replacement for normed=True): it may not do what you want. It normalizes the whole distribution so that the total area of the bars is 1. So if your bins are narrower than 1, the bar heights on the y-axis can exceed 1. If you want to bound your histogram to [0, 1], you will have to calculate it yourself.
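
A small sketch of that effect, using made-up data with one value in each of five bins of width 0.1 (the data and the range are assumptions for illustration):

```python
import numpy as np

# Made-up data: one value per bin, bins of width 0.1 covering [0, 0.5]
x = np.array([0.05, 0.15, 0.25, 0.35, 0.45])

density, edges = np.histogram(x, bins=5, range=(0.0, 0.5), density=True)

# Each bin holds 1/5 of the data, so its height is (1/5) / 0.1, i.e. about 2
print(density)

# Multiplying by the bin widths recovers fractions bounded by [0, 1]
fractions = density * np.diff(edges)
print(fractions)
```

The heights are about 2 even though every bin holds only a fifth of the data, because the area (height times width), not the height, is what sums to 1.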

anon
  • 1
    Or maybe `df["col"].plot.hist(ax=ax, cumulative=True, weights=list(100*numpy.ones_like(df.index)/len(df.index)))` together with something like `ax.yaxis.set_major_formatter(plt.FuncFormatter('{:.0f}%'.format))`. Not a walk in the park, but could be made to work. – PatrickT Dec 28 '21 at 05:16
13

You can simplify the weighting using np.ones_like():

df["ColumnName"].plot.hist(weights = np.ones_like(df.index) / len(df.index))
  • np.ones_like() is okay with the df.index structure
  • len(df.index) is faster for large DataFrames
  • 2
    For some reason that command gave me the error `ValueError: weights should have the same shape as x` (matplotlib 3.0.3). The command that worked for me was `df["ColumnName"].plot.hist(weights = list(np.ones_like(df.index) / len(df.index)))` – Jean Paul Sep 02 '19 at 11:07
4

I see this is an old question but it shows up on top for some searches, so I think as of 2021 seaborn would be an easy way to do this.

You can do something like this:

import seaborn as sns
sns.histplot(df, stat="probability")
Misam Abbas
1

In some scenarios you can adapt with a barplot:

tweets_df['label'].value_counts(normalize=True).plot(figsize=(12,12), kind='bar')
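
For categorical data, value_counts(normalize=True) already returns the share of each category, so no histogram weights are needed. A minimal sketch with a made-up label column (the Series below is hypothetical and stands in for tweets_df['label']):

```python
import pandas as pd

# Hypothetical labels standing in for tweets_df['label']
labels = pd.Series(["pos", "neg", "pos", "neutral", "pos", "neg"])

# Fraction of rows per category; the fractions sum to 1
shares = labels.value_counts(normalize=True)

# Scale to percentages if the y-axis should read 0 to 100
percent = shares * 100
print(percent)
```

Calling `.plot(kind='bar')` on either `shares` or `percent` then draws the bars at those heights.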