Is there a parameter in matplotlib/pandas to have the Y axis of a histogram as percentage?

Question

I would like to compare two histograms by having the Y axis show the percentage of each column from the overall dataset size instead of an absolute value. Is that possible? I am using Pandas and matplotlib. Thanks

Thanks! For some reason that option is not documented at http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.hist.html#pandas.DataFrame.hist . However, I am getting values on the Y axis that are equal to or greater than 1 (e.g., 1.4). Any idea how that's possible? My intuition was that once normalized, the values should be between 0 and 1. – d1337, Jul 26 '13 at 07:38
The 'normed' kwarg is deprecated and has been replaced by 'density'. – InLaw, Aug 18 '18 at 15:34

score 102 · Accepted Answer · edited Feb 19 '19 at 17:44

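Setting density=True (normed=True for matplotlib < 2.2.0) returns a histogram for which np.sum(pdf * np.diff(bins)) equals 1, where pdf is the array of bar heights. In other words, the total area of the bars is 1, not the sum of their heights. If you want the bar heights themselves to sum to 1, you can use NumPy's histogram() and normalize the results yourself.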

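import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(30)

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Left: density=True scales the bars so that their total area is 1
ax[0].hist(x, density=True, color='grey')

# Right: divide the counts by their sum so that the bar heights sum to 1
hist, bins = np.histogram(x)
ax[1].bar(bins[:-1], hist / hist.sum(), width=np.diff(bins), align='edge', color='grey')

ax[0].set_title('density=True')
ax[1].set_title('hist = hist / hist.sum()')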

[Figure: two side-by-side histograms of x. Left panel: ax.hist with density=True. Right panel: bar plot of hist / hist.sum().]

Btw: Strange plotting glitch at the first bin of the left plot.

answered Jul 26 '13 at 09:01 by Rutger Kassies · edited Feb 19 '19 at 17:44 by Yossarian

awesome (and such a good example of how to use subfigures, too) – grasshopper Nov 12 '13 at 11:51
Could you please explain why pandas behaves this way? I'm a little confused. I think most people would go for the sum = 1 way. – ZK Zhao Oct 06 '16 at 13:57
**normed** is deprecated in matplotlib version 2.2.0; use the **density** keyword argument instead. https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html – Sherlock Mar 19 '18 at 08:28
The argument `density=True` does not normalize the histogram by the total count. That is, the heights of the bars will not sum to 1 (it is rather height*width that sums to 1 when `density=True`, and this is not what people mean when they say "normalize the histogram"). To normalize the histogram, see this https://github.com/matplotlib/matplotlib/issues/10398/#issuecomment-366021979 or this https://stackoverflow.com/a/16399202 – Princy Apr 10 '21 at 00:16
It's interesting that `normed=True` has been deprecated. It was an intuitive and useful param; I used to use it frequently. – lmart999 Feb 21 '22 at 23:51

score 33 · Answer 2 · edited Aug 18 '18 at 16:11

Pandas plotting can accept any extra keyword arguments from the respective matplotlib function. So for completeness from the comments of others here, this is how one would do it:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100,2), columns=list('AB'))

df.hist(density=1)

Also, for direct comparison this may be a good way as well:

df.plot(kind='hist', density=1, bins=20, stacked=False, alpha=.5)
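If the two datasets have different sizes, the comparison can also be drawn with per-dataset weights rather than density. The following is only a sketch under assumptions: the arrays a and b are made-up placeholder data, and the shared bin edges are a choice made here so that the bars line up. It is not taken from the answer above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=200)    # hypothetical small dataset
b = rng.normal(0.5, 1.2, size=5000)   # hypothetical large dataset

# Shared bin edges so that the bars of both datasets cover the same intervals
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=20)

fig, ax = plt.subplots()
heights = {}
for name, data in {"a": a, "b": b}.items():
    # Each sample counts for 100/len(data), so bar heights are percentages of that dataset
    w = np.full(len(data), 100.0 / len(data))
    heights[name], _, _ = ax.hist(data, bins=edges, weights=w, alpha=0.5, label=name)

ax.set_ylabel("% of dataset")
ax.legend()
```

Because each dataset is weighted by its own size, the bar heights of each one sum to 100 regardless of how many rows it has. Drop the matplotlib.use("Agg") line and call plt.show() to view the plot interactively.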

score 23 · Answer 3 · edited May 23 '17 at 12:26

Looks like @CarstenKönig found the right way:

df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))
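The same weighting idea can be combined with matplotlib's PercentFormatter so that the tick labels carry a percent sign. This is a sketch with placeholder random data, using plain matplotlib rather than df.hist; the variable names are illustrative only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

rng = np.random.default_rng(0)
x = rng.normal(size=500)  # placeholder data

fig, ax = plt.subplots()
# Each sample contributes 1/len(x), so bar heights are fractions of the dataset
heights, edges, _ = ax.hist(x, bins=20, weights=np.ones(len(x)) / len(x))
# xmax=1 tells the formatter that a value of 1.0 corresponds to 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))
ax.set_ylabel("share of samples")
```

Here the weights keep the values in the range 0 to 1 and the formatter only changes how the ticks are displayed. With weights already scaled by 100, as in the line above, PercentFormatter's default xmax=100 would be the matching setting.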

answered Jan 13 '16 at 01:34 by hobs · edited May 23 '17 at 12:26 by Community

I think the `100` has been misplaced. The correct version is `df.hist(bins=20, weights=np.ones_like(df[df.columns[0]]) * 100. / len(df))`, in case you were thinking of something from 0 to 100. – fcpenha Apr 07 '17 at 18:54
Indeed. Good catch! Corrected. – hobs Apr 08 '17 at 03:28

score 19 · Answer 4 · answered Nov 20 '19 at 04:09

I know this answer comes 6 years late, but for anyone using density=True (the replacement for normed=True): it may not do what you want. It normalizes the whole distribution so that the total area of the bars is 1. So if your bins are narrower than 1, you can expect heights greater than 1 on the y-axis. If you want to bound your histogram to [0, 1], you will have to calculate it yourself.
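A quick numerical check of this, as a sketch with made-up data (the sample, its scale and the bin count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Narrow spread, so with 20 bins every bin is much narrower than 1
x = rng.normal(scale=0.1, size=1000)

# density=True: heights are probability densities, so height * width sums to 1
density, edges = np.histogram(x, bins=20, density=True)
area = np.sum(density * np.diff(edges))

# Manual normalization: each height is the fraction of samples in that bin
counts, _ = np.histogram(x, bins=20)
fractions = counts / counts.sum()

print(f"area under density histogram: {area:.3f}")
print(f"largest density height:       {density.max():.2f}")
print(f"sum of fractions:             {fractions.sum():.3f}")
print(f"largest fraction:             {fractions.max():.3f}")
```

The density heights exceed 1 here because the bins are narrow, while the manually normalized fractions stay within [0, 1] and sum to 1.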

answered Nov 20 '19 at 04:09 by anon

Or maybe `df["col"].plot.hist(ax=ax, cumulative=True, weights=list(100*numpy.ones_like(df.index)/len(df.index)))` together with something like `ax.yaxis.set_major_formatter(plt.FuncFormatter('{:.0f}%'.format))`. Not a walk in the park, but could be made to work. – PatrickT Dec 28 '21 at 05:16

score 13 · Answer 5 · answered May 13 '17 at 08:45

You can simplify the weighting using np.ones_like():

df["ColumnName"].plot.hist(weights = np.ones_like(df.index) / len(df.index))

np.ones_like() is okay with the df.index structure
len(df.index) is faster for large DataFrames

answered May 13 '17 at 08:45 by Christoph Schranz

For some reason that command gave me the error `ValueError: weights should have the same shape as x` (matplotlib 3.0.3). The command that worked for me was `df["ColumnName"].plot.hist(weights = list(np.ones_like(df.index) / len(df.index)))` – Jean Paul Sep 02 '19 at 11:07

score 4 · Answer 6 · answered Jun 30 '21 at 14:25

I see this is an old question, but it shows up at the top of some searches. As of 2021, seaborn would be an easy way to do this.

You can do something like this:

import seaborn as sns
sns.histplot(df,stat="probability")

answered Jun 30 '21 at 14:25 by Misam Abbas

score 1 · Answer 7 · answered Aug 25 '22 at 12:40

In some scenarios you can adapt with a barplot:

tweets_df['label'].value_counts(normalize=True).plot(figsize=(12,12), kind='bar')
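For categorical data, the numbers behind such a bar plot can be inspected directly. A small sketch, where the labels Series is a hypothetical stand-in for tweets_df['label']:

```python
import pandas as pd

# Hypothetical categorical column standing in for tweets_df['label']
labels = pd.Series(["pos", "neg", "pos", "neutral", "pos", "neg"], name="label")

# normalize=True returns fractions of the total; multiply by 100 for percentages
pct = labels.value_counts(normalize=True) * 100
print(pct.round(1))
```

Calling pct.plot(kind='bar') would then draw the percentages instead of fractions.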

answered Aug 25 '22 at 12:40 by João Vitor Gomes
