
I know that I can plot a histogram with pandas:

import numpy as np
import pandas as pd
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1})
df4['a'].hist()


But how can I retrieve the histogram count from such a plot?

I know I can get the counts with np.histogram (from "Histogram values of a Pandas Series"):

count,division = np.histogram(df4['a'])

But computing the counts this way after calling df.hist() feels very redundant. Is it possible to get the frequency values directly from pandas?

ZK Zhao

2 Answers


The quick answer is:

pd.cut(df4['a'], 10).value_counts().sort_index()

From the documentation:

bins: integer, default 10
Number of histogram bins to be used

So look at pd.cut(df4['a'], 10).value_counts()

You will see that the counts are the same as those from np.histogram.
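To check this claim yourself, here is a minimal sketch that puts both results side by side. The seeded generator and the names cut_counts, np_counts and comparison are my own additions for reproducibility, not part of the question.

```python
import numpy as np
import pandas as pd

# Seeded data so the run is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
df4 = pd.DataFrame({'a': rng.standard_normal(1000) + 1})

# Counts per bin from pandas: 10 equal-width bins, sorted by bin position.
cut_counts = pd.cut(df4['a'], 10).value_counts().sort_index()

# Counts per bin from NumPy, also with 10 equal-width bins.
np_counts, edges = np.histogram(df4['a'], bins=10)

# Side-by-side view, indexed by the intervals that pd.cut produced.
comparison = pd.DataFrame(
    {'pd.cut': cut_counts.to_numpy(), 'np.histogram': np_counts},
    index=cut_counts.index,
)
print(comparison)
```

One caveat: pd.cut builds right-closed intervals while np.histogram uses left-closed ones, so a value sitting exactly on an interior bin edge could land in different bins. For continuous random data like this, that should not happen in practice.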

Jealie
piRSquared

This is another way to calculate a histogram in pandas. It is more complicated but IMO better since you avoid the weird stringed-bins that pd.cut returns that wreck any plot. You will also get style points for using .pipe():

(df['a']
 .pipe(lambda s: pd.Series(np.histogram(s, range=(0, 100), bins=20)))
 .pipe(lambda s: pd.Series(s[0], index=s[1][:-1]))
)

You can then pipe on more things at the end, like:

.pipe(lambda s: s/s.sum())

which will give you a distribution.
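For the data in the question, the same idea could be sketched as follows. This is only an illustration: hist_series is a hypothetical helper name, the seed is arbitrary, and I let np.histogram pick the range from the data instead of fixing range=(0, 100).

```python
import numpy as np
import pandas as pd

# Seeded data shaped like the question's df4 (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
df4 = pd.DataFrame({'a': rng.standard_normal(1000) + 1})


def hist_series(s, bins=10):
    """Hypothetical helper: histogram counts indexed by each bin's left edge."""
    counts, edges = np.histogram(s, bins=bins)
    return pd.Series(counts, index=edges[:-1])


# Raw counts per bin, with numeric (not interval) index labels.
counts = df4['a'].pipe(hist_series, bins=10)

# Fraction of observations in each bin; these sum to 1.
fractions = counts.pipe(lambda s: s / s.sum())

print(counts)
print(fractions)
```

Because the index holds plain floats (the left bin edges), the result can be plotted or joined without dealing with interval labels.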

Ideally, there'd be a sensible density option in the pandas hist method that could do this for you. Pandas does accept a density keyword (False by default), but it's nonsensical. I've read explanations a thousand times, like this one, but I've never understood it nor understood who would actually use it. 99.9% of the time when you see fractions on a histogram, you think "distribution", meaning bar heights that sum to 1, not a pdf scaled so that np.sum(pdf * np.diff(bins)) equals 1, which is what density=True actually calculates. Makes you want to weep.
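To make the difference concrete, here is a small sketch using np.histogram directly (the seed and variable names are my own). It compares the per-bin fractions with the values that density=True returns.

```python
import numpy as np

# Seeded sample similar to the question's data (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
x = rng.standard_normal(1000) + 1

counts, edges = np.histogram(x, bins=10)
density, _ = np.histogram(x, bins=10, density=True)

widths = np.diff(edges)

# What most people expect: the fraction of samples per bin, summing to 1.
fractions = counts / counts.sum()

# What density=True gives: heights whose area (height * width) sums to 1.
print(fractions.sum())
print(np.sum(density * widths))

# The two are related by the bin widths.
print(np.allclose(density * widths, fractions))
```

So multiplying the density=True values by the bin widths recovers the fractions, which is why the two only coincide when every bin has width 1.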

Alex Spangher