How can I generate a frequency table (or histogram) for a single Series? For example, if I have my_series = pandas.Series([1,2,2,3,3,3]), how can I get a result like {1: 1, 2: 2, 3: 3} - that is, a count of how many times each value appears in the Series?

4 Answers
Maybe .value_counts()?
>>> import pandas
>>> my_series = pandas.Series([1,2,2,3,3,3, "fred", 1.8, 1.8])
>>> my_series
0 1
1 2
2 2
3 3
4 3
5 3
6 fred
7 1.8
8 1.8
>>> counts = my_series.value_counts()
>>> counts
3 3
2 2
1.8 2
fred 1
1 1
>>> len(counts)
5
>>> sum(counts)
9
>>> counts["fred"]
1
>>> dict(counts)
{1.8: 2, 2: 2, 3: 3, 1: 1, 'fred': 1}
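Two related conveniences worth knowing here: counts.to_dict() gives the {value: count} dict from the question directly, and for numeric data value_counts can also bin values on the fly. A small sketch using the question's original series:
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

print(s.value_counts().to_dict())   # {3: 3, 2: 2, 1: 1}, ordered by count
print(s.value_counts(bins=2))       # bucket the values into 2 equal-width bins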

- `.value_counts().sort_index()`, to prevent the first column possibly getting slightly out-of-order – smci Apr 17 '13 at 12:12
- Is there an equivalent for DataFrame, rather than Series? I tried running .value_counts() on a df and got `AttributeError: 'DataFrame' object has no attribute 'value_counts'` – Mittenchops May 03 '13 at 14:07
- @Mittenchops: see "value_counts on dataframe" – Shankar ARUL Jan 28 '15 at 12:30
- Is there an easy way to convert these counts to proportions? – dsaxton Jul 31 '15 at 23:53
- `my_series.value_counts() / np.sum(my_series.value_counts())` – Eoin Sep 02 '15 at 12:20
- @Mittenchops you should use value_counts on a Series object; alternatively, if you have a DataFrame `df`, I think you can use `df.apply(lambda x: x.value_counts(dropna=False))` – latorrefabian Jan 22 '16 at 00:46
- `dropna=False` will also count NaNs – latorrefabian Jan 22 '16 at 00:46
- @dsaxton you can use .value_counts(normalize=True) to convert the results to proportions – Max Power Nov 30 '16 at 21:01
- To use this on a DataFrame instead, convert it into its equivalent 1-D numpy array representation, like `pd.value_counts(df.values.ravel())`, which returns a Series whose `index` and `values` attributes contain the unique elements and their counts respectively. – Nickil Maveli Dec 20 '16 at 10:04
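Pulling these comment tips together, a small sketch (the DataFrame df here is made up for illustration):
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 3],
                   "b": ["x", "x", "y", None]})

# proportions instead of raw counts
print(df["a"].value_counts(normalize=True))

# per-column counts for a whole DataFrame (value_counts is defined on Series)
print(df.apply(lambda col: col.value_counts(dropna=False)))

# counts pooled over every value in the DataFrame; the top-level pd.value_counts
# is deprecated in recent pandas, so the Series spelling is the safer choice
print(pd.Series(df.values.ravel()).value_counts())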
You can use a list comprehension on a DataFrame to count the frequencies of its columns like this (note that `my_series` here is actually a DataFrame, not a Series):
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
Breakdown:
my_series.select_dtypes(include=['O'])
selects just the object-dtype (categorical) columns.
list(my_series.select_dtypes(include=['O']).columns)
turns those columns into a list.
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
iterates through that list and applies value_counts() to each of the columns.
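As a quick illustration, here is the same pattern on a tiny made-up DataFrame (the column names are hypothetical):
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],   # object dtype
                   "size": [1, 2, 2]})                # numeric, skipped by the selection

freqs = [df[c].value_counts() for c in df.select_dtypes(include=["O"]).columns]
print(freqs[0])
# red     2
# blue    1
# (the exact header/footer of the printout varies by pandas version)
Note that the list() wrapper is optional; .columns is already iterable.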

The answer provided by @DSM is simple and straightforward, but I thought I'd add my own input to this question. If you look at the code for pandas.value_counts, you'll see that there is a lot going on.
If you need to calculate the frequency for many Series, this could take a while. A faster implementation is to use numpy.unique with return_counts=True.
Here is an example:
import pandas as pd
import numpy as np
my_series = pd.Series([1,2,2,3,3,3])
print(my_series.value_counts())
3 3
2 2
1 1
dtype: int64
Notice here that the item returned is a pandas.Series. In comparison, numpy.unique returns a tuple with two items: the unique values and the counts.
vals, counts = np.unique(my_series, return_counts=True)
print(vals, counts)
[1 2 3] [1 2 3]
You can then combine these into a dictionary:
results = dict(zip(vals, counts))
print(results)
{1: 1, 2: 2, 3: 3}
And then into a pandas.Series:
print(pd.Series(results))
1 1
2 2
3 3
dtype: int64
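If you want to skip the intermediate dictionary, the two arrays can be fed straight into the Series constructor; and a rough timing comparison (machine- and version-dependent, so treat the idea rather than the numbers as the takeaway) might look like this:
from timeit import timeit

import numpy as np
import pandas as pd

my_series = pd.Series([1, 2, 2, 3, 3, 3])

# same frequency table, no intermediate dict
vals, counts = np.unique(my_series, return_counts=True)
freq = pd.Series(counts, index=vals)
print(freq)

# crude benchmark of the two approaches
print(timeit(lambda: my_series.value_counts(), number=1000))
print(timeit(lambda: np.unique(my_series, return_counts=True), number=1000))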

- 2,373
- 1
- 26
- 34
For the frequency distribution of a variable with a large number of distinct values, you can collapse the values into classes. Here the employrate variable has so many distinct values that a frequency distribution over the raw values with value_counts(normalize=True) is not meaningful:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
.. ... ... ...
208 Vietnam 71.000000 3.91
209 West Bank and Gaza 32.000000
210 Yemen, Rep. 39.000000 .2
211 Zambia 61.000000 3.56
212 Zimbabwe 66.800003 4.96
[213 rows x 3 columns]
The frequency distribution from value_counts(sort=False, normalize=True) with no classification has length 139 here, which is meaningless as a frequency distribution:
print(gm["employrate"].value_counts(sort=False,normalize=True))
50.500000 0.005618
61.500000 0.016854
46.000000 0.011236
64.500000 0.005618
63.500000 0.005618
58.599998 0.005618
63.799999 0.011236
63.200001 0.005618
65.599998 0.005618
68.300003 0.005618
Name: employrate, Length: 139, dtype: float64
Putting in a classification, we map all values within a certain range onto a single class, i.e. 0-10 as 0, 11-20 as 1, 21-30 as 2, and so forth, so that class n covers employment rates from 10*n to 10*(n+1) percent:
# employrate was read in as strings: strip whitespace, then convert to numeric
gm["employrate"] = gm["employrate"].str.strip().dropna()
gm["employrate"] = pd.to_numeric(gm["employrate"])

# collapse each 10-point range onto a single class label
gm['employrate'] = np.where((gm['employrate'] <= 10) & (gm['employrate'] > 0), 0, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 20) & (gm['employrate'] > 10), 1, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 30) & (gm['employrate'] > 20), 2, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 40) & (gm['employrate'] > 30), 3, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 50) & (gm['employrate'] > 40), 4, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 60) & (gm['employrate'] > 50), 5, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 70) & (gm['employrate'] > 60), 6, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 80) & (gm['employrate'] > 70), 7, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 90) & (gm['employrate'] > 80), 8, gm['employrate'])
gm['employrate'] = np.where((gm['employrate'] <= 100) & (gm['employrate'] > 90), 9, gm['employrate'])
print(gm["employrate"].value_counts(sort=False,normalize=True))
After classification we have a clear frequency distribution. Here we can easily see that 37.64% of countries have an employment rate between 51-60%, and 11.79% of countries have an employment rate between 71-80%:
5.000000 0.376404
7.000000 0.117978
4.000000 0.179775
6.000000 0.264045
8.000000 0.033708
3.000000 0.028090
Name: employrate, dtype: float64
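For what it's worth, the whole chain of np.where calls can be collapsed into a single pd.cut, which bins and labels in one step. A sketch, using a small made-up stand-in for the gm DataFrame above:
import numpy as np
import pandas as pd

# hypothetical stand-in for the gm DataFrame
gm = pd.DataFrame({"employrate": [55.7, 11.0, 61.0, 75.7, np.nan]})

# class n covers the range (10*n, 10*(n+1)], matching the np.where labels above
gm["employrate_class"] = pd.cut(gm["employrate"],
                                bins=np.arange(0, 101, 10),
                                labels=range(10))

print(gm["employrate_class"].value_counts(sort=False, normalize=True))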
