I have a dataset
category
cat a
cat b
cat a
I'd like to return something like the following, which shows the unique values and their frequencies:
category freq
cat a 2
cat b 1
Use value_counts(), as @DSM commented.
In [37]:
df = pd.DataFrame({'a':list('abssbab')})
df['a'].value_counts()
Out[37]:
b 3
a 2
s 2
dtype: int64
Also groupby and count. Many ways to skin a cat here.
In [38]:
df.groupby('a').count()
Out[38]:
a
a
a 2
b 3
s 2
[3 rows x 1 columns]
See the online docs.
If you wanted to add the frequency back to the original DataFrame, use transform to return an aligned index:
In [41]:
df['freq'] = df.groupby('a')['a'].transform('count')
df
Out[41]:
a freq
0 a 2
1 b 3
2 s 2
3 s 2
4 b 3
5 a 2
6 b 3
[7 rows x 2 columns]
If you want to apply this to all columns, you can use:
df.apply(pd.value_counts)
This will apply a column-based aggregation function (in this case value_counts) to each of the columns.
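For example, a minimal sketch with a made-up two-column frame: each column gets its own column of counts, and values that never appear in a column come out as NaN.
import pandas as pd

df = pd.DataFrame({'a': list('abssbab'), 'b': list('xxyyxxy')})
df.apply(pd.value_counts)   # one count column per original column; absent values show as NaN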
df.category.value_counts()
This short little line of code will give you the output you want.
If your column name has spaces, you can use
df['category'].value_counts()
df.apply(pd.value_counts).fillna(0)
value_counts - returns an object containing counts of unique values
apply - counts the frequency in every column; if you set axis=1, you get the frequency in every row
fillna(0) - tidies up the output by replacing NaN with 0
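A minimal sketch of the row-wise variant (throwaway frame, only to show axis=1):
import pandas as pd

df = pd.DataFrame({'a': list('abss'), 'b': list('abba')})
df.apply(pd.value_counts, axis=1).fillna(0)   # frequency of each value within every row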
In 0.18.1, groupby together with count does not give the frequency of unique values:
>>> df
a
0 a
1 b
2 s
3 s
4 b
5 a
6 b
>>> df.groupby('a').count()
Empty DataFrame
Columns: []
Index: [a, b, s]
However, the unique values and their frequencies are easily determined using size:
>>> df.groupby('a').size()
a
a 2
b 3
s 2
With df.a.value_counts(), the counts are returned sorted in descending order (largest count first) by default.
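If you need a different order, the regular value_counts keywords cover it; a small sketch:
df.a.value_counts(ascending=True)   # least frequent first
df.a.value_counts().sort_index()    # sorted by the values themselves rather than by count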
Using a list comprehension and value_counts for multiple columns in a df:
[my_series[c].value_counts() for c in list(my_series.select_dtypes(include=['O']).columns)]
If your DataFrame has values with the same type, you can also set return_counts=True in numpy.unique().
index, counts = np.unique(df.values, return_counts=True)
np.bincount() could be faster if your values are integers.
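A minimal bincount sketch, assuming a column of small non-negative integers (the data here is made up):
import numpy as np
import pandas as pd

s = pd.Series([1, 1, 3, 2, 1, 3])   # hypothetical integer column
counts = np.bincount(s)             # counts[i] is the number of times the value i occurs
freq = pd.Series(counts)
freq = freq[freq > 0]               # keep only values that actually appear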
As everyone said, the quickest solution is to do:
df.column_to_analyze.value_counts()
But if you want to use the output in your dataframe, with this schema:
df input:
category
cat a
cat b
cat a
df output:
category counts
cat a 2
cat b 1
cat a 2
you can do this:
df['counts'] = df.category.map(df.category.value_counts())
df
Without any libraries, you could do this instead:
def to_frequency_table(data):
    frequencytable = {}
    for key in data:
        if key in frequencytable:
            frequencytable[key] += 1
        else:
            frequencytable[key] = 1
    return frequencytable
Example:
to_frequency_table([1,1,1,1,2,3,4,4])
>>> {1: 4, 2: 1, 3: 1, 4: 2}
You can also do this with pandas by first casting your columns to the category dtype (dtype="category"), e.g.
cats = ['client', 'hotel', 'currency', 'ota', 'user_country']
df[cats] = df[cats].astype('category')
and then calling describe:
df[cats].describe()
This will give you a nice table of value counts and a bit more:
        client   hotel  currency     ota  user_country
count   852845  852845    852845  852845        852845
unique    2554   17477       132      14           219
top       2198   13202       USD   Hades            US
freq    102562    8847    516500  242734        340992
I believe this should work fine for any DataFrame column list.
def column_list(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return pd.DataFrame(column_list_df).rename(columns={0: "Feature", 1: "Value_count"})
The function column_list iterates over the column names and counts the unique values in each column.
@metatoaster has already pointed this out.
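A quick usage sketch of the helper above (df being whatever DataFrame you are inspecting):
column_list(df)   # one row per column: its name and its number of unique values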
Go for Counter. It's blazing fast.
import pandas as pd
from collections import Counter
import timeit
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10000, (100, 2)), columns=["NumA", "NumB"])
%timeit -n 10000 df['NumA'].value_counts()
# 10000 loops, best of 3: 715 µs per loop
%timeit -n 10000 df['NumA'].value_counts().to_dict()
# 10000 loops, best of 3: 796 µs per loop
%timeit -n 10000 Counter(df['NumA'])
# 10000 loops, best of 3: 74 µs per loop
%timeit -n 10000 df.groupby(['NumA']).count()
# 10000 loops, best of 3: 1.29 ms per loop
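If you still want a pandas object afterwards, the Counter can be wrapped back into a Series; a small follow-up sketch on the same frame:
freq = pd.Series(Counter(df['NumA'])).sort_values(ascending=False)   # back to a value_counts-style Series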
Cheers!
The following code creates a frequency table for the various values in a column called "Total_score" in a dataframe called "smaller_dat1", and then returns the number of times the value 300 appears in the column.
valuec = smaller_dat1.Total_score.value_counts()
valuec.loc[300]
n_values = data.income.value_counts()

# first unique value count
n_at_most_50k = n_values[0]

# second unique value count
n_greater_50k = n_values[1]

n_values
Output:
<=50K    34014
>50K     11208
Name: income, dtype: int64
Output of n_greater_50k, n_at_most_50k:
(11208, 34014)
Your data:
category
cat a
cat b
cat a
Solution:
df['freq'] = df.groupby('category')['category'].transform('count')
df = df.drop_duplicates()
value_counts() is the way to go. All other methods are inferior. On top of being idiomatic and easy to call, here are a couple more reasons why it should be used.
If you look at the performance plots below, for most of the native pandas dtypes, value_counts() is the most efficient (or equivalent) option.[1] In particular, it's faster than both groupby.size and groupby.count for all dtypes.
You can not only count the frequency of each value, you can also bin them in one go. For other options, you'd have to go through an additional pd.cut step to get the same output. For example,
df = pd.DataFrame({'col': range(100)})
bins = [-float('inf'), 33, 66, float('inf')]
df['col'].value_counts(bins=bins, sort=False)
(-inf, 33.0] 34
(33.0, 66.0] 33
(66.0, inf] 33
Name: col, dtype: int64
groupby.count or groupby.size
The main difference between groupby.count and groupby.size is that count counts only non-NaN values, while size returns the length (which includes NaN) if the column has NaN values.
value_counts() is equivalent to groupby.count by default, but becomes equivalent to groupby.size with dropna=False, i.e. df['col'].value_counts(dropna=False).
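A small sketch of that distinction, using throwaway data with a NaN in it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, np.nan, 2.0, 3.0]})

df.groupby('key')['val'].count()   # a: 1, b: 2 -- the NaN in 'val' is not counted
df.groupby('key')['val'].size()    # a: 2, b: 2 -- plain row count per group, NaN included

s = pd.Series(['a', 'a', 'b', np.nan])
s.value_counts()                   # a: 2, b: 1          (NaN dropped)
s.value_counts(dropna=False)       # a: 2, b: 1, NaN: 1  (NaN kept)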
[1] Code used to produce the perfplot:
import numpy as np
import pandas as pd
from collections import Counter
import perfplot
import matplotlib.pyplot as plt
gen = lambda N: pd.DataFrame({'col': np.random.default_rng().integers(N, size=N)})
setup_funcs = [
    ('numeric', lambda N: gen(N)),
    ('nullable integer dtype', lambda N: gen(N).astype('Int64')),
    ('object', lambda N: gen(N).astype(str)),
    ('string (extension dtype)', lambda N: gen(N).astype('string')),
    ('categorical', lambda N: gen(N).astype('category')),
    ('datetime', lambda N: pd.DataFrame({'col': np.resize(pd.date_range('2020', '2024', N//10+1), N)}))
]
fig, axs = plt.subplots(3, 2, figsize=(15, 15), facecolor='white', constrained_layout=True)
for i, funcs in enumerate(zip(*[iter(setup_funcs)]*2)):
    for j, (label, func) in enumerate(funcs):
        plt.sca(axs[i, j])
        perfplot.plot(
            setup=func,
            kernels=[
                lambda df: df['col'].value_counts(sort=False),
                lambda df: pd.Series(*reversed(np.unique(df['col'], return_counts=True))),
                lambda df: pd.Series(Counter(df['col'])),
                lambda df: df.groupby('col').size(),
                lambda df: df.groupby('col')['col'].count()
            ],
            labels=['value_counts', 'np.unique', 'Counter', 'groupby.size', 'groupby.count'],
            n_range=[2**k for k in range(21)],
            xlabel='len(df)',
            title=f'Count frequency in {label} column',
            equality_check=lambda x, y: x.eq(y.loc[x.index]).all()
        );