220

I have a large (about 12M rows) DataFrame df:

df.columns = ['word','documents','frequency']

The following ran in a timely fashion:

word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']

However, this is taking an unexpectedly long time to run:

Occurrences_of_Words = word_grouping[['word']].count().reset_index()

What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?

df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.

Mateen Ulhaq
tipanverella

4 Answers

326

I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)

In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
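
For illustration, a minimal sketch of the suggestion using a toy DataFrame standing in for the 12M-row one in the question (the values are made up):

import pandas as pd

# Toy stand-in for the question's 12M-row DataFrame
df = pd.DataFrame({'word': ['apple', 'banana', 'apple', 'cherry', 'apple', 'banana'],
                   'frequency': [10, 2, 7, 1, 3, 5]})

# One pass over the column, no groupby machinery needed
Occurrences_of_Words = df['word'].value_counts()
print(Occurrences_of_Words)
# apple     3
# banana    2
# cherry    1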

Dan Allan
  • Thanks. I also found this useful for speeding up counting a specific value in a series, e.g. `df.word.value_counts()['myword']` is about twice as fast as `len(df[df.word == 'myword'])`. – fantabolous Mar 10 '15 at 13:35
  • What about counting over the entire DataFrame? This works for one column. – Vaidøtas I. Jan 23 '20 at 08:49
  • To answer my own question (figured it out): the .stack() function – Vaidøtas I. Jan 23 '20 at 08:57
  • @Vaidøtas Ivøška, I have been struggling with how to use it. Could you give an example? What if 'myword' is not in the column? It then raises a KeyError. – Newbielp Feb 21 '20 at 14:34
  • @Newbielp, I did this: `df[[i for i in column_names]].astype('str').stack().value_counts().sum()`, which equates to setting each selected column to str type, stacking all the columns on top of each other to form basically one column, and then doing value_counts() and sum() on that one column. :) Stack is pretty useful; it might not be the most obvious choice, but it worked like a charm for my use case :) – Vaidøtas I. Feb 21 '20 at 20:06
  • To add to @fantabolous' comment, use the `.get()` method if there is a possibility that the value does not exist in the column. `.get()` will return `None` in that case, whereas the bracket method will throw an error. – elPastor Jan 18 '21 at 16:38
  • To limit the number of results to, say, the top 10 highest occurrences, use `df['word'].value_counts().nlargest(10)`. – cristiandatum Jan 29 '21 at 07:31
  • How do you count the occurrences of each column's values when there are N columns in a DataFrame? – Shihab Ullah Jul 11 '22 at 00:31
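
Pulling the patterns from these comments together, a minimal sketch (the column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'word':  ['apple', 'banana', 'apple'],
                   'other': ['apple', 'cherry', 'apple']})

counts = df['word'].value_counts()

# Fast lookup of a single value's count; raises KeyError if the value is absent
print(counts['apple'])           # 2

# Safe lookup: .get() returns a default (None unless specified) instead of raising
print(counts.get('missing', 0))  # 0

# Counting values across several columns: stack them into one Series first
print(df[['word', 'other']].astype(str).stack().value_counts())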
32

When you want to count the frequency of categorical data in a column of a pandas DataFrame, use: df['Column_Name'].value_counts()

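As a side note, if the column actually uses pandas' Categorical dtype (an assumption beyond the loose sense of "categorical data" above), value_counts also reports categories that never occur, with a count of 0. A minimal sketch:

import pandas as pd

# Hypothetical column declared with an explicit Categorical dtype
s = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']))

print(s.value_counts())
# a    2
# b    1
# c    0   <- unused category is still listed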

Jeru Luke
25

Just an addition to the previous answers. Let's not forget that when dealing with real data there might be null values, so it's useful to also include those in the count by using the option dropna=False (the default is True).

An example:

>>> df['Embarked'].value_counts(dropna=False)
S      644
C      168
Q       77
NaN      2
user2314737
3

Other approaches to counting occurrences are to use (i) Counter from the collections module, (ii) unique from the numpy library, and (iii) groupby + size in pandas.

To use collections.Counter:

from collections import Counter
out = pd.Series(Counter(df['word']))

To use numpy.unique:

import numpy as np
i, c = np.unique(df['word'], return_counts = True)
out = pd.Series(c, index = i)

To use groupby + size:

out = pd.Series(df.index, index=df['word']).groupby(level=0).size()

One very nice feature of value_counts that's missing in the above methods is that it sorts the counts. If having the counts sorted is absolutely necessary, then value_counts is the best method given its simplicity and performance (even though it still gets marginally outperformed by other methods especially for very large Series).
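
If you use one of the alternatives but still need the counts sorted, a minimal sketch of sorting afterwards (shown with the Counter approach on toy data; the same applies to the other two):

import pandas as pd
from collections import Counter

df = pd.DataFrame({'word': ['apple', 'banana', 'apple', 'cherry', 'apple', 'banana']})

# Unsorted counts from the Counter-based approach above
out = pd.Series(Counter(df['word']))

# Sort descending to match the ordering value_counts() gives by default
out_sorted = out.sort_values(ascending=False)
print(out_sorted)
# apple     3
# banana    2
# cherry    1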

Benchmarks (if having the counts sorted is not important):

Looking at runtimes, the fastest method depends on the dtype of the data stored in the DataFrame column/Series.

If the Series is dtype object, then the fastest method for very large Series is collections.Counter, but in general value_counts is very competitive.

[benchmark plot: runtimes for an object-dtype Series]

However, if it is dtype int, then the fastest method is numpy.unique:

[benchmark plot: runtimes for an int-dtype Series]

Code used to produce the plots:

import perfplot
import numpy as np
import pandas as pd
from collections import Counter

def creator(n, dt='obj'):
    # Build a Series of n random integers; cast to str for an object-dtype Series
    s = pd.Series(np.random.randint(2*n, size=n))
    return s.astype(str) if dt=='obj' else s

def plot_perfplot(datatype):
    perfplot.show(
        setup = lambda n: creator(n, datatype),
        # The four counting approaches discussed above
        kernels = [lambda s: s.value_counts(),
                   lambda s: pd.Series(Counter(s)),
                   lambda s: pd.Series((ic := np.unique(s, return_counts=True))[1], index = ic[0]),
                   lambda s: pd.Series(s.index, index=s).groupby(level=0).size()
                  ],
        labels = ['value_counts', 'Counter', 'np_unique', 'groupby_size'],
        n_range = [2 ** k for k in range(5, 25)],
        # Compare every kernel's output against the first one, aligning on the index
        equality_check = lambda *x: (d := pd.concat(x, axis=1)).eq(d[0], axis=0).all().all(),
        xlabel = '~len(s)',
        title = f'dtype {datatype}'
    )

plot_perfplot('obj')
plot_perfplot('int')