There are many questions (1, 2, 3) dealing with counting values in a single series.
However, there are fewer questions looking at the best way to count combinations of two or more series. Solutions are presented (1, 2), but when and why one should use each is not discussed.
Below is some benchmarking for three potential methods. I have two specific questions:
- Why is
grouper
more efficient thancount
? I expectedcount
to be the more efficient, as it is implemented in C. The superior performance ofgrouper
persists even if number of columns is increased from 2 to 4. - Why does
value_counter
underperformgrouper
by so much? Is this due to the cost of constructing a list, or series from list?
I understand the outputs are different, and this should also inform choice. For example, filtering by count is more efficient with contiguous numpy
arrays versus a dictionary comprehension:
x, z = grouper(df), count(df)
%timeit x[x.values > 10] # 749µs
%timeit {k: v for k, v in z.items() if v > 10} # 9.37ms
However, the focus of my question is on performance of building comparable results in a series versus dictionary. My C knowledge is limited, yet I would appreciate any answer which can point to the logic underlying these methods.
Benchmarking code
import pandas as pd
import numpy as np
from collections import Counter
np.random.seed(0)
m, n = 1000, 100000
df = pd.DataFrame({'A': np.random.randint(0, m, n),
'B': np.random.randint(0, m, n)})
def grouper(df):
return df.groupby(['A', 'B'], sort=False).size()
def value_counter(df):
return pd.Series(list(zip(df.A, df.B))).value_counts(sort=False)
def count(df):
return Counter(zip(df.A.values, df.B.values))
x = value_counter(df).to_dict()
y = grouper(df).to_dict()
z = count(df)
assert (x == y) & (y == z), "Dictionary mismatch!"
for m, n in [(100, 10000), (1000, 10000), (100, 100000), (1000, 100000)]:
df = pd.DataFrame({'A': np.random.randint(0, m, n),
'B': np.random.randint(0, m, n)})
print(m, n)
%timeit grouper(df)
%timeit value_counter(df)
%timeit count(df)
Benchmarking results
Run on python 3.6.2, pandas 0.20.3, numpy 1.13.1
Machine specs: Windows 7 64-bit, Dual-Core 2.5 GHz, 4GB RAM.
Key: g = grouper
, v = value_counter
, c = count
.
m n g v c
100 10000 2.91 18.30 8.41
1000 10000 4.10 27.20 6.98[1]
100 100000 17.90 130.00 84.50
1000 100000 43.90 309.00 93.50
1 This is not a typo.