
I have a dataset with name (person_name), day, and color (shirt_color) as columns.

Each person wears a shirt of a certain color on a particular day. The number of days can be arbitrary.

E.g. input:

name  day  color
----------------
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White

I need to find the most frequently used color by each person.

E.g. result:

name    color
-------------
Jerry   Black
John    White
Tom     Blue

I am performing the following operation to get the results, which works fine but is quite slow:

# Take the modal color within each name group (works, but slow for many groups)
most_frequent_list = [[name, group.color.mode()[0]]
                      for name, group in data.groupby('name')]
most_frequent_df = pd.DataFrame(most_frequent_list, columns=['name', 'color'])

Now suppose I have a dataset with 5 million unique names. What is the best/fastest way to perform the above operation?


7 Answers


NumPy's numpy.add.at and pandas.factorize

This is intended to be fast. However, I tried to organize it to be readable as well.

import numpy as np
import pandas as pd

# Integer-encode names and colors; r and c hold the unique values
i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)

# One counting bin per (name, color) pair
b = np.zeros((n, m), dtype=np.int64)

# Unbuffered add: repeated (i, j) pairs are each counted
np.add.at(b, (i, j), 1)
pd.Series(c[b.argmax(1)], r)

John     White
Tom       Blue
Jerry    Black
dtype: object
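
If you want the result in the exact shape the question asks for, a small follow-up sketch (my addition; `s` is just a name I give the Series produced above):

s = pd.Series(c[b.argmax(1)], r)
# Turn the index into a 'name' column and label the values 'color'
result = s.rename('color').rename_axis('name').reset_index()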

groupby, size, and idxmax

df.groupby(['name', 'color']).size().unstack().idxmax(1)

name
Jerry    Black
John     White
Tom       Blue
dtype: object


Counter

¯\_(ツ)_/¯

from collections import Counter

df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0])

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
piRSquared
  • 1st: 362 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – DYZ Aug 22 '18 at 22:24
  • 2nd: 1.51 ms ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – DYZ Aug 22 '18 at 22:25
  • 3rd: 834 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – DYZ Aug 22 '18 at 22:26

UPDATE

It must be hard to beat this (~10 times faster on the sample DataFrame than any proposed pandas solution, and 1.5× faster than the proposed numpy solution). The gist is to stay away from pandas and use itertools.groupby, which does a much better job with non-numerical data.

from itertools import groupby
from collections import Counter

# Sort the rows so each name is contiguous, then take the most common
# color (the last field of each row) within every name group
pd.Series({name: Counter(row[-1] for row in rows).most_common(1)[0][0]
           for name, rows in groupby(sorted(df.values.tolist()),
                                     key=lambda row: row[0])})
# Jerry    Black
# John     White
# Tom       Blue
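
One caveat worth adding (my note, not part of the original answer): itertools.groupby only merges consecutive equal keys, which is why the rows are sorted first. A minimal sketch of the difference:

from itertools import groupby

rows = [('Tom', 'Blue'), ('John', 'White'), ('Tom', 'Black')]

# Without sorting, 'Tom' shows up as two separate groups
print([k for k, _ in groupby(rows, key=lambda r: r[0])])
# ['Tom', 'John', 'Tom']

# Sorting first makes each key contiguous, so every group is complete
print([k for k, _ in groupby(sorted(rows), key=lambda r: r[0])])
# ['John', 'Tom']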

Old Answer

Here's another method. It is actually slower than the original one, but I'll keep it here:

data.groupby('name')['color']\
    .apply(pd.Series.value_counts)\
    .unstack().idxmax(axis=1)
# name
# Jerry    Black
# John     White
# Tom       Blue
DYZ
  • Hah! I just did that. I'll delete – piRSquared Aug 22 '18 at 22:12
  • @piRSquared C'mon, keep it! Let the OP decide. – DYZ Aug 22 '18 at 22:13
  • @piRSquared Your `Counter` was still on the slower side because of `apply`. The point here is not to mess with pandas. – DYZ Aug 22 '18 at 22:14
  • I think the `collections` solution is fine, but calling it 10x faster than pandas/numpy is misleading. On a dataframe with even just a couple hundred rows, piRSquared's factorize solution beats it easily, and timings on a sample dataframe never mean much – user3483203 Aug 22 '18 at 22:16
  • @user3483203 Agree. I added a note that the 10x speedup is seen only on the sample dataframe. – DYZ Aug 22 '18 at 22:18
  • 1st: 252 µs ± 1.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) – DYZ Aug 22 '18 at 22:26
  • 2nd: 3.65 ms ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) – DYZ Aug 22 '18 at 22:27

Solution from pd.Series.mode

df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1, drop=True)

name
Jerry    Black
John     White
Tom       Blue
Name: color, dtype: object
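
A caveat worth adding (my note, not from the original answer): pd.Series.mode returns every value tied for the highest count, so a group with a tie produces multiple rows here. If you always want exactly one color per name, one hedged workaround is to take the first mode explicitly:

# Picks the first modal value deterministically, even when counts tie
df.groupby('name').color.agg(lambda s: s.mode().iat[0])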
BENY

How about doing two groupings with transform(max)?

# Count occurrences of each (name, color) pair; the 'day' column becomes the count
df = df.groupby(["name", "color"], as_index=False, sort=False).count()
# Keep the rows whose count equals the per-name maximum
idx = df.groupby("name", sort=False).transform(max)["day"] == df["day"]
df = df[idx][["name", "color"]].reset_index(drop=True)

Output:

    name  color
0   John  White
1    Tom   Blue
2  Jerry  Black
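
Worth noting (my addition): if a person has two colors tied for the maximum count, the filter keeps both rows. A hedged way to force a single row per name is to drop duplicates afterwards:

# Keep only the first winning color per name when counts tie
df = df.drop_duplicates(subset="name").reset_index(drop=True)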
André C. Andersen
  • 12.2 ms ± 48.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) – DYZ Aug 22 '18 at 22:21
  • Thank you. Not very good then, from what I see the others are getting. As feedback on the time testing: if you're looping over the same small dataset many times, it might not be comparable to running once on a large dataset. Many solutions like this have a high startup cost but perform well once they start processing. Looping over a small dataset many times means you might just be measuring that startup cost, which should be a one-time cost. I suggest increasing the size of the test dataset until one loop takes a few seconds to run. – André C. Andersen Aug 23 '18 at 09:14

Similar to @piRSquared's pd.factorize and np.add.at answer.

We encode the strings in the columns using

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
n, m = len(r), len(c)
b = np.zeros((n, m), dtype=np.int64)

But then, instead of doing this:

np.add.at(b, (i, j), 1)
max_columns_after_add_at = b.argmax(1)

We get max_columns_after_add_at using a JIT-compiled function that does the add-at and finds the per-row maximum in the same loop:

import numpy as np
import numba as nb

@nb.jit(nopython=True, cache=True)
def add_at(x, rows, cols, val):
    # Running maximum count and its column index for every row of x
    max_vals = np.zeros((x.shape[0],), np.int64)
    max_inds = np.zeros((x.shape[0],), np.int64)
    for i in range(len(rows)):
        r = rows[i]
        c = cols[i]
        x[r, c] += val
        if x[r, c] > max_vals[r]:
            max_vals[r] = x[r, c]
            max_inds[r] = c
    return max_inds

And then build the result Series at the end:

ans = pd.Series(c[max_columns_after_add_at], r)

So the difference is that the argmax(axis=1) step is fused into the counting loop instead of running as a separate pass after np.add.at().
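
Putting the pieces together, a minimal end-to-end assembly of the snippets above (my stitching of the answer's fragments, not the author's verbatim code):

i, r = pd.factorize(df.name)
j, c = pd.factorize(df.color)
b = np.zeros((len(r), len(c)), dtype=np.int64)

# Counting and the per-row argmax happen in one JIT-compiled pass
max_columns_after_add_at = add_at(b, i, j, 1)
ans = pd.Series(c[max_columns_after_add_at], r)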

Timing analysis

import numpy as np
import numba as nb
m = 100000
n = 100000
rows = np.random.randint(low = 0, high = m, size=10000)
cols = np.random.randint(low = 0, high = n, size=10000)

So this:

%%time
x = np.zeros((m,n))
np.add.at(x, (rows, cols), 1)
maxs = x.argmax(1)

gives:

CPU times: user 12.4 s, sys: 38 s, total: 50.4 s
Wall time: 50.5 s

And this

%%time
x = np.zeros((m,n))
maxs2 = add_at(x, rows, cols, 1)

gives

CPU times: user 108 ms, sys: 39.4 s, total: 39.5 s
Wall time: 38.4 s

Deepak Saini

Most of the test results discussed in the other answers are skewed because they were measured on a trivially small test DataFrame. Pandas has some fixed but generally negligible setup time, which looks significant next to the cost of processing such a tiny dataset.

On a larger dataset, the fastest method is using pd.Series.mode() with agg():

df.groupby('name')['color'].agg(pd.Series.mode)

Test bench:

arr = np.array([
    ('John',   1,   'White'),
    ('John',   2,  'White'),
    ('John',   3,   'Blue'),
    ('John',   4,   'Blue'),
    ('John',   5,   'White'),
    ('Tom',    2,   'White'),
    ('Tom',    3,   'Blue'),
    ('Tom',    4,   'Blue'),
    ('Tom',    5,   'Black'),
    ('Jerry',  1,   'Black'),
    ('Jerry',  2,   'Black'),
    ('Jerry',  4,   'Black'),
    ('Jerry',  5,   'White')],
    dtype=[('name', 'O'), ('day', 'i8'), ('color', 'O')])

from timeit import Timer
from itertools import groupby
from collections import Counter

df = pd.DataFrame.from_records(arr).sample(100_000, replace=True)

def factorize():
    i, r = pd.factorize(df.name)
    j, c = pd.factorize(df.color)
    n, m = len(r), len(c)

    b = np.zeros((n, m), dtype=np.int64)

    np.add.at(b, (i, j), 1)
    return pd.Series(c[b.argmax(1)], r)

t_factorize = Timer(lambda: factorize())
t_idxmax = Timer(lambda: df.groupby(['name', 'color']).size().unstack().idxmax(1))
t_aggmode = Timer(lambda: df.groupby('name')['color'].agg(pd.Series.mode))
t_applymode = Timer(lambda: df.groupby('name').color.apply(pd.Series.mode).reset_index(level=1,drop=True))
t_aggcounter = Timer(lambda: df.groupby('name')['color'].agg(lambda c: Counter(c).most_common(1)[0][0]))
t_applycounter = Timer(lambda: df.groupby('name').color.apply(lambda c: Counter(c).most_common(1)[0][0]))
t_itertools = Timer(lambda: pd.Series(
    {x: Counter(z[-1] for z in y).most_common(1)[0][0] for x,y
      in groupby(sorted(df.values.tolist()), key=lambda x: x[0])}))

n = 100
[print(r) for r in (
    f"{t_factorize.timeit(number=n)=}",
    f"{t_idxmax.timeit(number=n)=}",
    f"{t_aggmode.timeit(number=n)=}",
    f"{t_applymode.timeit(number=n)=}",
    f"{t_applycounter.timeit(number=n)=}",
    f"{t_aggcounter.timeit(number=n)=}",
    f"{t_itertools.timeit(number=n)=}",
)]
t_factorize.timeit(number=n)=1.325189442
t_idxmax.timeit(number=n)=1.0613339019999999
t_aggmode.timeit(number=n)=1.0495010750000002
t_applymode.timeit(number=n)=1.2837302849999999
t_applycounter.timeit(number=n)=1.9432825890000007
t_aggcounter.timeit(number=n)=1.8283823839999993
t_itertools.timeit(number=n)=7.0855046380000015
wst

For those who want to convert the above table into a DataFrame and try the posted answers, you can use this snippet. Copy and paste the table above into a notebook cell as shown below, making sure to remove the row of hyphens under the header.

l = """name  day  color
John   1   White
John   2   White
John   3   Blue
John   4   Blue
John   5   White
Tom    2   White
Tom    3   Blue
Tom    4   Blue
Tom    5   Black
Jerry  1   Black
Jerry  2   Black
Jerry  4   Black
Jerry  5   White""".split('\n')

Now we split each line into a tuple of fields, build a DataFrame, and promote the first row to the header:

df = pd.DataFrame([tuple(i.split()) for i in l])
headers = df.iloc[0]
new_df  = pd.DataFrame(df.values[1:], columns=headers)

Now use new_df, and you can try the answers above, e.g. @piRSquared's.
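
Alternatively (my addition, not part of the original answer), pandas can parse the pasted text directly with read_csv and a whitespace separator, which skips the manual header handling:

import io
import pandas as pd

text = """name  day  color
John   1   White
Tom    2   White
Jerry  1   Black"""  # paste the full table here

# sep=r'\s+' treats any run of whitespace as the column delimiter
new_df = pd.read_csv(io.StringIO(text), sep=r'\s+')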