I am seeing very slow performance when calling groupby and apply on a pandas DataFrame (>100x slower than the equivalent pure-Python code). My data is a set of nested lists of varying lengths but fixed nesting depth, which I convert to a DataFrame by adding one column per list index:
import pandas as pd
from random import randint
# original data: 3 x 500 x (1 to 5) x (1 to 3) nested lists of random ints
data1 = [[[[randint(0, 10) for i in range(randint(1, 3))] for i in range(randint(1, 5))] for i in range(500)] for i in range(3)]
# as a DataFrame: one row per innermost element, tagged with its four indices
data2 = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns=['i1', 'i2', 'i3', 'i4', 'x']
)
# with the first three indices moved into a sorted MultiIndex
data3 = data2.set_index(['i1', 'i2', 'i3']).sort_index()
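As a sanity check (just a sketch, not part of the timings below), the flattening keeps exactly one row per innermost element:

# total innermost elements in data1 must equal the row count of data2
assert len(data2) == sum(len(x3) for x1 in data1 for x2 in x1 for x3 in x2)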
Example data:
>>> data3
i4 x
i1 i2 i3
0 0 0 0 8
0 1 0
0 2 4
1 0 4
2 0 7
3 0 6
4 0 10
4 1 1
4 2 8
1 0 0 8
0 1 9
0 2 1
1 0 5
2 0 9
2 0 0 1
1 0 1
1 1 4
1 2 0
2 0 6
2 1 10
2 2 8
3 0 4
3 1 5
4 0 3
4 1 6
3 0 0 9
0 1 8
0 2 7
1 0 2
1 1 9
... .. ..
2 495 0 0 1
0 1 6
0 2 5
1 0 1
1 1 8
1 2 6
496 0 0 4
0 1 8
0 2 3
497 0 0 3
0 1 10
1 0 9
2 0 6
2 1 1
2 2 3
3 0 0
4 0 10
498 0 0 9
0 1 1
1 0 2
1 1 10
2 0 2
2 1 2
2 2 2
3 0 9
499 0 0 0
0 1 2
1 0 2
1 1 8
2 0 6
[8901 rows x 2 columns]
I want to apply a function to each innermost list. In the example below the function happens to operate on each row individually, but my real code needs to use the group as a whole, so groupby/apply is required.
%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 7.52 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 4.02 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 8.86 s per loop
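For reference, this toy function could of course be computed without any grouping at all (a sketch that only works because the example function is row-wise; my real function needs the whole group, so this is not an option for me, but it suggests the per-group dispatch rather than the arithmetic is what costs the time):

# fully vectorized version of the toy example, no groupby involved
result4 = data2['i4'] * data2['x']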
The code using pandas is orders of magnitude slower than working with the lists directly. Could someone point out what I am doing wrong? I am using pandas 0.18.1.