I am seeing very slow performance when calling groupby and apply on a pandas DataFrame (>100x slower than the equivalent pure-Python code). My data is a set of nested lists of varying lengths but fixed nesting depth, which I convert to a DataFrame by adding one column per list index:
import pandas as pd
from random import randint
# original data: 3 x 500 x (1 to 5) x (1 to 3) nested lists of random ints
data1 = [[[[randint(0, 10) for i in range(randint(1, 3))] for i in range(randint(1, 5))] for i in range(500)] for i in range(3)]
# as a DataFrame: one row per innermost element, tagged with its four indices
data2 = pd.DataFrame(
    [
        (i1, i2, i3, i4, x4)
        for (i1, x1) in enumerate(data1)
        for (i2, x2) in enumerate(x1)
        for (i3, x3) in enumerate(x2)
        for (i4, x4) in enumerate(x3)
    ],
    columns=['i1', 'i2', 'i3', 'i4', 'x']
)
# with the first three indices moved into a sorted MultiIndex
data3 = data2.set_index(['i1', 'i2', 'i3']).sort_index()
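As a sanity check (just a sketch, not part of the timings below), the flattening keeps exactly one row per innermost element:

# total innermost elements in data1 must equal the row count of data2
assert len(data2) == sum(len(x3) for x1 in data1 for x2 in x1 for x3 in x2)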
Example data:
>>> data3
i4 x
i1 i2 i3
0 0 0 0 8
0 1 0
0 2 4
1 0 4
2 0 7
3 0 6
4 0 10
4 1 1
4 2 8
1 0 0 8
0 1 9
0 2 1
1 0 5
2 0 9
2 0 0 1
1 0 1
1 1 4
1 2 0
2 0 6
2 1 10
2 2 8
3 0 4
3 1 5
4 0 3
4 1 6
3 0 0 9
0 1 8
0 2 7
1 0 2
1 1 9
... .. ..
2 495 0 0 1
0 1 6
0 2 5
1 0 1
1 1 8
1 2 6
496 0 0 4
0 1 8
0 2 3
497 0 0 3
0 1 10
1 0 9
2 0 6
2 1 1
2 2 3
3 0 0
4 0 10
498 0 0 9
0 1 1
1 0 2
1 1 10
2 0 2
2 1 2
2 2 2
3 0 9
499 0 0 0
0 1 2
1 0 2
1 1 8
2 0 6
[8901 rows x 2 columns]
I want to apply a function to each innermost list. In the example below the function happens to operate on each row individually, but my real code needs to use the group as a whole, so groupby/apply is required.
%timeit result1 = [[[[i4*x4 for (i4, x4) in enumerate(x3)] for x3 in x2] for x2 in x1] for x1 in data1]
# 100 loops, best of 3: 7.52 ms per loop
%timeit result2 = data2.groupby(['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 4.02 s per loop
%timeit result3 = data3.groupby(level = ['i1', 'i2', 'i3']).apply(lambda group: group['i4']*group['x'])
# 1 loop, best of 3: 8.86 s per loop
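For reference, this toy function could of course be computed without any grouping at all (a sketch that only works because the example function is row-wise; my real function needs the whole group, so this is not an option for me, but it suggests the per-group dispatch rather than the arithmetic is what costs the time):

# fully vectorized version of the toy example, no groupby involved
result4 = data2['i4'] * data2['x']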
The code using pandas is orders of magnitude slower than working with the lists directly. Could someone point out what I am doing wrong? I am using pandas 0.18.1.