I'm using pandas and noticed a HUGE difference in performance between these two statements:

df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized

and

df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...

The first example uses numpy.sum, which is vectorized, as stated in the documentation.

Unfortunately, sum(level=...) is deprecated in the API and produces an ugly warning:

FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum()

I don't want to use the non-vectorized version and suffer poor processing performance. How can I use numpy.sum along with groupby?
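
For now I can silence just this warning as a stopgap, but that obviously doesn't address the deprecation itself. A minimal sketch, assuming pandas 1.x (where sum(level=...) still exists) and a made-up toy frame:

import warnings

import numpy as np
import pandas as pd

# Toy frame with a 2-level index (hypothetical data, for illustration only)
df = pd.DataFrame({'My': [0, 0, 1], 'Index': [1, 2, 2],
                   'VALUE': np.random.rand(3)}).set_index(['My', 'Index'])

# Suppress only this FutureWarning around the deprecated call
with warnings.catch_warnings():
    warnings.simplefilter('ignore', FutureWarning)
    result = df.sum(level=['My', 'Index'])
print(result)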

Edit: following the comments, here is a basic test I did: pandas 1.4.4, 10k random lines, 10 index levels.

import pandas as pd
import numpy as np
print('pandas:', pd.__version__)

nb_lines = int(1e4)
nb_levels = 10
# 10k rows of random integers in [0, 9] (randint's upper bound is exclusive)
ix = np.random.randint(0, nb_levels, size=(nb_lines, nb_levels))

cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate

print('with groupby:')
%timeit -n 500 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 500 df.sum(level=cols)

And the result is:

pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)

This is just one example, but the version without groupby is consistently faster. Changing the parameters (number of levels, step size for the columns to group on, etc.) does not change the outcome. On a big data set, the difference between the two methods (numpy.sum vs. the rest) becomes clearly visible.
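
For anyone reproducing this outside IPython, here is a standalone script version of the same benchmark; a sketch assuming pandas 1.x, where sum(level=...) is still available. nb_lines and nb_levels are the knobs to vary:

import timeit

import numpy as np
import pandas as pd

nb_lines, nb_levels = 10_000, 10
cols = [chr(65 + i) for i in range(nb_levels)]  # A, B, C, ...
ix = np.random.randint(0, nb_levels, size=(nb_lines, nb_levels))
df = pd.DataFrame(ix, columns=cols).set_index(cols)
df['VALUE'] = np.random.rand(nb_lines)  # random values to aggregate

# best of 5 repeats, 50 calls each, reported per call
for label, stmt in [('with groupby', 'df.groupby(level=cols).sum()'),
                    ('without groupby', 'df.sum(level=cols)')]:
    t = min(timeit.repeat(stmt, globals=globals(), number=50, repeat=5)) / 50
    print(f'{label}: {t * 1e3:.2f} ms per loop')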

@mozway, your results indicate similar performance; however, if you increase the number of levels, you should see the groupby version worsen, at least that's the result on my computer. See the edited code below, where you can change the number of levels (example with 10 levels and 100k lines):

import pandas as pd
import numpy as np
from string import ascii_uppercase as UP

np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]

d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)

print(pd.__version__)

print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)

... and the result:

1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)

Thanks

Carmellose
  • slow in what sense? under the hood the sum with level does a groupby – sammywemmy Oct 26 '22 at 20:19
  • @sammywemmy sum with level uses numpy.sum which is vectorized, whereas the explicit groupby with sum does not – Carmellose Oct 26 '22 at 20:26
  • that is incorrect. groupby's sum is vectorised. Have a look at the docs and even the underlying code ... it is open source so you can verify that easily – sammywemmy Oct 26 '22 at 20:31
  • I suggest you add a sample dataset to your question, see here how to do it: https://stackoverflow.com/a/30424537/463796 - then an objective measurement can be taken. – w-m Oct 26 '22 at 20:32
  • What huge performance difference? Tested with 1m rows. Avg. 306ms with `sum(level=...)` and 303ms with `groupby(...).sum()`. I'd call them equal – Code Different Oct 26 '22 at 20:46
  • @w-m I have added a sample test, thanks – Carmellose Oct 26 '22 at 21:14

1 Answer

This doesn't seem to be true; both approaches have similar speed.

Setup (3 levels, 26 groups each, ~18k combinations of groups, 1M rows):

import pandas as pd
import numpy as np
from string import ascii_uppercase as UP

np.random.seed(0)
N = 1_000_000

cols = ['A', 'B', 'C']
df = pd.DataFrame({'A': np.random.choice(list(UP), size=N),
                   'B': np.random.choice(list(UP), size=N),
                   'C': np.random.choice(list(UP), size=N),
                   'num': np.random.random(size=N),}
                  ).set_index(cols)

Test (each timing repeated 3 times):

pd.__version__
1.4.4

%%timeit
df.sum(level=cols)
316 ms ± 85.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
287 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
297 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%%timeit
df.groupby(level=cols).sum()
284 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
286 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
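
The outputs also match, which is consistent with sum(level=...) doing a groupby under the hood, as noted in the comments. A quick self-contained check (a sketch, assuming pandas 1.x where the level keyword still works):

import numpy as np
import pandas as pd
from string import ascii_uppercase as UP

np.random.seed(0)
N = 10_000
cols = ['A', 'B', 'C']
data = {c: np.random.choice(list(UP), size=N) for c in cols}
data['num'] = np.random.random(size=N)
df = pd.DataFrame(data).set_index(cols)

# check_like=True ignores row order, which can differ between the two calls
pd.testing.assert_frame_equal(df.sum(level=cols),
                              df.groupby(level=cols).sum(),
                              check_like=True)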

Updated example from OP:

import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
nb_cols = 10 
cols = [chr(65+i) for i in range(nb_cols)] 
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_cols)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)

print(pd.__version__)
1.5.0


%%timeit
df.sum(level=cols)
3.36 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df.groupby(level=cols).sum()
2.94 s ± 444 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
mozway
  • Actually your test has only three levels; when you increase them, the performance worsens and matches my results. numpy.sum outperforms the groupby. See code below. – Carmellose Oct 27 '22 at 08:21
  • Can you edit your question instead? – mozway Oct 27 '22 at 08:24
  • @Carmellose still get similar results (or `df.sum` slower), not sure if this is what earned me a downvote, I'm just trying to reproduce your potential issue… – mozway Oct 27 '22 at 08:34
  • what about version 1.4.4? Are the results the same as with 1.5.0? – Carmellose Oct 27 '22 at 19:08
  • @Carmellose maybe a bit slower for groupby. Not sure if this is significant. There could also have been improvements between the two versions. Why don't you try your code on 1.5.0? – mozway Oct 27 '22 at 19:16
  • Thanks. I'm not using 1.5.0 because my stack is constrained by a production environment and I don't have access to it :( – Carmellose Oct 28 '22 at 14:34