
Question

Looking for something like

df.groupby('key').aggregate(combination(columnA, columnB))

instead of

df['combination'] = combination(columnA, columnB)
df.groupby('key')['combination'].aggregate()

The only requirement is that the combination of columns is calculated after the groupby.


Description

It seems natural, logically speaking, in some cases to first group and then aggregate.

One example would be different aggregate functions for different combinations of columns that use the same groups.

Looking for

  1. groupby
  2. choose combination of columns
  3. use the corresponding aggregate function

instead of

  1. create all the necessary columns (for every aggregate function)
  2. groupby (for every aggregate function)
  3. apply specific aggregate function
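
The three-step pattern above can be sketched with today's pandas by reusing one groupby object for several column combinations, each with its own aggregate function (a sketch; the column names follow the example below, and `prod_mean`/`sum_max` are made-up result names):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["key1", "key2", "key1", "key3", "key2"],
    "ColumnA": [1, 2, 3, 4, 5],
    "ColumnB": [1, 2, 3, 4, 5],
})

# one groupby, reused for two different column combinations
g = df.groupby("key")
result = pd.DataFrame({
    # combination A*B, aggregated with mean
    "prod_mean": g.apply(lambda x: (x.ColumnA * x.ColumnB).mean()),
    # combination A+B, aggregated with max
    "sum_max": g.apply(lambda x: (x.ColumnA + x.ColumnB).max()),
})
print(result)
```

The combination is computed after the split, as requested, but each `apply` call falls back to a per-group Python loop rather than a single vectorized pass.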

Example

key     ColumnA  ColumnB
key1       1        1
key2       2        2
key1       3        3
key3       4        4
key2       5        5

# can do
df['combination'] = df.ColumnA * df.ColumnB
df.groupby('key')['combination'].mean()

#looking for
grouped = df.groupby('key')
grouped.mean(columnA * columnB)
giannisl9
    I’m not sure I understand the difference. They should give equivalent answers, regardless of whether you multiply before or within the group. The only major difference is that the latter is horribly inefficient because it falls back to a slow python loop, the former has a fast cython implementation – ALollz Feb 02 '20 at 17:48
  • @anky_91 It is a function of two columns which gives a single output in the case I am aware of. For the answer, it can be substituted. – giannisl9 Feb 02 '20 at 18:00
  • @ALollz Yes, they are equivalent. The difference is in the way you express it and use it. I am looking if something exists in pandas. – giannisl9 Feb 02 '20 at 18:02

1 Answer


OK, so I think the answer you are looking for is: we don't do that, because of vectorization in Python.

Consider the below code.

In essence, Python libraries such as numpy and pandas are optimized to execute certain mathematical operations in a vectorized way, which means applying an operation to the whole vector is faster than breaking it into chunks and executing it piece by piece.

So e.g. df["A"].mul(df["B"]) will be faster than df.apply(lambda row: row["A"]*row["B"], axis=1). The same goes for grouping - the vectorized form is just far more scalable.
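
A minimal timing sketch of that claim (the frame shape and seed are assumptions, not the asker's data):

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 9, size=(100_000, 2)), columns=["A", "B"])

# vectorized: one call over the whole column pair
t0 = time.perf_counter()
fast = df["A"].mul(df["B"])
t_fast = time.perf_counter() - t0

# row by row: a Python-level loop under the hood
t0 = time.perf_counter()
slow = df.apply(lambda row: row["A"] * row["B"], axis=1)
t_slow = time.perf_counter() - t0

assert fast.equals(slow)  # same numbers either way
print(f"vectorized: {t_fast:.4f}s, row-wise: {t_slow:.4f}s")
```

Both produce the identical Series; only the execution strategy differs.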

Try the code below - it is essentially what you were referring to, i.e. doing the operation before vs. after groupby(...). The vectorized solution scales up well even though it materializes an additional column; the more rows you process, the bigger the difference you will see.

Edit

I added a vectorized solution on grouped data, so we have:

(1) we group, then evaluate lazily, line by line

(2) we process the full df in a vectorized way, then group and apply a built-in aggregating function

(3) we group, then process group by group in a vectorized way and apply the aggregating function

In essence, the results show that breaking the data down into chunks slows processing, regardless of whether it is per group or per record - so the vectorized solution scales better than any custom solution we can apply on top.

import pandas as pd
import numpy as np
import time

cols = [f"column{l}" for l in "ABCDE"]
x = np.random.randint(1, 9, (3000, 5))
df = pd.DataFrame(x, columns=cols)
df["cat"] = np.random.choice([f"key{l}" for l in "ABCDEFG"], size=3000)
df2 = df.copy()  # independent copies so the methods don't affect each other
df3 = df.copy()

# method 1: group first, then evaluate the row products per group with apply
s = time.time()
df.groupby("cat").apply(lambda z: np.prod(z[cols].values, axis=1).mean()).pipe(print)
e = time.time() - s
print(f"method 1: {e} s")

# method 2: materialize the product column first, then a built-in grouped mean
s = time.time()
df2["prod"] = np.prod(df2[cols], axis=1)
df2.groupby("cat")["prod"].mean().pipe(print)
e = time.time() - s
print(f"method 2: {e} s")

# method 3: group first, then compute the product group by group, vectorized
s = time.time()
res3 = [(k, np.prod(g[cols], axis=1).mean()) for k, g in df3.groupby("cat")]
print(res3)
e = time.time() - s
print(f"method 3: {e} s")
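
As a sanity check that the group-then-combine and combine-then-group orderings really agree (a minimal sketch on a tiny hand-built frame, not the benchmark data above):

```python
import pandas as pd

df = pd.DataFrame({
    "cat": ["keyA", "keyB", "keyA"],
    "columnA": [1, 2, 3],
    "columnB": [4, 5, 6],
})

# method 1 style: group first, then combine inside each group
m1 = df.groupby("cat").apply(lambda z: (z["columnA"] * z["columnB"]).mean())

# method 2 style: combine first, then a built-in grouped mean
df["prod"] = df["columnA"] * df["columnB"]
m2 = df.groupby("cat")["prod"].mean()

assert m1.equals(m2)  # identical per-group results
print(m1)
```

So the choice between the methods is purely about performance, not about the numbers you get back.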
Grzegorz Skibinski
    Great answer! Thank you very much. It seemed to me logical that one could achieve the same without using something as slow as apply. As a matter of fact, it is only selecting some columns and an operation between them in a groupby object. – giannisl9 Feb 02 '20 at 18:21
    I mean that after the split operation, combining the columns could also use a vectorized way. But it seems there is no way to do that? – giannisl9 Feb 02 '20 at 18:31
  • I added this concept- please correct me, if you think this could be done better - that would be my take on breaking down into chunks and processing them one by one in a vectorized way. – Grzegorz Skibinski Feb 02 '20 at 19:44