9

Is it possible to directly compute the product (or for example sum) of two columns without using

grouped.apply(lambda x: (x.a*x.b).sum()

It is much (less than half the time on my machine) faster to use

df['helper'] = df.a*df.b
grouped= df.groupby(something)
grouped['helper'].sum()
df.drop('helper', axis=1)

But I don't really like having to do this. It is for example useful to compute the weighted average per group. Here the lambda approach would be

grouped.apply(lambda x: (x.a*x.b).sum()/(df.b).sum())

and again is much slower than dividing the helper by b.sum().

piRSquared
  • 285,575
  • 57
  • 475
  • 624
Arthur G
  • 1,397
  • 1
  • 14
  • 23

3 Answers3

8

I want to eventually build an embedded array expression evaluator (Numexpr on steroids) to do things like this. Right now we're working with the limitations of Python-- if you implemented a Cython aggregator to do (x * y).sum() then it could be connected with groupby, but ideally you could write the Python expression as a function:

def weight_sum(x, y):
    return (x * y).sum()

and that would get "JIT-compiled" and be about as fast as groupby(...).sum(). What I'm describing is a pretty significant (many month) project. If there were a BSD-compatible APL implementation I might be able to do something like the above quite a bit sooner (just thinking out loud).

Wes McKinney
  • 101,437
  • 32
  • 142
  • 108
0

How about directly group the result of x.a*x.b, for example:

from pandas import *
from numpy.random import randn
df = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
               'foo', 'bar', 'foo', 'foo'],
        'B' : ['one', 'one', 'two', 'three',
               'two', 'two', 'one', 'three'],
        'C' : randn(8), 'D' : randn(8)})

print (df.C*df.D).groupby(df.A).sum()
HYRY
  • 94,853
  • 25
  • 187
  • 187
  • This works of course. But I suspect that first the whole vector C*D is built in memory, then it is grouped and then summed. I wouldn't have to do this if I could efficiently walk through the rows, summing the c_i*d_i (or only building C*D group-wise and then sum them while walking through the groups). – Arthur G Apr 07 '12 at 14:37
0

The answer came many years later via pydata blaze

from blaze import *
data = Data(df)
somethings = odo(
by(data.something, 
   wm = (data.a * data.weights).sum()/data.weights.sum()),
pd.DataFrame)
tipanverella
  • 3,477
  • 3
  • 25
  • 41