2

I have the following df and I'd like to group it by Date & Ref but with sum conditions.

In this respect I'd need to group by Date & Ref and sum 'Q' column only if P is >= than PP.

df = DataFrame({'Date' : ['1', '1', '1', '1'],
                'Ref' : ['one', 'one', 'two', 'two'],
                'P' : ['50', '65', '30', '38'],
                'PP' : ['63', '63', '32', '32'],
                'Q' : ['10', '15', '20', '10']})

df.groupby(['Date','Ref'])['Q'].sum() #This does the right grouping byt summing the whole column
df.loc[df['P'] >= df['PP'], ('Q')].sum() #this has the right sum condition, but does not divide between Date & Ref

Is there a way to do that? Many thanks in advance

Nick ODell
  • 15,465
  • 3
  • 32
  • 66
Rose
  • 203
  • 2
  • 10

2 Answers2

5

Just filter prior to grouping:

In[15]:
df[df['P'] >= df['PP']].groupby(['Date','Ref'])['Q'].sum()

Out[15]: 
Date  Ref
1     one    15
      two    10
Name: Q, dtype: object

This reduces the size of the df in the first place so will speed up the groupby operation

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Many thanks Ed. And for dividing this grouping by the total Q (no conditional)? Is there an easy way of doing it? – Rose Jan 24 '19 at 16:08
  • If you have a new question then you should post a new question, answering new questions in comments is counter-productive. Besides are you wanting to applying multiple functions to a `groupby`? You may want to read the [docs](http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once) – EdChum Jan 24 '19 at 16:15
2

You could do:

import pandas as pd

df = pd.DataFrame({'Date' : ['1', '1', '1', '1'],
                'Ref' : ['one', 'one', 'two', 'two'],
                'P' : ['50', '65', '30', '38'],
                'PP' : ['63', '63', '32', '32'],
                'Q' : ['10', '15', '20', '10']})

def conditional_sum(x):
    return x[x['P'] >= x['PP']].Q.sum()

result = df.groupby(['Date','Ref']).apply(conditional_sum)

print(result)

Output

Date  Ref
1     one    15
      two    10
dtype: object

UPDATE

If you want to sum multiple columns in the output, you could use loc:

def conditional_sum(x):
    return x.loc[x['P'] >= x['PP'], ['Q', 'P']].sum()


result = df.groupby(['Date', 'Ref']).apply(conditional_sum)

print(result)

Output

             Q     P
Date Ref            
1    one  15.0  65.0
     two  10.0  38.0

Note that in the example above I used column P for the sake of showing how to do it with multiple columns.

Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76