
I have a dataframe ('frame') that I want to aggregate by Country and Date:

aggregated = pd.DataFrame(frame.groupby(['Country', 'Date']).CaseID.count())
aggregated["Total duration"] = frame.groupby(['Country', 'Date']).Hours.sum()
aggregated["Mean duration"] = frame.groupby(['Country', 'Date']).Hours.mean()

I want to compute the above figures (total duration, mean duration, etc.) only for the positive 'Hours' values in 'frame'. How can I do that?

Thanks!

Sample "frame"

import pandas as pd
Line1 = {"Country": "USA", "Date":"01 jan", "Hours":4}
Line2 = {"Country": "USA", "Date":"01 jan", "Hours":3}
Line3 = {"Country": "USA", "Date":"01 jan", "Hours":-999}
Line4 = {"Country": "Japan", "Date":"01 jan", "Hours":3}
frame = pd.DataFrame([Line1, Line2, Line3, Line4])

2 Answers


Not as elegant as the other answer, but it handles some corner cases differently (df stands for frame from the original question).

>>> df.groupby(['Country','Date']).agg(lambda x: x[x>0].mean())
                Hours
Country Date
Japan   01 jan    3.0
USA     01 jan    3.5
>>> df.loc[3, 'Hours'] = -1
>>> df.groupby(['Country','Date']).agg(lambda x: x[x>0].mean())
                Hours
Country Date
Japan   01 jan    NaN
USA     01 jan    3.5
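
Building on this, a sketch that computes all three of the question's figures over positive Hours in one pass. The output column names here are my own choice, and named aggregation like this needs pandas 0.25 or later:

frame.groupby(['Country', 'Date']).Hours.agg(
    case_count=lambda x: (x > 0).sum(),        # number of rows with positive Hours
    total_duration=lambda x: x[x > 0].sum(),   # sum over positive Hours only
    mean_duration=lambda x: x[x > 0].mean(),   # mean over positive Hours only
)
# for the sample frame: USA/01 jan -> 2, 7, 3.5 and Japan/01 jan -> 1, 3, 3.0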
  • A better approach would be to just use `NaN` as the sentinel value instead of `-999`, and then do no filtering at all and use `nanmean` or other `nan`-insensitive stats functions that have implicit, faster filtering already within them (see the sketch after these comments). But I realize you are taking the data as a given from the OP's question. – ely Dec 06 '13 at 19:36
  • One reason to prefer doing the filtering before the groupby is reuse (e.g. for sum, count, etc.); my guess is it will be faster to reuse the filtered result (though perhaps less clear). – Andy Hayden Dec 06 '13 at 20:06
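
For reference, a minimal sketch of the sentinel-free approach ely describes, assuming -999 is the only sentinel value in the data:

import numpy as np

# replace the -999 sentinel with NaN in the Hours column only
clean = frame.replace({'Hours': {-999: np.nan}})
# count, sum and mean all skip NaN by default (skipna=True), so no filter is needed
clean.groupby(['Country', 'Date']).Hours.agg(['count', 'sum', 'mean'])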

How about filtering before grouping:

frame[frame["Hours"] > 0].groupby(['Country','Date'])
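
As written this only builds the GroupBy object. A sketch of finishing it to reproduce the question's aggregated frame, reusing the filtered groupby as Andy Hayden's comment on the other answer suggests:

positive = frame[frame["Hours"] > 0]
grouped = positive.groupby(['Country', 'Date'])

# CaseID exists in the OP's real data but not in the sample frame;
# substitute grouped.Hours.count() when running against the sample
aggregated = pd.DataFrame(grouped.CaseID.count())
aggregated["Total duration"] = grouped.Hours.sum()
aggregated["Mean duration"] = grouped.Hours.mean()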