
I would like to compute the mean Amount per ID using groupby and mean. However, I only need the rows where Date is between 2016-01-01 and 2017-12-31.

import pandas as pd

d = {'ID': ['STCK123', 'STCK123', 'STCK123'], 'Amount': [250, 400, 350],
     'Date': ['2016-01-20', '2017-09-25', '2018-05-15']}

data = pd.DataFrame(data=d)
data = data[['ID', 'Amount', 'Date']]

# Parse the date strings into datetime64 values
data['Date'] = pd.to_datetime(data['Date'])

This gives the following DataFrame:

     ID  Amount        Date
STCK123     250  2016-01-20
STCK123     400  2017-09-25
STCK123     350  2018-05-15

When I use:

data.groupby(['ID'])['Amount'].agg('mean')

It takes all rows into account, resulting in a mean of about 333.3. How can I exclude the rows where Date falls in 2018, yielding a mean of (250 + 400) / 2 = 325?

nick
apples-oranges

1 Answer


You'll need a pre-filtering step with query:

data.query('Date.dt.year != 2018').groupby('ID')[['Amount']].mean()

         Amount
ID             
STCK123   325.0

More uses for eval, query, and their associated parameters can be found in my write-up: Dynamic Expression Evaluation in pandas using pd.eval()

See here for more methods on dropping rows before calling groupby.
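Since the question asks for an explicit date range rather than "not 2018", one such row-dropping method is a boolean filter built with Series.between. The following is a minimal sketch, assuming the question's sample data and that both endpoints should be included (the default for between); the names in_range and mean_per_id are only illustrative.

```python
import pandas as pd

# Sample data from the question, with Date parsed as datetime64
data = pd.DataFrame({
    'ID': ['STCK123', 'STCK123', 'STCK123'],
    'Amount': [250, 400, 350],
    'Date': pd.to_datetime(['2016-01-20', '2017-09-25', '2018-05-15']),
})

# True for rows whose Date lies in [2016-01-01, 2017-12-31], endpoints included
in_range = data['Date'].between('2016-01-01', '2017-12-31')

# Keep only the rows in range, then average Amount per ID
mean_per_id = data.loc[in_range].groupby('ID')['Amount'].mean()
print(mean_per_id)
```

Unlike the year check, this keeps working if the data later contains dates before 2016 or after 2018.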


You can also mask those rows instead of dropping them. mask replaces every value in the matching rows, including ID, with NaN, and GroupBy skips NaN keys and NaN values when aggregating.

data.mask(data.Date.dt.year == 2018).groupby('ID')[['Amount']].mean()

         Amount
ID             
STCK123   325.0 
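If only the Amount column should be blanked out while ID and Date stay intact, a variant of the masking idea is to apply Series.where to that single column. This is a sketch under the same assumptions as the question's sample data; masked_amount and mean_masked are illustrative names.

```python
import pandas as pd

# Sample data from the question, with Date parsed as datetime64
data = pd.DataFrame({
    'ID': ['STCK123', 'STCK123', 'STCK123'],
    'Amount': [250, 400, 350],
    'Date': pd.to_datetime(['2016-01-20', '2017-09-25', '2018-05-15']),
})

# Keep Amount where the year is not 2018; other entries become NaN
masked_amount = data['Amount'].where(data['Date'].dt.year != 2018)

# mean() skips NaN by default, so the 2018 row does not contribute
mean_masked = masked_amount.groupby(data['ID']).mean()
print(mean_masked)
```

The original DataFrame is left unchanged, so the masked column can be compared against the raw values afterwards.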
cs95