I have a file of half-hourly data which I wish to group together by hour. This works:

data.groupby(pd.Grouper(freq='1h')).agg('sum')

However, it fills in hours where there is no half-hourly data.

How can I perform a grouping like this but not create records where there was no data in the input DataFrame? e.g. From:

07:00
07:30
09:00
09:30

I want to create:

07:00
09:00

Not:

07:00
08:00
09:00

Thanks

Julian7
  • You can always filter out rows after getting your result: `result = result[result.index.isin(data.index)]` Don't think there's a way to do that with `groupby` directly. – cs95 Jul 09 '20 at 19:38
  • you can round down to the hour then drop duplicates keeping the first value, can you post some sample of your data? – Umar.H Jul 09 '20 at 19:38

1 Answer

If you just want to filter out the added times, you can do what cs95 said in the comments or:

out = data.groupby(pd.Grouper(freq='1h')).sum(min_count=1).dropna()

With min_count=1, sum() returns NaN for any bin that contains no data, and dropna() then removes those rows.
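
As a minimal sketch of that behaviour, using the same example data as at the end of this answer (the name `binned` is only illustrative; the sums come back as floats because the intermediate result contains NaN):

```python
import pandas as pd

# Half-hourly data from 07:00 to 09:30 with the 08:00 hour removed
dr = pd.date_range('2020-01-01 07:00', periods=6, freq='30min')
data = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=dr)
data = data[data.index.hour != 8]

# min_count=1 leaves NaN in bins that had no rows (here 08:00)
binned = data.groupby(pd.Grouper(freq='1h')).sum(min_count=1)

# dropna() removes the empty bins, leaving only 07:00 and 09:00
out = binned.dropna()
print(out)
```

If the real data can contain NaN sums of its own, or has several columns, `dropna(how='all')` may be the safer choice, since plain `dropna()` drops a row when any column is NaN.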

If you would rather not have those extra bins computed in the first place, things get more complicated (note that there is a similar open post on this, also from today). But for an hourly bin frequency, you can group on the hour of the index:

out1 = data.groupby(data.index.hour).sum()

And if the data span multiple days, you could do:

out2 = data.groupby([data.index.date, data.index.hour]).sum()

But note that the result's index is no longer a DatetimeIndex (out1 is indexed by integer hour, out2 by a (date, hour) MultiIndex), so you might need to convert back.
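
One possible way to convert back, sketched here assuming the input has a DatetimeIndex, is to rebuild timestamps from the (date, hour) levels. Alternatively, along the lines of Umar.H's comment about rounding down to the hour, grouping on the floored index (a different technique from the one above) keeps a DatetimeIndex throughout. The names `rebuilt` and `out3` are only illustrative.

```python
import pandas as pd

# Same example data: half-hourly values with the 08:00 hour removed
dr = pd.date_range('2020-01-01 07:00', periods=6, freq='30min')
data = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=dr)
data = data[data.index.hour != 8]

out2 = data.groupby([data.index.date, data.index.hour]).sum()

# Rebuild a DatetimeIndex from the (date, hour) MultiIndex levels
dates = pd.to_datetime(out2.index.get_level_values(0))
hours = pd.to_timedelta(out2.index.get_level_values(1), unit='h')
rebuilt = out2.copy()
rebuilt.index = dates + hours

# Alternative: group on timestamps floored to the hour,
# which only produces groups for hours present in the data
out3 = data.groupby(data.index.floor('h')).sum()
print(out3)
```

Both `rebuilt` and `out3` should end up with one row per hour that actually appears in the input.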

Here's the example data I used:

import pandas as pd

dr = pd.date_range('1-1-2020 7:00', periods=6, freq='30min')
data = pd.DataFrame([10,20,30,40,50,60], index=dr, columns=['Values'])
data = data[data.index.hour != 8]
Tom