
I have a pandas.DataFrame with a DatetimeIndex. The data are sampled roughly every 30 minutes (the interval can vary). I want to resample the data every hour and, for each group, extract the row with the highest value in a specific column.

E.g.

                     WindDir  WindSpeed  Temperature  CloudHgt
Date                                                          
2020-01-01 00:20:00    150.0        3.6          5.0     213.0
2020-01-01 00:50:00    150.0        3.1          5.0     213.0
2020-01-01 01:20:00    130.0        4.1          5.0     183.0
2020-01-01 01:50:00    130.0        3.6          5.0     183.0
2020-01-01 02:20:00    130.0        3.1          4.0     244.0
2020-01-01 02:50:00    140.0        3.6          4.0     366.0
2020-01-01 03:20:00    130.0        2.6          3.0   99999.0
2020-01-01 03:50:00    160.0        2.6          3.0      91.0
2020-01-01 04:20:00    170.0        2.6          3.0      61.0
2020-01-01 04:50:00    180.0        2.6          3.0      30.0
2020-01-01 05:20:00    150.0        1.5          3.0     183.0
2020-01-01 05:50:00    190.0        2.1          3.0      30.0
2020-01-01 06:20:00    190.0        2.1          3.0     122.0
2020-01-01 06:50:00    220.0        1.5          3.0     152.0
2020-01-01 07:20:00    230.0        2.1          3.0     213.0
2020-01-01 07:50:00    230.0        1.5          3.0     244.0
2020-01-01 08:20:00    999.0        1.0          3.0     335.0
2020-01-01 08:50:00    999.0        0.5          3.0     335.0
2020-01-01 09:20:00    120.0        2.1          3.0     335.0

And after resampling I want to obtain:

                     WindDir  WindSpeed  Temperature  CloudHgt
Date                                                          
2020-01-01 00:00:00    150.0        3.6          5.0     213.0
2020-01-01 01:00:00    130.0        4.1          5.0     183.0
2020-01-01 02:00:00    130.0        3.6          5.0     183.0
2020-01-01 03:00:00    140.0        3.6          4.0     366.0
2020-01-01 04:00:00    160.0        2.6          3.0      91.0
2020-01-01 05:00:00    180.0        2.6          3.0      30.0
2020-01-01 06:00:00    190.0        2.1          3.0      30.0
2020-01-01 07:00:00    230.0        2.1          3.0     213.0
2020-01-01 08:00:00    230.0        1.5          3.0     244.0
2020-01-01 09:00:00    120.0        2.1          3.0     335.0

I read several questions on SO such as this, this, this, this and this. However, most of these focus on DataFrames without a DatetimeIndex, for which the order of the samples does not matter. Also, since groupby re-labels each group with that group's start time, looking for samples with the same index is infeasible.

I found a working solution using `apply(lambda group: group.nlargest(1, columns=column))`. However, this is extremely slow on large DataFrames.

At the moment my code looks something like:

# define grouper
grouper = pd.Grouper(freq='1H', offset='30min')

# aggregate data (except wind dir)
resampled = (data
             .groupby(grouper)
             .apply(lambda group: group.nlargest(1, columns=column))
             .reset_index(level=-1, drop=True)
             ).shift(30,'min')
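For anyone wanting to reproduce the timing problem, the snippet above can be made self-contained; the `data` frame and `column` name below are stand-ins mirroring the sample, not the real dataset:

```python
import pandas as pd

# stand-in for the real data: two rows per hourly bin
data = pd.DataFrame(
    {"WindDir":   [150.0, 150.0, 130.0, 130.0],
     "WindSpeed": [3.6, 3.1, 4.1, 3.6]},
    index=pd.to_datetime(["2020-01-01 00:35", "2020-01-01 00:50",
                          "2020-01-01 01:35", "2020-01-01 01:50"]),
)
column = "WindSpeed"  # column whose maximum selects the row

# hourly bins shifted by 30 min, so each bin straddles the hour mark
grouper = pd.Grouper(freq="1H", offset="30min")

# per bin, keep the single row with the largest `column` value, drop the
# inner timestamp level, then move the bin labels forward to the full hour
resampled = (data
             .groupby(grouper)
             .apply(lambda group: group.nlargest(1, columns=column))
             .reset_index(level=-1, drop=True)
             ).shift(30, 'min')
```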

Is there a better solution that does not use the `apply` function, since it is extremely slow?

Luca

1 Answer

# for each hourly bin (timestamps rounded up with ceil), keep the row with the largest WindDir
df.groupby(df.index.ceil(freq="H")).apply(lambda x: x.loc[x.WindDir.idxmax(), :])
#                      WindDir  WindSpeed  Temperature  CloudHgt
# Date                                                          
# 2020-01-01 01:00:00    150.0        3.6          5.0     213.0
# 2020-01-01 02:00:00    130.0        4.1          5.0     183.0
# 2020-01-01 03:00:00    140.0        3.6          4.0     366.0
# 2020-01-01 04:00:00    160.0        2.6          3.0      91.0
# 2020-01-01 05:00:00    180.0        2.6          3.0      30.0
# 2020-01-01 06:00:00    190.0        2.1          3.0      30.0
# 2020-01-01 07:00:00    220.0        1.5          3.0     152.0
# 2020-01-01 08:00:00    230.0        2.1          3.0     213.0
# 2020-01-01 09:00:00    999.0        1.0          3.0     335.0
# 2020-01-01 10:00:00    120.0        2.1          3.0     335.0
d.b
  • The output doesn't look exactly like the OP's expected output - your dates start with 01:00:00 and end with 10:00:00 - the OP's start with 00:00:00 and end with 09:00:00. – user17242583 Nov 10 '21 at 17:55
  • @user17242583, OP might need to use `floor` instead of `ceil` – d.b Nov 10 '21 at 17:59
  • On a different note, what purpose does the `:` after the `idxmax()` serve? When I remove it, it doesn't change the output. – user17242583 Nov 10 '21 at 18:04
  • @user17242583, the `:` is for retrieving all columns. Removing it should not affect the output, but it's just my habit. – d.b Nov 10 '21 at 18:08
  • @user17242583, compare `df.groupby(df.index.floor(freq="H")).apply(lambda x: x.loc[x.WindDir.idxmax(), ["WindSpeed", "CloudHgt"]])` for example – d.b Nov 10 '21 at 18:11
  • Yeah, that makes sense. I played around with that and found that you can even do bizarre stuff like `x.loc[x.WindDir.idxmax(), 'WindSpeed':'CloudHgt']` – user17242583 Nov 10 '21 at 18:13
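Following up on the `floor` suggestion, a vectorized sketch that drops `apply` entirely could look like this; the frame below is a hypothetical excerpt of the question's sample, and `winners` is a name introduced here. Since `groupby(...).idxmax()` runs in compiled code and the row selection happens in a single `.loc` call, this should be considerably faster than a per-group lambda:

```python
import pandas as pd

# hypothetical excerpt mirroring the first rows of the question
df = pd.DataFrame(
    {"WindDir":   [150.0, 150.0, 130.0, 130.0],
     "WindSpeed": [3.6, 3.1, 4.1, 3.6]},
    index=pd.to_datetime(["2020-01-01 00:20", "2020-01-01 00:50",
                          "2020-01-01 01:20", "2020-01-01 01:50"]),
)

# one idxmax per hourly bin gives the timestamp of each winning row;
# a single .loc then fetches all of them at once, with no Python-level loop
winners = df.groupby(df.index.floor("H"))["WindSpeed"].idxmax()
out = df.loc[winners]
out.index = winners.index  # relabel each row with its hour-start bin
```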