I have a pandas.Dataframe
with a DatetimeIndex
. The data are sampled every 30 minutes (can be different). I want to resample the data every hour and, for each group, I want to extract the row with the highest value on a specific column.
E.g.
WindDir WindSpeed Temperature CloudHgt
Date
2020-01-01 00:20:00 150.0 3.6 5.0 213.0
2020-01-01 00:50:00 150.0 3.1 5.0 213.0
2020-01-01 01:20:00 130.0 4.1 5.0 183.0
2020-01-01 01:50:00 130.0 3.6 5.0 183.0
2020-01-01 02:20:00 130.0 3.1 4.0 244.0
2020-01-01 02:50:00 140.0 3.6 4.0 366.0
2020-01-01 03:20:00 130.0 2.6 3.0 99999.0
2020-01-01 03:50:00 160.0 2.6 3.0 91.0
2020-01-01 04:20:00 170.0 2.6 3.0 61.0
2020-01-01 04:50:00 180.0 2.6 3.0 30.0
2020-01-01 05:20:00 150.0 1.5 3.0 183.0
2020-01-01 05:50:00 190.0 2.1 3.0 30.0
2020-01-01 06:20:00 190.0 2.1 3.0 122.0
2020-01-01 06:50:00 220.0 1.5 3.0 152.0
2020-01-01 07:20:00 230.0 2.1 3.0 213.0
2020-01-01 07:50:00 230.0 1.5 3.0 244.0
2020-01-01 08:20:00 999.0 1.0 3.0 335.0
2020-01-01 08:50:00 999.0 0.5 3.0 335.0
2020-01-01 09:20:00 120.0 2.1 3.0 335.0
And I want to resample this obtaining:
WindDir WindSpeed Temperature CloudHgt
Date
2020-01-01 00:00:00 150.0 3.6 5.0 213.0
2020-01-01 01:00:00 130.0 4.1 5.0 183.0
2020-01-01 02:00:00 130.0 3.6 5.0 183.0
2020-01-01 03:00:00 140.0 3.6 4.0 366.0
2020-01-01 04:00:00 160.0 2.6 3.0 91.0
2020-01-01 05:00:00 180.0 2.6 3.0 30.0
2020-01-01 06:00:00 190.0 2.1 3.0 30.0
2020-01-01 07:00:00 230.0 2.1 3.0 213.0
2020-01-01 08:00:00 230.0 1.5 3.0 244.0
2020-01-01 09:00:00 120.0 2.1 3.0 335.0
I read several questions on SO such as this, this, this, this and this. Hoever most of these focus on Dataframes without DatetimeIndex for which the order of the samples does not matter. Also, since Groupby re-labels the group with the start-time of that group, looking for samples with the same index is unfeasible.
I found a working solution using apply(lambda group: group.nlargest(1, columns=column))
. However this is extremely slow on large Dataframes.
At the moment my code looks something like:
# define grouper
grouper = pd.Grouper(freq='1H', offset='30min')
# aggregate data (except wind dir)
resampled = (data
.groupby(grouper)
.apply(lambda group: group.nlargest(1, columns=column))
.reset_index(level=-1, drop=True)
).shift(30,'min')
Is there a better solution that does not uses the apply
function as it is extremely slow?