I have a dataframe with datetimes as an index and 2 columns.
What I want to do is split the df up into days, split those days up into blocks where the value of column 2 is equal to or higher than 70, and calculate both the duration (from the index) and the average of column 1 for each block.
I can think of ways of doing it, but those involve dictionaries and ugly loops or itertools.groupby, so I'd prefer a more pandas-idiomatic solution.
I'm already splitting up my bigger dataframe using: for x in df.groupby(pd.Grouper(freq='24H', base=12))
, which splits it into chunks like the one posted below
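For illustration, here is a minimal sketch of that daily split on toy data (column names are placeholders; note that recent pandas versions replaced the base= argument with offset=, so offset='12h' gives the same 12:00-anchored bins as base=12):

```python
import pandas as pd

# Toy frame: 36 hourly rows, so the 24-hour chunks anchored at 12:00
# split into two groups of 24 and 12 rows.
idx = pd.date_range('2018-04-13 12:00', periods=36, freq='h')
df = pd.DataFrame({'v1': range(36), 'v2': range(36)}, index=idx)

# Equivalent of pd.Grouper(freq='24H', base=12) in newer pandas.
sizes = df.groupby(pd.Grouper(freq='24h', offset='12h')).size()
print(sizes)
```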
2018-04-13 12:00:00 11.4 70.0
2018-04-13 13:00:00 11.3 73.0
2018-04-13 14:00:00 10.6 70.0
2018-04-13 15:00:00 11.2 67.0
2018-04-13 16:00:00 10.4 67.0
2018-04-13 17:00:00 10.8 66.0
2018-04-13 18:00:00 10.8 67.0
2018-04-13 19:00:00 10.8 67.0
2018-04-13 20:00:00 10.7 69.0
2018-04-13 21:00:00 10.3 70.0
2018-04-13 22:00:00 10.1 71.0
2018-04-13 23:00:00 9.8 73.0
2018-04-14 00:00:00 9.4 75.0
2018-04-14 01:00:00 9.1 77.0
2018-04-14 02:00:00 8.6 80.0
2018-04-14 03:00:00 7.8 84.0
2018-04-14 04:00:00 7.3 87.0
2018-04-14 05:00:00 8.3 87.0
2018-04-14 06:00:00 8.3 88.0
2018-04-14 07:00:00 8.6 88.0
2018-04-14 08:00:00 9.0 87.0
2018-04-14 09:00:00 11.3 83.0
2018-04-14 10:00:00 12.2 81.0
2018-04-14 11:00:00 12.4 82.0
I need (not in this exact format, just an example):
2018-04-13 12:00:00 - 3 hours - 71.0
2018-04-13 21:00:00 - 14 hours - 80.86
Any clean suggestions for this problem? It's kind of related to this, but I couldn't adapt it to my case.
I also thought about filtering out the rows where column 2 < 70, but then I'd have trouble recovering the consecutive blocks afterwards.
EDIT: my final solution
df['Time'] = df.index
for day, daydf in df.groupby(pd.Grouper(freq='24H', base=12)):
    # Keep only the rows at or above the threshold
    df_filtered = daydf[daydf['v2'] >= 70]
    # A gap of more than one hour between kept rows starts a new block
    grouping_key = df_filtered['Time'].diff().dt.total_seconds().ne(3600).cumsum()
    grouper = df_filtered.groupby(grouping_key)
    # Block start time, mean of v1, and number of hours (row count) per block
    aggregated_data = grouper.agg({'Time': 'first', 'v1': 'mean', 'v2': len})
    print(aggregated_data)
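For completeness, here is a self-contained version of that solution run against the sample chunk posted above (the column names v1/v2 are my own, and offset='12h' stands in for the base=12 that older pandas used; named aggregation replaces the agg dict, but the logic is the same):

```python
import pandas as pd

# Rebuild the posted 24-row chunk (12:00 through 11:00 the next day).
idx = pd.date_range('2018-04-13 12:00', periods=24, freq='h')
v1 = [11.4, 11.3, 10.6, 11.2, 10.4, 10.8, 10.8, 10.8, 10.7, 10.3, 10.1, 9.8,
      9.4, 9.1, 8.6, 7.8, 7.3, 8.3, 8.3, 8.6, 9.0, 11.3, 12.2, 12.4]
v2 = [70.0, 73.0, 70.0, 67.0, 67.0, 66.0, 67.0, 67.0, 69.0, 70.0, 71.0, 73.0,
      75.0, 77.0, 80.0, 84.0, 87.0, 87.0, 88.0, 88.0, 87.0, 83.0, 81.0, 82.0]
df = pd.DataFrame({'v1': v1, 'v2': v2}, index=idx)

df['Time'] = df.index
for day, daydf in df.groupby(pd.Grouper(freq='24h', offset='12h')):
    # Keep only the rows at or above the threshold
    df_filtered = daydf[daydf['v2'] >= 70]
    # A gap of more than one hour between kept rows starts a new block
    key = df_filtered['Time'].diff().dt.total_seconds().ne(3600).cumsum()
    out = df_filtered.groupby(key).agg(
        start=('Time', 'first'), v1_mean=('v1', 'mean'), hours=('v2', 'size'))
    print(out)
```

On this chunk it finds two blocks: one starting at 12:00 lasting 3 hours, and one starting at 21:00 running through the end of the chunk.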