
I have a dataframe with datetimes as an index, and 2 columns.

What I want to do is split the df up into days, split those days up into blocks where the value of column 2 is equal to or higher than 70, and calculate both the duration (from the index) and the average of column 1 for each of those blocks.

I can think of ways of doing it, but those involve dictionaries and ugly loops, or groupby from itertools, so I'd prefer a more pandas-like solution.

I'm already splitting up my bigger dataframe with for x in df.groupby(pd.Grouper(freq='24H', base=12)), which splits it into chunks like the one posted below.
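For reference, a minimal self-contained sketch of that daily split (the column names v1 and v2 are assumptions here, matching the final solution further down):

import pandas as pd

# Sketch only: an hourly frame grouped into 24-hour chunks anchored at 12:00.
# Note: base=12 was replaced by offset='12h' in pandas >= 1.1.
idx = pd.date_range('2018-04-13 12:00', periods=48, freq='H')
df = pd.DataFrame({'v1': 10.0, 'v2': 70.0}, index=idx)

for day, daydf in df.groupby(pd.Grouper(freq='24H', base=12)):
    print(day, len(daydf))

A chunk of the actual data looks like this: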

2018-04-13 12:00:00  11.4  70.0
2018-04-13 13:00:00  11.3  73.0
2018-04-13 14:00:00  10.6  70.0
2018-04-13 15:00:00  11.2  67.0
2018-04-13 16:00:00  10.4  67.0
2018-04-13 17:00:00  10.8  66.0
2018-04-13 18:00:00  10.8  67.0
2018-04-13 19:00:00  10.8  67.0
2018-04-13 20:00:00  10.7  69.0
2018-04-13 21:00:00  10.3  70.0
2018-04-13 22:00:00  10.1  71.0
2018-04-13 23:00:00   9.8  73.0
2018-04-14 00:00:00   9.4  75.0
2018-04-14 01:00:00   9.1  77.0
2018-04-14 02:00:00   8.6  80.0
2018-04-14 03:00:00   7.8  84.0
2018-04-14 04:00:00   7.3  87.0
2018-04-14 05:00:00   8.3  87.0
2018-04-14 06:00:00   8.3  88.0
2018-04-14 07:00:00   8.6  88.0
2018-04-14 08:00:00   9.0  87.0
2018-04-14 09:00:00  11.3  83.0
2018-04-14 10:00:00  12.2  81.0
2018-04-14 11:00:00  12.4  82.0

I need something like this (not in this exact formatting, just an example):

2018-04-13 12:00:00  - 3 hours - 71.0
2018-04-13 21:00:00  - 14 hours - 80.86

Any clean suggestions for this problem? It's kind of related to this question, but I couldn't adapt it to my example.

I also thought about filtering out the rows where column 2 < 70, but then I'd have trouble getting the consecutiveness back.

EDIT: my final solution

import pandas as pd

df['Time'] = df.index  # keep the timestamp available as a regular column
for day, daydf in df.groupby(pd.Grouper(freq='24H', base=12)):
    # keep only the rows where v2 is at least 70
    df_filtered = daydf[daydf['v2'] >= 70]
    # start a new group whenever the gap to the previous kept row is not exactly one hour
    grouping_key = df_filtered.Time.diff().dt.total_seconds().ne(3600).cumsum()
    grouper = df_filtered.groupby(grouping_key)
    # per block: start time, mean of v1, and number of hourly rows (i.e. the duration in hours)
    aggregated_data = grouper.agg({'Time': 'first', 'v1': 'mean', 'v2': len})
    print(aggregated_data)
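Since the data is hourly, the len count in the agg doubles as the duration in hours. If an explicit timedelta is preferred, one option (a sketch, not part of the original solution) is:

# Sketch: duration of each block as a timedelta (assumes hourly, gap-free rows within a block)
durations = grouper['Time'].agg(lambda t: t.max() - t.min() + pd.Timedelta(hours=1))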
Joost
  • My data is hourly, but I'm uncertain how that changes things. I can count rows, but my goal is to group them together and do something with the group-specific values. – Joost Apr 17 '18 at 15:19

1 Answer


After the filter, you can use groupby with diff to find the consecutive runs and assign them to the same group; with that group key, we can then use groupby + agg:

s.groupby(s.Time.diff().dt.total_seconds().ne(3600).cumsum()).agg({'Time':'first','v2':'mean','v1':len})
Out[701]: 
                    Time         v2   v1
Time                                    
1    2018-04-13 12:00:00  71.000000  3.0
2    2018-04-13 21:00:00  71.333333  3.0
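Step by step, the grouping key does the following (a sketch, assuming s is the data below already filtered to v2 >= 70):

gaps = s.Time.diff()                       # timedelta to the previous kept row (NaT for the first row)
breaks = gaps.dt.total_seconds().ne(3600)  # True wherever the gap is not exactly one hour
key = breaks.cumsum()                      # running count of breaks, so each consecutive run shares one id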

Data input

Time,v1,v2
2018-04-13 12:00:00,11.4,70.0
2018-04-13 13:00:00,11.3,73.0
2018-04-13 14:00:00,10.6,70.0
2018-04-13 15:00:00,11.2,67.0
2018-04-13 16:00:00,10.4,67.0
2018-04-13 17:00:00,10.8,66.0
2018-04-13 18:00:00,10.8,67.0
2018-04-13 19:00:00,10.8,67.0
2018-04-13 20:00:00,10.7,69.0
2018-04-13 21:00:00,10.3,70.0
2018-04-13 22:00:00,10.1,71.0
2018-04-13 23:00:00,9.8,73.0
BENY
  • It looks pretty cryptic, but it's working. Thanks a lot! – Joost Apr 17 '18 at 15:29
  • Yes, I was still piecing together how exactly it worked. I added my final solution to the question. Any comment on that? Also one follow-up question, although it's not really important: can you also use the index instead of a column name to do these operations on? It's throwing me `AttributeError: 'DatetimeIndex' object has no attribute 'diff'` – Joost Apr 17 '18 at 15:46
  • @Joost yes, you can just do `df.index.get_level_values(0)` – BENY Apr 17 '18 at 15:58
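As a side note on that last point, a small sketch of the index-based variant (an assumption, not taken from the thread): converting the index to a Series makes .diff() available, so no helper 'Time' column is needed:

# Sketch: build the same grouping key from the DatetimeIndex itself
key = df_filtered.index.to_series().diff().dt.total_seconds().ne(3600).cumsum()
out = df_filtered.groupby(key).agg({'v1': 'mean', 'v2': 'size'})
out['start'] = df_filtered.index.to_series().groupby(key).first()  # first timestamp per block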