
I have the following dataframe...

             date_time  day  value1  value2
0  2023-03-15 00:00:00    3       1       1
1  2023-03-15 06:00:00    3       2       2
2  2023-03-15 12:00:00    3       3       3
3  2023-03-15 18:00:00    3       4       4
4  2023-03-16 00:00:00    4       5       5
5  2023-03-16 06:00:00    4       6       6
6  2023-03-16 12:00:00    4       7       7
7  2023-03-16 18:00:00    4       8       8
8  2023-03-17 00:00:00    5       9       9
9  2023-03-17 06:00:00    5      10      10
10 2023-03-17 12:00:00    5      11      11
11 2023-03-17 18:00:00    5      12      12
12 2023-03-20 06:00:00    1      13      13

I am trying to group the data by day, offset by 7 hours, using the following...

rows = df.resample('24H', on='date_time', origin='epoch', offset='7H').agg({
    'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})

This gives me...

                              date_time  day  value1  value2
date_time                                                   
2023-03-14 07:00:00 2023-03-15 06:00:00  3.0     1.0     2.0
2023-03-15 07:00:00 2023-03-16 06:00:00  4.0     3.0     6.0
2023-03-16 07:00:00 2023-03-17 06:00:00  5.0     7.0    10.0
2023-03-17 07:00:00 2023-03-17 18:00:00  5.0    11.0    12.0
2023-03-19 07:00:00 2023-03-20 06:00:00  1.0    13.0    13.0

This is almost correct, but what I also want is to group any data that has a gap of more than 24 hours into a single row, like this...

                              date_time  day  value1  value2
date_time                                                   
2023-03-14 07:00:00 2023-03-15 06:00:00  3.0     1.0     2.0
2023-03-15 07:00:00 2023-03-16 06:00:00  4.0     3.0     6.0
2023-03-16 07:00:00 2023-03-17 06:00:00  5.0     7.0    10.0
2023-03-19 07:00:00 2023-03-20 06:00:00  1.0    11.0    13.0

After resampling the data, I tried to group the result using...

rows = rows.groupby(
    ((rows.date_time - rows.date_time.shift(-1)) < '-24H').cumsum()
).agg({'date_time': 'last', 'day': 'last', 'value1': 'first', 'value2': 'last'})

...but this resulted in...

                    date_time  day  value1  value2
date_time                                         
0         2023-03-17 06:00:00  5.0     1.0    10.0
1         2023-03-20 06:00:00  1.0    11.0    13.0

So in a nutshell, I want the first 3 rows from the resample, and the last row from the grouping to achieve what I am after.

I'm close, but can't quite figure it out.

Is there a way to do what I am after?

1 Answer


This is essentially a gaps-and-islands problem: whenever the difference between consecutive bins is 1 day or less, you start a new island. A larger gap leaves the island id unchanged, so the row after the gap merges into the same group as the row before it.

agg_dict = {
    "date_time": "last",
    "day": "last",
    "value1": "first",
    "value2": "last",
}
# 24-hour bins shifted to start at 07:00; dropna removes empty bins
rows = df.resample("24H", on="date_time", offset="7H").agg(agg_dict).dropna()

# diff <= 1 day -> True -> cumsum increments -> new island.
# diff > 1 day -> False -> island id unchanged -> merged with previous row.
island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
rows = rows.groupby(island).agg(agg_dict)
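End to end, on the sample data from the question (a minimal, self-contained sketch; rebuilding the example frame with a list comprehension is my own construction, not part of the original code):

```python
import pandas as pd

# Rebuild the sample frame from the question: 4 readings per day for
# March 15-17, then a single reading on Monday March 20
df = pd.DataFrame({
    "date_time": pd.to_datetime(
        [f"2023-03-{d} {h:02d}:00:00" for d in (15, 16, 17) for h in (0, 6, 12, 18)]
        + ["2023-03-20 06:00:00"]
    ),
    "day": [3] * 4 + [4] * 4 + [5] * 4 + [1],
    "value1": range(1, 14),
    "value2": range(1, 14),
})

agg_dict = {
    "date_time": "last",
    "day": "last",
    "value1": "first",
    "value2": "last",
}

# 24-hour bins starting at 07:00; the empty weekend bin is dropped
rows = df.resample("24H", on="date_time", offset="7H").agg(agg_dict).dropna()

# A gap of more than 1 day does NOT start a new island, so the lone
# Monday row merges with the last group before the gap
island = rows.index.to_series().diff().le("1D").cumsum().rename("island")
rows = rows.groupby(island).agg(agg_dict)
print(rows)
```

This yields four groups: the first three daily bins unchanged, and the Friday bin merged with the Monday row, matching the desired output.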
Code Different
  • Excellent, thank you. A very clear and concise answer. I'd also give you a +1 for the term 'gap-and-island', if I could. – Wannabe-Coder Mar 26 '23 at 09:11
  • My code is now working :-), but it got me thinking about a scenario where this would not work. If I removed the offset, the Friday/Monday combination would then contain 2 days' worth of data. How could I prevent this edge case, i.e. can a 2nd test be added? – Wannabe-Coder Mar 26 '23 at 10:58
  • I don't understand the edge case you described but you can tweak the gap-and-island algorithm. Can you make a new question to clarify what you meant? – Code Different Mar 26 '23 at 12:16
  • Hi Code Different, I have created a new question that can be found here [link](https://stackoverflow.com/questions/75849042/pandas-group-rows-on-date-condition-part-2) Hope it makes sense. – Wannabe-Coder Mar 26 '23 at 16:27