groupby with overlapping intervals timeseries

Question

I have a time series in python pandas dataframe object and I want to create a group based on index but I want overlapping groups i.e groups are not distinct. The header_sec is the index column. Each groups consists of a 2 second window. Input dataFrame

    header_sec
1  17004 days 22:17:13 
2  17004 days 22:17:13 
3  17004 days 22:17:13 
4  17004 days 22:17:13 
5  17004 days 22:17:14
6  17004 days 22:17:14
7  17004 days 22:17:14
8  17004 days 22:17:14
9  17004 days 22:17:15
10 17004 days 22:17:15
11 17004 days 22:17:15
12 17004 days 22:17:15
13 17004 days 22:17:16
14 17004 days 22:17:16
15 17004 days 22:17:16
16 17004 days 22:17:16
17 17004 days 22:17:17
18 17004 days 22:17:17
19 17004 days 22:17:17
20 17004 days 22:17:17

My first group should have

1  17004 days 22:17:13 
2  17004 days 22:17:13 
3  17004 days 22:17:13 
4  17004 days 22:17:13 
5  17004 days 22:17:14
6  17004 days 22:17:14
7  17004 days 22:17:14
8  17004 days 22:17:14

The second group starts from the previous index and takes 1/2 of the records in previous second.

7  17004 days 22:17:14
8  17004 days 22:17:14
9  17004 days 22:17:15
10 17004 days 22:17:15
11 17004 days 22:17:15
12 17004 days 22:17:15
13 17004 days 22:17:16
14 17004 days 22:17:16

Third group .....

13 17004 days 22:17:16
14 17004 days 22:17:16
15 17004 days 22:17:16
16 17004 days 22:17:16
17 17004 days 22:17:17
18 17004 days 22:17:17
19 17004 days 22:17:17
20 17004 days 22:17:17

If I do groupby on index,

  dfgroup=df.groupby(df.index)

this gives one group per second. What would be the best way to merge these groups?

No, since there is missing data. These are actually samples collected per second. The actual data has 100 samples per second but sometimes there is a missing entry. — AMisra, Aug 10 '16 at 21:40
Examples in your question are not consistent. Second group takes 2 rows from 22:17:14, but third group does not take any from 22:17:15. First group takes all values from 22:17:15, but second group does not take all values from 22:17:16. Can you please clarify? I am thinking, what you need is all values from a given second + last 2 values from previous second + first 2 values from next second. Correct? — Kartik, Aug 10 '16 at 21:47
Each group consists of a 2 second window, but it moves in an overlapping way, where only half of the entries of a previous second go in next set. Hence take 1/2 entries from previous second + current second + 1/2 entries from next second. Third group takes 1/2 of the entries of previous second i.e entry number 13 and 14 are repeated giving 13-20 — AMisra, Aug 10 '16 at 21:52

Kartik · Accepted Answer · 2016-08-10T23:33:06.053

1

Here is a technique:

import numpy as np # if you have not already done this

grouped = df.groupby(df.index)

for name, group in grouped:
    try:
        prev_sec = df.loc[(name - pd.to_timedelta(1, unit='s')), :]
    except KeyError:
        prev_sec = pd.DataFrame(columns=group.columns)
    try:
        next_sec = df.loc[(name + pd.to_timedelta(1, unit='s')), :]
    except KeyError:
        next_sec = pd.DataFrame(columns=group.columns)
    Pn = 2 # replace this with int(len(prev_sec)/2) to get half rows from previous second
    Nn = 2 # replace this with int(len(next_sec)/2) to get half rows from next second
    group = pd.concat([prev_sec.iloc[-Pn:,:], group, next_sec.iloc[:Nn,:]])

    # Replace the below lines with your operations
    print(name, group)

edited Aug 10 '16 at 23:33

answered Aug 10 '16 at 21:55

Kartik

8,347
39
73

I get an error at this line prev_sec = df[df['header_sec'].isin(name - pd.to_timedelta(1, unit='s'))] since "header sec" is the index and not a column. I tried replacing it to df[df[df.index].isin(name - pd.to_timedelta(1, unit='s'))] but still gives a key error "TimedeltaIndex(['17004 days 22:17:13', '17004 days 22:17:13',\n '17004 days 22:17:13', '17004 days ..... not in index" – AMisra Aug 10 '16 at 22:43
Edited my answer. You need `df[df.index.isin(...)]`. – Kartik Aug 10 '16 at 23:06
I tried that too but it still gives an error only list-like objects are allowed to be passed to isin(), you passed a [Timedelta]. This is because my index is already timedelta. This is what I do before groupby df.loc[:,['header_stamp_secs_x']] = pd.to_timedelta(df.header_stamp_secs_x, unit='s') df.set_index('header_stamp_secs_x', inplace=True) – AMisra Aug 10 '16 at 23:19
Oh yeah... That. Sorry for the oversight. The name is just a single entity, not a list. My mistake, I sincerely apologize for the back and forth. The edit to the answer should work. – Kartik Aug 10 '16 at 23:22
sorry for being novice with dataframes, but it did not work. I think because my index is timedelta and it says only list-like objects are allowed to to be passed to isin(), you passed a [timedelta64] print type(name - np.timedelta64(1, 's')) – AMisra Aug 10 '16 at 23:31
No, you don't have to apologize for anything. I was too hasty in answering your question, and made mistakes. The edited answer has been tested and should work. – Kartik Aug 10 '16 at 23:34

groupby with overlapping intervals timeseries

1 Answers1

Linked