2

I have a time series in python pandas dataframe object and I want to create a group based on index but I want overlapping groups i.e groups are not distinct. The header_sec is the index column. Each groups consists of a 2 second window. Input dataFrame

    header_sec
1  17004 days 22:17:13 
2  17004 days 22:17:13 
3  17004 days 22:17:13 
4  17004 days 22:17:13 
5  17004 days 22:17:14
6  17004 days 22:17:14
7  17004 days 22:17:14
8  17004 days 22:17:14
9  17004 days 22:17:15
10 17004 days 22:17:15
11 17004 days 22:17:15
12 17004 days 22:17:15
13 17004 days 22:17:16
14 17004 days 22:17:16
15 17004 days 22:17:16
16 17004 days 22:17:16
17 17004 days 22:17:17
18 17004 days 22:17:17
19 17004 days 22:17:17
20 17004 days 22:17:17

My first group should have

1  17004 days 22:17:13 
2  17004 days 22:17:13 
3  17004 days 22:17:13 
4  17004 days 22:17:13 
5  17004 days 22:17:14
6  17004 days 22:17:14
7  17004 days 22:17:14
8  17004 days 22:17:14

The second group starts from the previous index and takes 1/2 of the records in previous second.

7  17004 days 22:17:14
8  17004 days 22:17:14
9  17004 days 22:17:15
10 17004 days 22:17:15
11 17004 days 22:17:15
12 17004 days 22:17:15
13 17004 days 22:17:16
14 17004 days 22:17:16

Third group .....

13 17004 days 22:17:16
14 17004 days 22:17:16
15 17004 days 22:17:16
16 17004 days 22:17:16
17 17004 days 22:17:17
18 17004 days 22:17:17
19 17004 days 22:17:17
20 17004 days 22:17:17

If I do groupby on index,

  dfgroup=df.groupby(df.index)

this gives one group per second. What would be the best way to merge these groups?

AMisra
  • 1,869
  • 2
  • 25
  • 45
  • Are there always going to be 4 entries per second? – SO44 Aug 10 '16 at 21:29
  • No, since there is missing data. These are actually samples collected per second. The actual data has 100 samples per second but sometimes there is a missing entry. – AMisra Aug 10 '16 at 21:40
  • Examples in your question are not consistent. Second group takes 2 rows from 22:17:14, but third group does not take any from 22:17:15. First group takes all values from 22:17:15, but second group does not take all values from 22:17:16. Can you please clarify? I am thinking, what you need is all values from a given second + last 2 values from previous second + first 2 values from next second. Correct? – Kartik Aug 10 '16 at 21:47
  • Each group consists of a 2 second window, but it moves in an overlapping way, where only half of the entries of a previous second go in next set. Hence take 1/2 entries from previous second + current second + 1/2 entries from next second. Third group takes 1/2 of the entries of previous second i.e entry number 13 and 14 are repeated giving 13-20 – AMisra Aug 10 '16 at 21:52

1 Answers1

1

Here is a technique:

import numpy as np # if you have not already done this

grouped = df.groupby(df.index)

for name, group in grouped:
    try:
        prev_sec = df.loc[(name - pd.to_timedelta(1, unit='s')), :]
    except KeyError:
        prev_sec = pd.DataFrame(columns=group.columns)
    try:
        next_sec = df.loc[(name + pd.to_timedelta(1, unit='s')), :]
    except KeyError:
        next_sec = pd.DataFrame(columns=group.columns)
    Pn = 2 # replace this with int(len(prev_sec)/2) to get half rows from previous second
    Nn = 2 # replace this with int(len(next_sec)/2) to get half rows from next second
    group = pd.concat([prev_sec.iloc[-Pn:,:], group, next_sec.iloc[:Nn,:]])

    # Replace the below lines with your operations
    print(name, group)
Kartik
  • 8,347
  • 39
  • 73
  • I get an error at this line prev_sec = df[df['header_sec'].isin(name - pd.to_timedelta(1, unit='s'))] since "header sec" is the index and not a column. I tried replacing it to df[df[df.index].isin(name - pd.to_timedelta(1, unit='s'))] but still gives a key error "TimedeltaIndex(['17004 days 22:17:13', '17004 days 22:17:13',\n '17004 days 22:17:13', '17004 days ..... not in index" – AMisra Aug 10 '16 at 22:43
  • Edited my answer. You need `df[df.index.isin(...)]`. – Kartik Aug 10 '16 at 23:06
  • I tried that too but it still gives an error only list-like objects are allowed to be passed to isin(), you passed a [Timedelta]. This is because my index is already timedelta. This is what I do before groupby df.loc[:,['header_stamp_secs_x']] = pd.to_timedelta(df.header_stamp_secs_x, unit='s') df.set_index('header_stamp_secs_x', inplace=True) – AMisra Aug 10 '16 at 23:19
  • Oh yeah... That. Sorry for the oversight. The name is just a single entity, not a list. My mistake, I sincerely apologize for the back and forth. The edit to the answer should work. – Kartik Aug 10 '16 at 23:22
  • sorry for being novice with dataframes, but it did not work. I think because my index is timedelta and it says only list-like objects are allowed to to be passed to isin(), you passed a [timedelta64] print type(name - np.timedelta64(1, 's')) – AMisra Aug 10 '16 at 23:31
  • No, you don't have to apologize for anything. I was too hasty in answering your question, and made mistakes. The edited answer has been tested and should work. – Kartik Aug 10 '16 at 23:34