6

I have data in pandas dataframe with 1 minute time step. This data is not recorded continuously, now I would like to split all my data into separate event based on the condition below : If there is continuous data recorded for 5min or more then only it is considered as a event and for such event data need to extracted separately. Is there a way to implement this in pandas dataframe.

My data look likes this (with the Event column as result):

Date                  X    Event
2017-06-06 01:08:00 0.019   1
2017-06-06 01:09:00 0.005   1
2017-06-06 01:10:00 0.03    1
2017-06-06 01:11:00 0.005   1
2017-06-06 01:12:00 0.003   1
2017-06-06 01:13:00 0.001   1
2017-06-06 01:14:00 0.039   1
2017-06-06 01:15:00 0.003   1
2017-06-06 01:17:00 0.001   nan
2017-06-06 01:25:00 0.006   nan
2017-06-06 01:26:00 0.006   nan
2017-06-06 01:27:00 0.032   nan
2017-06-06 01:29:00 0.013   2
2017-06-06 01:30:00 0.065   2
2017-06-06 01:31:00 0.013   2
2017-06-06 01:32:00 0.001   2
2017-06-06 01:33:00 0.02    2
2017-06-06 01:38:00 0.05    nan
2017-06-06 01:40:00 0.025   3
2017-06-06 01:41:00 0.01    3
2017-06-06 01:42:00 0.008   3
2017-06-06 01:43:00 0.009   3
2017-06-06 01:44:00 0.038   3
2017-06-06 01:45:00 0.038   3

Your suggestion is highly appreciated.

Using the solution provided by nnnmmm, result looks like this

2015-01-01 03:24:00       NaN
2015-01-01 04:59:00       NaN
2015-01-01 05:01:00       NaN
2015-01-01 05:02:00       NaN
2015-01-01 05:03:00       NaN
2015-01-13 01:12:00       1.0
2015-01-13 01:13:00       1.0
2015-01-13 01:14:00       1.0
2015-01-13 01:15:00       1.0
2015-01-13 01:16:00       1.0
2015-01-13 01:49:00       1.0
2015-01-13 01:50:00       1.0
2015-01-13 01:51:00       1.0
2015-01-13 01:52:00       1.0
2015-01-13 01:53:00       1.0
2015-01-13 01:54:00       1.0
2015-01-13 01:55:00       1.0

In this case, there is time change between 01:16:00 and 01:49:00, it shouldn't consider it as a same event rather 01:49:00 should be the second event.

PUJA
  • 639
  • 1
  • 8
  • 18
  • Are there pre-defined boundaries for the 5min interval ? For example, what should happen if your data started with 01:02:00 instead of 01:08:00 ? – PaW Feb 10 '18 at 14:55
  • 2
    Event 3 actually shouldn't be an Event if I understand your intent correctly. There is a jump between 40 and 42. – nnnmmm Feb 10 '18 at 15:11
  • @PW, No there is no any predefined boundaries and sometimes I might need different boundary as well. sorry there shouldn't be jump between 40 and 42 otherwise it shouldn't be counted as an event. – PUJA Feb 10 '18 at 20:44
  • 1
    There are several pandas solutions that use some combo of a `diff`-like operation and `cumsum` as a trick to find consecutive values although I didn't find anything that exactly answers the question here. See for example https://stackoverflow.com/questions/40802800/pandas-dataframe-how-to-groupby-consecutive-values or https://stackoverflow.com/questions/14358567/finding-consecutive-segments-in-a-pandas-data-frame – JohnE Feb 12 '18 at 12:56

3 Answers3

5

This is a bit rough (and not very concise), but you could do something like the following (you could of course make it more concise and leave out the intermediate variables, but I left them in here to make it easier to understand what's happening).

df['new'] = (df.reset_index().Date.diff() == pd.Timedelta('1min')).astype(int).values
df['grp'] = (df.new != 1).cumsum()
df['cnt'] = df.groupby('grp')['new'].transform(size)

df['event'] = df['cnt'] > 4
df['Event'] = ((df.event) & (df.new != 1)).cumsum()
df['Event'] = np.where( df.event, df.Event, np.nan )

                         X  new  grp  cnt  event  Event
Date                                                   
2017-06-06 01:08:00  0.019    0    1    8   True    1.0
2017-06-06 01:09:00  0.005    1    1    8   True    1.0
2017-06-06 01:10:00  0.030    1    1    8   True    1.0
2017-06-06 01:11:00  0.005    1    1    8   True    1.0
2017-06-06 01:12:00  0.003    1    1    8   True    1.0
2017-06-06 01:13:00  0.001    1    1    8   True    1.0
2017-06-06 01:14:00  0.039    1    1    8   True    1.0
2017-06-06 01:15:00  0.003    1    1    8   True    1.0
2017-06-06 01:17:00  0.001    0    2    1  False    NaN
2017-06-06 01:25:00  0.006    0    3    3  False    NaN
2017-06-06 01:26:00  0.006    1    3    3  False    NaN
2017-06-06 01:27:00  0.032    1    3    3  False    NaN
2017-06-06 01:29:00  0.013    0    4    5   True    2.0
2017-06-06 01:30:00  0.065    1    4    5   True    2.0
2017-06-06 01:31:00  0.013    1    4    5   True    2.0
2017-06-06 01:32:00  0.001    1    4    5   True    2.0
2017-06-06 01:33:00  0.020    1    4    5   True    2.0
2017-06-06 01:38:00  0.050    0    5    1  False    NaN
2017-06-06 01:40:00  0.025    0    6    6   True    3.0
2017-06-06 01:41:00  0.010    1    6    6   True    3.0
2017-06-06 01:42:00  0.008    1    6    6   True    3.0
2017-06-06 01:43:00  0.009    1    6    6   True    3.0
2017-06-06 01:44:00  0.038    1    6    6   True    3.0
2017-06-06 01:45:00  0.038    1    6    6   True    3.0
JohnE
  • 29,156
  • 8
  • 79
  • 109
3

There is probably a more elegant way to solve this, but this should do the trick:

# True where the previous timestamp is one minute away
prev_ok = pd.Series(df['Date'].diff().values == np.timedelta64(1, 'm'))
# True where the previous four rows of prev_ok are True
a = prev_ok.rolling(4).sum() == 4
# extend True back down the previous four rows, this could be done with a loop
b = a | a.shift(-1) | a.shift(-2) | a.shift(-3) | a.shift(-4)
# calculate edges from False to True to get the event indices
c = (~a.shift(-3).fillna(False) & a.shift(-4)).cumsum()
# only display event indices where b is 
df['Event'] = c.mask(~b)

The output is

                  Date      X  Event
0  2017-06-06 01:08:00  0.019    1.0
1  2017-06-06 01:09:00  0.005    1.0
2  2017-06-06 01:10:00  0.030    1.0
3  2017-06-06 01:11:00  0.005    1.0
4  2017-06-06 01:12:00  0.003    1.0
5  2017-06-06 01:13:00  0.001    1.0
6  2017-06-06 01:14:00  0.039    1.0
7  2017-06-06 01:15:00  0.003    1.0
8  2017-06-06 01:17:00  0.001    NaN
9  2017-06-06 01:25:00  0.006    NaN
10 2017-06-06 01:26:00  0.006    NaN
11 2017-06-06 01:27:00  0.032    NaN
12 2017-06-06 01:29:00  0.013    2.0
13 2017-06-06 01:30:00  0.065    2.0
14 2017-06-06 01:31:00  0.013    2.0
15 2017-06-06 01:32:00  0.001    2.0
16 2017-06-06 01:33:00  0.020    2.0
17 2017-06-06 01:38:00  0.050    NaN
18 2017-06-06 01:40:00  0.025    NaN
19 2017-06-06 01:42:00  0.010    NaN
20 2017-06-06 01:43:00  0.008    NaN
21 2017-06-06 01:44:00  0.009    NaN
22 2017-06-06 01:45:00  0.038    NaN

Fun fact: The calculation of a is basically an erosion of a 1-D image with a length 4 structuring element, and the calculation of b is a dilation with the same structuring element. Taken together, the computation of b from prev_ok is an opening, i.e. what was True in prev_ok is True in b only if that True is part of a consecutive group of five Trues.

nnnmmm
  • 7,964
  • 4
  • 22
  • 41
  • First 2 lines work fine for you? I have to wrap `prev_ok` in `pd.Series` as numpy array doesn't have `rolling` method – Georgy Feb 10 '18 at 16:08
  • You're right, I edited my answer. Originally, I did `df['prev_ok'] = ...` but then decided I didn't want to clutter up the final DataFrame. – nnnmmm Feb 10 '18 at 16:10
  • You've come up with a very interesting solution! But I hope there is a better way to do it :) – Georgy Feb 10 '18 at 16:17
  • @nnnmmm, thanks for interesting solution, but I found it's not working well as expected, I have added one example where it didn't work (see in my question), what do you think on this ? – PUJA Feb 10 '18 at 21:15
  • You're right, I need to do the index calculation before the dilation. Edited my answer. – nnnmmm Feb 11 '18 at 09:50
2

I may give a similar example, where you can improvise,..but without any external library.

For example, in association with your time-interval data, we have a simple list

x = [1,2,3,4,6,7,8,9,10,11,12,14,15,17,18,19,20,21,22,23,24,25,28,30,32,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]

We will denote the data True if it is "continuous" for at least 5 terms.

To do this,first we obtain the difference list, and assign 1 if the difference is 1, or assign 0 if otherwise. This is done by y = [int(round(1/(x[i]-x[i-1]))) for i in range(1, len(x))].

After this, we will get the index where the difference is assign 0, and use it to check your condition.

Full code :

import copy

x = [1,2,3,4,6,7,8,9,10,11,12,14,15,17,18,19,20,21,22,23,24,25,28,30,32,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50]
y = [int(round(1/(x[i]-x[i-1]))) for i in range(1, len(x))]
z = copy.deepcopy(y);


zeros_check = [abs(y[i]-1)*i for i in range(0,len(y))];
zeros_id = list(set(zeros_check));
zeros_id.remove(0);
zeros_id.append(len(y));


idx = 0;
for i in zeros_id:
   if sum(y[idx+1:i])>=5:
      z[idx+1:i] = [True for i in range(idx,idx+i-1)];
   else:
      z[idx:i+1] = [False for i in range(idx,idx+i+1)];
   idx = i;

for i,j,k in zip(x,y,z):
     print(i,j,k)

Output :

1 1 False
2 1 False
3 1 False
4 0 False
6 1 True
7 1 True
8 1 True
9 1 True
10 1 True
11 1 True
12 0 False
14 1 False
15 0 False
17 1 True
18 1 True
19 1 True
20 1 True
21 1 True
22 1 True
23 1 True
24 1 True
25 0 False
28 0 False
30 0 False
32 0 False
34 1 True
35 1 True
36 1 True
37 1 True
38 1 True
39 1 True
40 1 True
41 1 True
42 1 True
43 1 True
44 1 True
45 1 True
46 1 True
47 1 True
48 1 True
49 1 True