Find and match a sub-sequence to the values of a column in pandas dataframe

Question

I am interested to find out if:

a sequence of strings or numbers is contained as it is in a bigger or larger sequence of strings or numbers.

Following is a pandas dataframe with two columns: Id and Time. This dataframe is sorted beforehand by the values of Time.

import pandas as pd

label1 = ['422','422','422','428','428','453','453','453','453','453','421','421','421','421','421','422','422','422','424','424','424']
label2 = ['13:08','13:08','13:09','13:12','13:12','13:16','13:16','13:17','13:17','13:18','13:20','13:20','13:20','13:20','13:22', '13:23','13:24','13:24', '13:25','13:25','13:26']

d = {'Id':label1,'Time':label2}
df=pd.DataFrame(d)
df

The output df looks like the following:

In [4]: df
Out[4]: 
     Id   Time
0   422  13:08
1   422  13:08
2   422  13:09
3   428  13:12
4   428  13:12
5   453  13:16
6   453  13:16
7   453  13:17
8   453  13:17
9   453  13:18
10  421  13:20
11  421  13:20
12  421  13:20
13  421  13:20
14  421  13:22
15  422  13:23
16  422  13:24
17  422  13:24
18  424  13:25
19  424  13:25
20  424  13:26

What I have done so far. I have tried to generate a smaller dataframe as follows:

df["Id"] = df['Id'].astype('int')
bb1= df[df['Id'].diff(-1).ne(0)]
bb1

which has produced the following output:

In [59]: bb1
Out[59]: 
     Id   Time
2   422  13:09
4   428  13:12
9   453  13:18
14  421  13:22
17  422  13:24
20  424  13:26

bb dataframe contains the ids in the order they have appeared. They appear in the following order: S1=[422, 428,453,421,422,424].

Besides, there is a given sub-sequence which is S2=[421,422,424], which happens to be contained in S1.

I need to find if bb dataframe contain a sub-sequence of Ids as reflected in S2=[421, 422, 424]. The answer to which should be returned with the following output if the sub-sequence gets identified:

index Id Time

10 421 13:20 

14 421 13:22 

15 422 13:23 

17 422 13:24 

18 424 13:25 

20 424 13:26

The desired output contains the first and last time stamp and its associated index.

I would really appreciate your help.

S1 and S2 are denoted as sets which is unordered should they be a list? — Scott Boston, Jul 10 '18 at 13:37
Yes, they can be considered as a list. I'll make tho necessary changes. — Salil Sharma, Jul 10 '18 at 13:41
In your desire result, I think the `17 422 13:24` should be `16 422 13:24`? — YaOzI, Jul 10 '18 at 19:32
The desired output contains the first and last time stamp of an Id. In case of 422, the first time stamp is 13:23 (index:15) and similarly the last time stamp would be 13:24 (index:17). The desired output is constructed in that way. — Salil Sharma, Jul 11 '18 at 09:08

YaOzI · Accepted Answer · 2018-07-10T19:30:43.513

Working start from your bb1, the key is sub-sequence match, I found a solution here, and made a slightly modify to fit your situation:

S2 = [421,422,424]
N = len(S2)
# Sub-sequence matching
sub = (bb1.Id.rolling(window=N)
             .apply(lambda x: (x == S2).all(), raw=True)
             .mask(lambda x: x == 0)
             .bfill(limit=N))
print(sub)
# Output
2     NaN
4     NaN
9     1.0
14    1.0
17    1.0
20    1.0
Name: Id, dtype: float64

# And for final results
sub = sub[sub.eq(1)]
beg = sub.index[0] + 1
end = sub.index[-1]
res = df.loc[beg:end].drop_duplicates(keep='first')
print(res)
# Output
     Id   Time
10  421  13:20
14  421  13:22
15  422  13:23
16  422  13:24
18  424  13:25
20  424  13:26

Find and match a sub-sequence to the values of a column in pandas dataframe

1 Answers1