Pandas retain only a specified sub-sequence (groupby order retain subsequence)

Question

Given activity logs in a Pandas data-frame which consists of an "id" going through a sequence of "action" at specified "timestamp" - I want to retain rows corresponding to a specified sequence of actions.

For e.g. Input data

import pandas as pd 
# Create a sample data-frame from a dictionary 
id = ['A123', 'A123', 'A123', 'A123', 'A123', 'A123', 'A234', 'A234', 'A234', 'A234', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341']
action = ['A', 'B', 'C', 'D', 'B', 'A', 'B', 'A', 'C', 'D', 'D', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'B', 'C']
timestamp = ['1', '2', '3', '4', '5', '6', '1', '2', '3', '4', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
the_dict = {'id': id, 'action': action, 'timestamp': timestamp}
# This is the sample data-frame with columns:
# id    action    timestamp
# Each id when ordered by timestamp then action gives the sequence of actions taken by the id
dataFrame = pd.DataFrame(the_dict)
######################################
# Input data
######################################
#      id action timestamp
#0   A123      A         1
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#4   A123      B         5
#5   A123      A         6
#6   A234      B         1
#7   A234      A         2
#8   A234      C         3
#9   A234      D         4
#10  A341      D         1
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#14  A341      A         5
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8
#18  A341      B         9
#19  A341      C        10

# The sequence of interest
the_sequence = ['B', 'C', 'D']

# Desired output: Group by id, order by timestamp, return all rows which match the given sequence of actions
######################################
# The output data-frame:
######################################
#      id action timestamp
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8

David Erickson · Answer 1 · 2020-07-22T04:17:27.260

2

You can use .shift logic for A, B, and C. Basically, you are checking for A rows that have B and C in the next rows. That will return the A's. Then, follow a similar protocol for B and C.

df = (df[df.groupby('id')['action'].
     apply(lambda x:
           (x == 'B') & (x.shift(-1) == 'C') & (x.shift(-2) == 'D') |
           (x == 'C') & (x.shift(1) == 'B') & (x.shift(-1) == 'D') |
           (x == 'D') & (x.shift(2) == 'B') & (x.shift(1) == 'C'))])
df

output:

    id      action  timestamp
1   A123    B       2
2   A123    C       3
3   A123    D       4
11  A341    B       2
12  A341    C       3
13  A341    D       4
15  A341    B       6
16  A341    C       7
17  A341    D       8

edited Jul 22 '20 at 04:17

answered Jul 22 '20 at 00:06

David Erickson

16,433
2
19
35

1

It is better do groupby + shift ~ – BENY Jul 22 '20 at 00:10
I accidentally did A, B, C and will update later, but it’s the same exact logic so just sub ABC with BCD. – David Erickson Jul 22 '20 at 00:15
You still need groupby – BENY Jul 22 '20 at 00:29
Thanks @Yorben_s , I saw your comment. Do you mean there is something wrong with the logic, or it would be just much cleaner with a .groupby? I will test with a .groupby when I’m back from the gym :) – David Erickson Jul 22 '20 at 00:51
@ALollz gotchya, I missed that part of the requirement. I’ll update my answer when I get home. – David Erickson Jul 22 '20 at 01:30
@YOBEN_S , I think this is what you were suggesting. Thanks for pointing this out to me. See updated answer with .groupby – David Erickson Jul 22 '20 at 04:21

score 1 · Answer 2 · answered Jul 22 '20 at 00:09

We can do cumsum + str.contains

m=df.groupby('id').action.apply(lambda x : (x+',').cumsum()).str.contains('B,C,D')
nedf=df[m]
nedf
      id action timestamp
3   A123      D         4
4   A123      B         5
5   A123      A         6
13  A341      D         4
14  A341      A         5
15  A341      B         6
16  A341      C         7
17  A341      D         8
18  A341      B         9
19  A341      C        10

ALollz · Accepted Answer · 2020-07-22T01:08:40.487

If you need to find a sequence you can use np.logical_and.reduce + shift in a list comprehension, similar to my answer here. In this case there is also a grouping to consider, but given your sorting that can taken care of with shift.

The idea here is to find all rows that are equal to the first element in sequence. Then with shift we check if the row after is equal to the second element (and ensure it's in the same group). m will give us all the indices where the sequence ends, so we can form the mask with that to slice the original DataFrame.

import numpy as np

def find_seq_within_group(df, seq, seq_col, gp_col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[seq_col].shift(i).eq(seq[i]) & df[gp_col].shift(i).eq(df[gp_col]) 
                               for i in range(len(seq))])
    
    # Return entire sequence
    m = np.logical_or.reduce([np.roll(m, -i) for i in range(len(seq))])
    
    return df.loc[m]

# df = df.sort_values(['id', 'timestamp'])
find_seq_within_group(df=df, seq=['B', 'C', 'D'], seq_col='action', gp_col='id')

      id action timestamp
1   A123      B         2
2   A123      C         3
3   A123      D         4
11  A341      B         2
12  A341      C         3
13  A341      D         4
15  A341      B         6
16  A341      C         7
17  A341      D         8

rahul · Answer 4 · 2020-07-22T00:25:24.490

0

This might help as well:

sequence=['A','B','C','D']
n=len(sequence)
for i in  range(dataFrame.shape[0]):
    if(list(dataFrame['action'][i:i+n].values)==sequence):
        print ("the sequence starts at",i)
    else:
        continue

edited Jul 22 '20 at 00:25

answered Jul 22 '20 at 00:04

rahul

70
9

Pandas retain only a specified sub-sequence (groupby order retain subsequence)

4 Answers4