4

Given activity logs in a Pandas data-frame which consists of an "id" going through a sequence of "action" at specified "timestamp" - I want to retain rows corresponding to a specified sequence of actions.

For e.g. Input data

import pandas as pd 
# Create a sample data-frame from a dictionary 
id = ['A123', 'A123', 'A123', 'A123', 'A123', 'A123', 'A234', 'A234', 'A234', 'A234', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341', 'A341']
action = ['A', 'B', 'C', 'D', 'B', 'A', 'B', 'A', 'C', 'D', 'D', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'B', 'C']
timestamp = ['1', '2', '3', '4', '5', '6', '1', '2', '3', '4', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
the_dict = {'id': id, 'action': action, 'timestamp': timestamp}
# This is the sample data-frame with columns:
# id    action    timestamp
# Each id when ordered by timestamp then action gives the sequence of actions taken by the id
dataFrame = pd.DataFrame(the_dict)
######################################
# Input data
######################################
#      id action timestamp
#0   A123      A         1
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#4   A123      B         5
#5   A123      A         6
#6   A234      B         1
#7   A234      A         2
#8   A234      C         3
#9   A234      D         4
#10  A341      D         1
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#14  A341      A         5
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8
#18  A341      B         9
#19  A341      C        10

# The sequence of interest
the_sequence = ['B', 'C', 'D']

# Desired output: Group by id, order by timestamp, return all rows which match the given sequence of actions
######################################
# The output data-frame:
######################################
#      id action timestamp
#1   A123      B         2
#2   A123      C         3
#3   A123      D         4
#11  A341      B         2
#12  A341      C         3
#13  A341      D         4
#15  A341      B         6
#16  A341      C         7
#17  A341      D         8
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
sbs0202
  • 139
  • 3
  • 8

4 Answers4

2

You can use .shift logic for A, B, and C. Basically, you are checking for A rows that have B and C in the next rows. That will return the A's. Then, follow a similar protocol for B and C.

df = (df[df.groupby('id')['action'].
     apply(lambda x:
           (x == 'B') & (x.shift(-1) == 'C') & (x.shift(-2) == 'D') |
           (x == 'C') & (x.shift(1) == 'B') & (x.shift(-1) == 'D') |
           (x == 'D') & (x.shift(2) == 'B') & (x.shift(1) == 'C'))])
df

output:

    id      action  timestamp
1   A123    B       2
2   A123    C       3
3   A123    D       4
11  A341    B       2
12  A341    C       3
13  A341    D       4
15  A341    B       6
16  A341    C       7
17  A341    D       8
David Erickson
  • 16,433
  • 2
  • 19
  • 35
1

We can do cumsum + str.contains

m=df.groupby('id').action.apply(lambda x : (x+',').cumsum()).str.contains('B,C,D')
nedf=df[m]
nedf
      id action timestamp
3   A123      D         4
4   A123      B         5
5   A123      A         6
13  A341      D         4
14  A341      A         5
15  A341      B         6
16  A341      C         7
17  A341      D         8
18  A341      B         9
19  A341      C        10
BENY
  • 317,841
  • 20
  • 164
  • 234
1

If you need to find a sequence you can use np.logical_and.reduce + shift in a list comprehension, similar to my answer here. In this case there is also a grouping to consider, but given your sorting that can taken care of with shift.

The idea here is to find all rows that are equal to the first element in sequence. Then with shift we check if the row after is equal to the second element (and ensure it's in the same group). m will give us all the indices where the sequence ends, so we can form the mask with that to slice the original DataFrame.

import numpy as np

def find_seq_within_group(df, seq, seq_col, gp_col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[seq_col].shift(i).eq(seq[i]) & df[gp_col].shift(i).eq(df[gp_col]) 
                               for i in range(len(seq))])
    
    # Return entire sequence
    m = np.logical_or.reduce([np.roll(m, -i) for i in range(len(seq))])
    
    return df.loc[m]

# df = df.sort_values(['id', 'timestamp'])
find_seq_within_group(df=df, seq=['B', 'C', 'D'], seq_col='action', gp_col='id')

      id action timestamp
1   A123      B         2
2   A123      C         3
3   A123      D         4
11  A341      B         2
12  A341      C         3
13  A341      D         4
15  A341      B         6
16  A341      C         7
17  A341      D         8
ALollz
  • 57,915
  • 7
  • 66
  • 89
0

This might help as well:

sequence=['A','B','C','D']
n=len(sequence)
for i in  range(dataFrame.shape[0]):
    if(list(dataFrame['action'][i:i+n].values)==sequence):
        print ("the sequence starts at",i)
    else:
        continue
rahul
  • 70
  • 9