How to check for a sequence of string values in pandas dataframe and output the subsequent

Question

I'm trying to check for the sequence of B-B-B in the dataframe.

d = {'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']}
testdf = pd.DataFrame(data=d)

array = []
seq = pd.Series(['B', 'B', 'B'])

for i in testdf.index:
    
    if testdf.A[i:len(seq)] == seq:
        
        array.append(testdf.A[i:len(seq)+1])

I get an error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I get it working? I don't understand what's "ambiguous" about this code

My desired output here is:

A, F

@Manakin I'm looking for the letter after the three B's occurring. B-B-B-A, B-B-B-F — stanvooz, Jul 17 '20 at 17:50
Do you really need to use a Dataframe and Series ? (if that is an example or a more complex case), or could we do it other way ? — azro, Jul 17 '20 at 17:52
@azro It has to be a dataframe, yes, it's a simplified sample from a bigger project — stanvooz, Jul 17 '20 at 17:53
It seems more like sub-string pattern matching...Take a look at KMP algorithm. And for the error part use `(testdf.A[i:i+len(seq)] == seq).all()` because `testdf.A[i:i+len(seq)] == seq` would give a boolean numpy array. — Ch3steR, Jul 17 '20 at 18:04

azro · Accepted Answer · 2020-07-17T18:17:48.770

The "ambiguous" error arises because testing two Series for equality (they must be the same size) compares them element by element and returns a Series of True/False values, not a single boolean. An if statement needs one boolean, so you have to decide how to collapse the result: all true, at least one true, and so on, using .all(), .any(), ...
```
s1 = pd.Series(['B', 'B', 'B'])
s2 = pd.Series(['A', 'B', 'B'])

print(s1 == s2)
0    False
1     True
2     True
dtype: bool

print((s1 == s2).all())
False
```
To access a subsequence by position, prefer .iloc.
You need to use [i:i + len(seq)] and not [i:len(seq)], because a slice is written [from:to], not [from:length].
You need to use Series.reset_index(drop=True), because two Series must have the same index to be compared. seq is always indexed 0, 1, 2, so the subsequence you compute needs the same index (testdf.A.iloc[1:4], for example, is indexed 1, 2, 3).
Verify the length before comparing the Series, to avoid an exception at the end of the column where the subsequence is shorter than seq.
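
The index point above is easy to check in isolation. Here is a minimal sketch (the names col and window are only for illustration, they are not from the question): comparing a positional slice directly against seq fails because the labels differ, while the re-indexed slice compares fine.

```python
import pandas as pd

seq = pd.Series(['B', 'B', 'B'])            # labels 0, 1, 2
col = pd.Series(['A', 'B', 'B', 'B', 'F'])  # illustrative data, not the question's

window = col.iloc[1:4]                      # same values as seq, but labels 1, 2, 3

# Comparing Series whose labels differ raises a ValueError in pandas
try:
    window == seq
    label_error = None
except ValueError as exc:
    label_error = exc

# After resetting the index, both sides are labelled 0, 1, 2
matches = (window.reset_index(drop=True) == seq).all()
print(label_error is not None, matches)
```

If you would rather not touch the index, comparing window.to_numpy() with seq.to_numpy() should also work, since NumPy arrays are compared by position only.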

You end up with:

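import pandas as pd

values = {'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A', 'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']}
testdf = pd.DataFrame(values)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in range(len(testdf)):  # iterate over positions, since .iloc is positional
    # Window of len(seq) rows, re-indexed 0..len(seq)-1 so it lines up with seq
    test_seq = testdf['A'].iloc[i:i + len(seq)].reset_index(drop=True)
    if len(test_seq) == len(seq) and (test_seq == seq).all():
        # Skip a match at the very end of the column: no row follows it
        if i + len(seq) < len(testdf):
            array.append(testdf['A'].iloc[i + len(seq)])
print(array)  # ['A', 'F']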

ALollz · Answer 2 · 2020-07-17T18:26:27.313

Instead of iterating over every row in the DataFrame, we can iterate over the much smaller sequence (much better when len(seq) << len(df)). Use shift + np.logical_and.reduce to build a mask that is True on the row where each occurrence of the sequence ends. Then roll that mask by one position to select the row after each occurrence, which holds the values you want. (Modified slightly from my related answer here)

import numpy as np

def find_next_row(seq, df, col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])

    m = np.roll(m, 1)
    m[0] = False  # Don't wrap around
    
    return df.loc[m]
    # return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#    A
#7   A
#14  F

If you just want the list and don't care for the DataFrame, change the return to what's currently commented out: return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#['A', 'F']
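
To see what the mask looks like at each step, here is a rough sketch that takes the same shift + np.logical_and.reduce idea apart on the question's data. The intermediate names (rev, lags, ends, next_mask) are mine and only for illustration.

```python
import numpy as np
import pandas as pd

testdf = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A',
                             'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']})
seq = ['B', 'B', 'B']

# Reverse the sequence so that lag 0 is compared with its last element
rev = seq[::-1]

# lags[i] is True on row r when the value i rows above r equals rev[i]
lags = [testdf['A'].shift(i).eq(rev[i]) for i in range(len(rev))]

# All lags True at once: row r is the last row of a full match
ends = np.logical_and.reduce(lags)
end_positions = np.flatnonzero(ends)

# Move every True down by one row to point at the row after the match
next_mask = np.roll(ends, 1)
next_mask[0] = False  # np.roll wraps around, so clear the first slot
next_positions = np.flatnonzero(next_mask)

print(end_positions.tolist())   # [6, 13]
print(next_positions.tolist())  # [7, 14]
```

Only len(seq) shifted comparisons are made, each over the whole column at once, which is why this scales better than a row-by-row loop when the sequence is short.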
