I'm trying to check for the sequence of B-B-B in the dataframe.

d = {'A': ['A','B','C','D','B','B','B','A','A','E','F','B','B','B','F','A','A']}
testdf = pd.DataFrame(data=d)

array = []
seq = pd.Series(['B', 'B', 'B'])

for i in testdf.index:
    
    if testdf.A[i:len(seq)] == seq:
        
        array.append(testdf.A[i:len(seq)+1])

I get an error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I get it working? I don't understand what's "ambiguous" about this code.

My desired output here is:

A, F
stanvooz
  • 1
    I'm confused, how do you get `A,F`? – Umar.H Jul 17 '20 at 17:48
  • 1
    @Manakin I'm looking for the letter after the three B's occurring. B-B-B-A, B-B-B-F – stanvooz Jul 17 '20 at 17:50
  • Do you really need to use a DataFrame and Series (is this an example, or a more complex case?), or could we do it another way? – azro Jul 17 '20 at 17:52
  • @azro It has to be a dataframe, yes, it's a simplified sample from a bigger project – stanvooz Jul 17 '20 at 17:53
  • 1
    It seems more like sub-string pattern matching... Take a look at the KMP algorithm. And for the error part use `(testdf.A[i:i+len(seq)] == seq).all()` because `testdf.A[i:i+len(seq)] == seq` would give a boolean Series. – Ch3steR Jul 17 '20 at 18:04

2 Answers

  1. The ambiguity arises because testing two Series for equality (they must be the same size) compares them element by element and returns a Series of True/False values, not a single boolean. You then have to say how to reduce it to one value, for example with .all() (every pair matches) or .any() (at least one pair matches):

    s1 = pd.Series(['B', 'B', 'B'])
    s2 = pd.Series(['A', 'B', 'B'])
    
    print(s1 == s2)
    0    False
    1     True
    2     True
    dtype: bool
    
    print((s1 == s2).all())
    False
    
  2. To access a subsequence by position, prefer .iloc

  3. You need to use [i:i + len(seq)] and not [i:len(seq)], because slices use [from:to] notation

  4. You need to use Series.reset_index(drop=True), because two Series must have the same index to be compared. Since seq is always indexed 0, 1, 2, the subsequence you extract needs the same index (testdf.A.iloc[1:4], for example, is indexed 1, 2, 3)

  5. Check the length before comparing the Series, to avoid an exception near the end, where the subsequence is shorter than seq

You end up with:

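import pandas as pd

values = {'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A', 'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']}
testdf = pd.DataFrame(values)
array = []
seq = pd.Series(['B', 'B', 'B'])
for i in testdf.index:
    # Take a positional window and relabel it 0..n-1 so it lines up with seq
    test_seq = testdf.A.iloc[i:i + len(seq)].reset_index(drop=True)
    # Skip short windows at the end, then require every element to match
    if len(test_seq) == len(seq) and (test_seq == seq).all():
        # Keep the value that follows the matched window
        array.append(testdf['A'].iloc[i + len(seq)])
print(array)  # ['A', 'F']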
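
Point 4 can be checked in isolation. The sketch below uses a shortened copy of the column (the names col, window, raised, aligned and by_position are only for illustration) and shows that a window taken with .iloc keeps its original labels, that comparing it directly with seq raises, and that either resetting the index or comparing the underlying NumPy arrays avoids the problem. Comparing by position with to_numpy() is an alternative to reset_index, not something the code above relies on.

```python
import pandas as pd

# Shortened copy of the column from the question (illustrative only)
col = pd.Series(['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A'])
seq = pd.Series(['B', 'B', 'B'])

# A positional slice keeps the original labels: 4, 5, 6
window = col.iloc[4:7]
print(list(window.index))  # [4, 5, 6]

# Comparing Series whose labels differ raises instead of comparing by position
try:
    window == seq
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# Option 1: relabel the window 0..n-1 so it matches seq's index
aligned = (window.reset_index(drop=True) == seq).all()

# Option 2: drop the labels entirely and compare the raw arrays
by_position = (window.to_numpy() == seq.to_numpy()).all()

print(aligned, by_position)  # True True
```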
azro

Instead of iterating over every row in the DataFrame, we can iterate over the much smaller sequence (much better when len(seq) << len(df)). Use shift + np.logical_and.reduce to build a mask marking where the sequence ends in the DataFrame. Then roll that mask by one position to select the row after each match, which holds the values you want. (Modified slightly from my related answer here)

import numpy as np

def find_next_row(seq, df, col):
    seq = seq[::-1]  # to get last index
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])

    m = np.roll(m, 1)
    m[0] = False  # Don't wrap around
    
    return df.loc[m]
    # return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#    A
#7   A
#14  F

If you just want the list and don't care for the DataFrame, change the return to what's currently commented out: return df.loc[m, col].tolist()

find_next_row(['B', 'B', 'B'], testdf, col='A')
#['A', 'F']
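
Two details of the function are easy to miss: the sequence is reversed so that shift(i) lines up with the i-th element counted from the end, and m[0] = False stops np.roll from wrapping a match on the last row around to the first row. The sketch below repeats the list-returning variant so that it runs on its own, then tries it on a mixed pattern and on a small hypothetical frame (tail) where the pattern ends on the last row.

```python
import numpy as np
import pandas as pd

def find_next_row(seq, df, col):
    seq = seq[::-1]  # reversed so shift(i) matches the i-th element from the end
    m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])
    m = np.roll(m, 1)  # move each match one row down, onto the following row
    m[0] = False       # a match on the last row must not wrap to row 0
    return df.loc[m, col].tolist()

testdf = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'A', 'A',
                             'E', 'F', 'B', 'B', 'B', 'F', 'A', 'A']})

# A non-uniform pattern: E-F-B occurs at rows 9-11, so the next value is row 12
mixed = find_next_row(['E', 'F', 'B'], testdf, 'A')
print(mixed)  # ['B']

# Hypothetical frame where B-B-B ends on the last row: nothing follows it
tail = pd.DataFrame({'A': ['X', 'B', 'B', 'B']})
at_end = find_next_row(['B', 'B', 'B'], tail, 'A')
print(at_end)  # []
```

Without the m[0] = False line, the second call would return ['X'], the first row of the frame, which does not follow the pattern at all.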
ALollz