I have researched this, but found no answer to the question below.

How can I do a boolean comparison for a list of substrings in a list of strings?

Below is the code:

import pandas as pd

string = {'strings_1': ['AEAB', 'AC', 'AI'],
          'strings_2': ['BB', 'BA', 'AG'],
          'strings_3': ['AABD', 'DD', 'PP'],
          'strings_4': ['AV', 'AB', 'BV']}

df_string = pd.DataFrame(data = string)

substring_list = ['AA', 'AE']

for row in df_string.itertuples(index = False):
    combine_row_str = [row[0], row[1], row[2]]

    #below is the main operation
    print(all(substring in row_str for substring in substring_list for row_str in combine_row_str))

The output I get is:

False
False
False

The output I want is:

True
False
False
learner

2 Answers

Here's one way using pd.DataFrame.sum and a list comprehension:

df = pd.DataFrame(data=string)

lst = ['AA', 'AE']

df['test'] = [all(val in i for val in lst) for i in df.sum(axis=1)]

print(df)

  strings_1 strings_2 strings_3 strings_4   test
0      AEAB        BB      AABD        AV   True
1        AC        BA        DD        AB  False
2        AI        AG        PP        BV  False
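To make the intermediate step visible: summing string columns along axis=1 concatenates the cells of each row, and the substrings are then searched in that joined string. The sketch below builds the joined strings explicitly with ''.join instead of sum (a swapped-in but equivalent step for this data, assuming every cell is a string); the names joined and test are only illustrative.

```python
import pandas as pd

string = {'strings_1': ['AEAB', 'AC', 'AI'],
          'strings_2': ['BB', 'BA', 'AG'],
          'strings_3': ['AABD', 'DD', 'PP'],
          'strings_4': ['AV', 'AB', 'BV']}
df = pd.DataFrame(data=string)
lst = ['AA', 'AE']

# Join every cell of a row into one string (assumes all cells are strings).
joined = df.apply(''.join, axis=1)

# A row passes only if every substring occurs somewhere in its joined string.
test = [all(val in row for val in lst) for row in joined]

print(joined.tolist())
print(test)
```

Because the search runs on the joined string, a substring that spans the boundary between two cells also counts as a hit.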
jpp

Since you are using pandas, you can call apply row-wise and use str.contains with a regex to check whether the strings match. The first step is to find which values contain any of the strings in substring_list:

df_string.apply(lambda x: x.str.contains('|'.join(substring_list)), axis=1)

This returns:

   strings_1  strings_2  strings_3  strings_4
0       True      False       True      False
1      False      False      False      False
2      False      False      False      False

What is not clear, though, is whether you want to return True when both substrings are present within a row or when either of them is. If either is enough, you can simply add any() after the contains() call:

df_string.apply(lambda x: x.str.contains('|'.join(substring_list)).any(), axis=1)

This returns:

0     True
1    False
2    False
dtype: bool

For the second case, jpp provides a one-line solution that concatenates the row elements into one string, but note that it fails in corner cases: if a row contains two elements such as "BBA" and "ABB" and you try to match "AA", the concatenated string "BBAABB" will still match "AA", which is wrong. I would like to propose a solution with apply and a helper function, so that the code is more readable:

def areAllPresent(vals, patterns):
    # True only if every pattern occurs inside at least one value of the row
    return all(any(pat in val for val in vals) for pat in patterns)

df_string.apply(lambda x: areAllPresent(x.values, substring_list), axis=1)

With your sample DataFrame it returns the same result, but it also works for cases where both substrings must be present:

0     True
1    False
2    False
dtype: bool
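The corner case with "BBA" and "ABB" can be checked without pandas. This is a minimal sketch; concat_match and per_value_match are hypothetical helper names standing in for the concatenation approach and the per-value approach.

```python
def concat_match(vals, patterns):
    """Join the row into one string, then look for each pattern in it."""
    joined = "".join(vals)
    return all(pat in joined for pat in patterns)


def per_value_match(vals, patterns):
    """Require each pattern to occur inside at least one individual value."""
    return all(any(pat in val for val in vals) for pat in patterns)


row = ["BBA", "ABB"]
patterns = ["AA"]

print(concat_match(row, patterns))     # True: "BBAABB" contains "AA" across the cell boundary
print(per_value_match(row, patterns))  # False: neither "BBA" nor "ABB" contains "AA"
```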
user59271
  • Hey, thank you for the answer. What should I do if I want both 'AA' and 'AE' to be contained in each position? That is, a Boolean check of whether the cell at row 0, column 0 contains both the 'AA' and 'AE' substrings, whether the cell at row 0, column 1 contains both, and so on. – learner May 29 '18 at 19:04
  • I tried this, but it doesn't work: df_string.apply(lambda x: x.str.contains((?=substring_list), axis=1) – learner May 29 '18 at 19:20
  • You can do that with a regular expression that chains multiple lookahead groups: `expr = '(?=.*' + ')(?=.*'.join(substring_list) + ')'` followed by `df_string.apply(lambda x: x.str.contains(expr), axis=1)`. In your case the regular expression is _(?=.*AA)(?=.*AE)_. If you find my answer useful, please mark it as accepted, thanks :) – user59271 May 29 '18 at 19:42
  • Thank you. Why is it not (?=.*AA.*)(?=.*AE.*)? – learner May 30 '18 at 06:29
  • (?=.*AA.*)(?=.*AE.*) does the same thing; the trailing .* is redundant because ?= is a positive lookahead. It checks that the group matches but does not consume any characters, so the next group is matched from the same position in the string. Essentially you are reproducing an AND operator inside a regular expression. Have a look here: [Regular Expressions: Is there an AND operator?](https://stackoverflow.com/questions/469913/regular-expressions-is-there-an-and-operator?noredirect=1&lq=1) – user59271 May 30 '18 at 07:28
  • Thank you. I have looked at that post before. Sorry to go one step back: why do I need to use .* in the first place? Shouldn't the code return True, since 'AA' is a substring of 'AABD'? – learner May 30 '18 at 11:35
  • This comes down to the case where you want to match multiple substrings. When matching just one, like 'AA' in 'BAABD', [simply matching for AA works fine](https://pythex.org/?regex=AA&test_string=BAABD&ignorecase=0&multiline=0&dotall=0&verbose=0) – user59271 May 30 '18 at 12:38
  • With multiple substrings, a lookahead without .* only checks at the current position, so the next group is not tested against the whole 'BAABD'. If your substrings do not appear in the string in the same order as in the pattern, you will [get a false negative](https://pythex.org/?regex=(%3F%3DAA)(%3F%3D.*BA)&test_string=BAABD&ignorecase=0&multiline=0&dotall=0&verbose=0) – user59271 May 30 '18 at 12:39
  • So you have to use .* in your group so that you [get a correct match](https://pythex.org/?regex=(%3F%3D.*AA)(%3F%3D.*BA)&test_string=BAABD&ignorecase=0&multiline=0&dotall=0&verbose=0) – user59271 May 30 '18 at 12:41
  • After looking at this website (http://www.ocpsoft.org/tutorials/regular-expressions/and-in-regex/) and the Python documentation, I understand that (?=...) starts matching at whatever position it is in. So for the test string 'BAABD' with (?=AA)(?=.*BA), AA must be at the first position of the test string for it to match. Then, after the first (?=...) is evaluated, the match position is reset. – learner May 30 '18 at 13:18
  • Not exactly; it will still match AA and BA [if your string is 'AABABD', so that BA follows AA in order](https://pythex.org/?regex=(%3F%3DAA)(%3F%3D.*BA)&test_string=AABABD&ignorecase=0&multiline=0&dotall=0&verbose=0). As stated in the link you provided, 'look-aheads are place-sensitive, and begin matching from where they appear within the pattern'. So you have to add .* so that each look-ahead scans the whole test string, in case the substring in your second look-ahead appears before the first one in the test string. – user59271 Jun 01 '18 at 11:21
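The lookahead behaviour discussed in these comments can be reproduced with Python's re module alone. This is a sketch under two assumptions that are not in the thread: the helper name contains_all is hypothetical, and re.escape is added so that substrings containing regex metacharacters are treated literally.

```python
import re

substring_list = ["AA", "AE"]

# One lookahead per substring; the leading .* lets each lookahead scan the
# whole string instead of only the current position.
expr = "".join(f"(?=.*{re.escape(s)})" for s in substring_list)
print(expr)  # (?=.*AA)(?=.*AE)


def contains_all(text, pattern=expr):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(contains_all("AEAAB"))  # True: contains both "AE" and "AA"
print(contains_all("AABD"))   # False: "AE" is missing

# Without .* in the first group, both lookaheads are tested from the same
# position, so the order of the substrings in the text matters.
print(re.search(r"(?=AA)(?=.*BA)", "BAABD") is not None)    # False
print(re.search(r"(?=.*AA)(?=.*BA)", "BAABD") is not None)  # True
print(re.search(r"(?=AA)(?=.*BA)", "AABABD") is not None)   # True
```

Passing the same expr to str.contains in the apply call from the comments should give the per-cell check that was asked about.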