
There are a number of similar questions that share the same solution: how do I check my list of strings against a larger string and see if there's a match? For example, "How to check if a string contains an element from a list in Python" and "How to check if a line has one of the strings in a list?"

I have a different problem: how do I check my list of strings against a larger string, see if there's a match, and isolate the string so I can perform another string operation relative to the matched string?

Here's some sample data:

| id     | data                |
|--------|---------------------|
| 123131 | Bear Cat Apple Dog  |
| 123131 | Cat Ap.ple Mouse    |
| 231321 | Ap ple Bear         |
| 231321 | Mouse Ap ple Dog    |

Ultimately, I'm trying to find all instances of "apple" ['Apple', 'Ap.ple', 'Ap ple'] and, while it doesn't really matter which one is matched, I need to be able to find out if Cat or Bear exist before it or after it. Position of the matched string does not matter, only an ability to determine what is before or after it.

In "Bear Cat Apple Dog", Bear is before Apple, even though Cat is in the way.

Here's where I am at with my sample code:

import pandas as pd

data = [[123131, "Bear Cat Apple Dog"], [123131, "Cat Ap.ple Mouse"], [231321, "Ap ple Bear"], [231321, "Mouse Ap ple Dog"]]
df = pd.DataFrame(data, columns=['id', 'data'])

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']

    if any(x in m for x in matching_strings):
        # do something to print the matched string
        return True
    return False  # no variant of "apple" found

df["matched"] = df['data'].apply(matching_function)

Would it simply be better to do this with a regex?

Right now, the function only returns True. But if there's a match, I imagine it could also return something like matched_bear_before or matched_bear_after (or the same for Cat) and fill that into the df['matched'] column.

Here's some sample output:

| id     | data                | matched |
|--------|---------------------|---------|
| 123131 | Bear Cat Apple Dog  | TRUE    |
| 123131 | Cat Ap.ple Mouse    | TRUE    |
| 231321 | Ap ple Bear         | TRUE    |
| 231321 | Mouse Ap ple Dog    | FALSE   |
kabaname
  • So you want to know whether one of the strings appears in the text, *and* you want to extract the word immediately before and after the matching string? – shadowtalker Jun 11 '20 at 12:26
  • I would use a regex - you can test for cat and bear next to apple all in one swoop. – Eric Truett Jun 11 '20 at 12:28
  • Yes, I'd like to know if one of the strings appears in a row, and first check to see if Bear or Cat exists before, then check to see if either exist after – kabaname Jun 11 '20 at 12:29
  • @kabaname in "Bear Cat Apple Dog" - what would the answer be? – shadowtalker Jun 11 '20 at 12:30
  • That would depend on the order of the checking! If I check for Apple, then Bear, then Cat. Then it would print the matched strings in that order. Different order if Cat then Bear, for example. I didn't add examples of other types of "bear" or "cat" as I figured the example implemented solely on "apple" could be reproduced for any string. – kabaname Jun 11 '20 at 12:36
  • I just want to understand what you mean by "before" – shadowtalker Jun 11 '20 at 12:38
  • @shadowtalker I see. Before apple is what I mean, so in "Bear Cat Apple Dog" Bear is before Apple, even though Cat is in the way. Is that what you mean? – kabaname Jun 11 '20 at 12:39
  • Yes, that helps. You should edit that into your question. – shadowtalker Jun 11 '20 at 12:42
  • @shadowtalker I will. Thanks! – kabaname Jun 11 '20 at 12:43
  • Are you only interested in "Bear" and "Cat", or are you interested in *all* other terms that appear before and after? – shadowtalker Jun 11 '20 at 12:43
  • Only Bear and Cat. Everything else can be ignored. The idea is to use terms to selectively check and filter rows so that they can be classified. – kabaname Jun 11 '20 at 12:45
  • Are search terms always separated by whitespace? Other than the whitespace in "Apple" itself. – shadowtalker Jun 11 '20 at 12:48
  • yes, they are always separated by whitespace! – kabaname Jun 11 '20 at 12:50
  • @kabaname Can you include your expected output for a given dataframe in your question? – Shubham Sharma Jun 11 '20 at 13:03
  • I've adjusted the sample data and sample output to reflect what I would be looking for, for starters. The function simply has to return true if there's a match before and/or after. The critical ability I'm looking for, however, is to simply be able to recognize a string and look for something before or after it. – kabaname Jun 11 '20 at 13:10

3 Answers


You can use the following pattern to check whether Cat or Bear appears before or after the term of interest, in this case Apple, Ap.ple, or Ap ple.

^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)

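To create the new dataframe column that records whether the condition holds, you can combine map and Series.str.match. Note that str.match anchors the pattern at the start of each string, so this works when Cat, Bear, or the term of interest is the first word in the row, as it is in the sample data: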

>>> df['matched'] = list(map(lambda m: "True" if m else "False", df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)')))

or using numpy.where:

>>> df['matched'] = numpy.where(df['data'].str.match('^(?:Cat|Bear).*Ap[. ]*ple|Ap[. ]*ple.*(?:Cat|Bear)'),'True','False')

will result in:

>>> df
       id                data matched
0  123131  Bear Cat Apple Dog    True
1  123131    Cat Ap.ple Mouse    True
2  231321         Ap ple Bear    True
3  231321    Mouse Ap ple Dog   False
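
If a marker can appear somewhere other than the start of the row (for example "Mouse Bear Apple"), the anchored match above would miss it. The following is a minimal sketch of one way to relax that, assuming Series.str.contains is acceptable here and that "Apple" contains at most one dot or space. The extra row and the word-boundary anchors are additions for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {"data": ["Bear Cat Apple Dog", "Cat Ap.ple Mouse", "Ap ple Bear",
              "Mouse Ap ple Dog", "Mouse Bear Apple"]}  # last row is hypothetical
)

# Unanchored pattern: a marker anywhere before the key, or anywhere after it.
# [. ]? allows at most one dot or space inside "Apple" (an assumption).
pattern = r"\b(?:Cat|Bear)\b.*\bAp[. ]?ple\b|\bAp[. ]?ple\b.*\b(?:Cat|Bear)\b"

# str.contains searches the whole string (like re.search) and yields booleans,
# so no conversion to "True"/"False" strings is needed.
df["matched"] = df["data"].str.contains(pattern, regex=True)
print(df)
```

Because the result is a boolean column, it can be used directly as a row filter, e.g. df[df["matched"]].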
Paolo

Use Series.str.extract to split the df['data'] column into three new columns (before, key, and after), then use Series.str.findall on the before and after columns to collect every marker word found on each side of the key:

import re

keys = ['Apple', 'Ap.ple', 'Ap ple']
markers = ['Cat', 'Bear']

p =  r'(?P<before>.*?)' + r'(?P<key>' +'|'.join(rf'\b{re.escape(k)}\b' for k in keys) + r')' + r'(?P<after>.*)'
m = '|'.join(markers)

df[['before', 'key', 'after']] = df['data'].str.extract(p)
df['before'] = df['before'].str.findall(m)
df['after'] = df['after'].str.findall(m)

df['matched'] = df['before'].str.len().gt(0) | df['after'].str.len().gt(0)

# print(df)

       id                data       before     key   after  matched
0  123131  Bear Cat Apple Dog  [Bear, Cat]   Apple      []     True
1  123131    Cat Ap.ple Mouse        [Cat]  Ap.ple      []     True
2  231321         Ap ple Bear           []  Ap ple  [Bear]     True
3  231321    Mouse Ap ple Dog           []  Ap ple      []    False
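
The question also asks about returning labels such as matched_bear_before instead of a plain boolean. Here is a rough sketch of that idea using only the re module; the function name classify and the label format (e.g. "Bear_before") are hypothetical choices made for illustration, and only the first occurrence of the key is considered.

```python
import re

KEYS = ["Apple", "Ap.ple", "Ap ple"]  # spellings of the key term, from the question
MARKERS = ["Cat", "Bear"]             # terms to look for on either side

# Escape each key so the dot in "Ap.ple" is matched literally.
KEY_RE = re.compile("|".join(rf"\b{re.escape(k)}\b" for k in KEYS))

def classify(text):
    """Return labels like 'Bear_before' for markers around the first key match."""
    hit = KEY_RE.search(text)
    if hit is None:
        return []  # no key in this row
    # Split the row into the text before and after the matched key.
    before, after = text[:hit.start()], text[hit.end():]
    labels = []
    for marker in MARKERS:
        marker_re = re.compile(rf"\b{re.escape(marker)}\b")
        if marker_re.search(before):
            labels.append(f"{marker}_before")
        if marker_re.search(after):
            labels.append(f"{marker}_after")
    return labels

print(classify("Bear Cat Apple Dog"))
print(classify("Ap ple Bear"))
```

With a dataframe like the one above, something along the lines of df['labels'] = df['data'].apply(classify) should fill a column with these lists, and an empty list then plays the role of False.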
Shubham Sharma

python regex find matched string from list

I modified your function to use a regex and the walrus operator (Python 3.8+), which simplifies it:

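import re

def matching_function(m):
    matching_strings = ['Apple', 'Ap.ple', 'Ap ple']
    # Escape each term so the '.' in 'Ap.ple' matches a literal dot
    pattern = '|'.join(re.escape(s) for s in matching_strings)
    if results := re.search(pattern, m):
        print(results[0])  # the matched string
        return True
    return False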
grantr