24

I have a list of strings that I want to filter for strings that contain keywords.

I want to do something like:

fruit = re.compile(['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi'])

so I can then use re.search(fruit, list_of_strings) to get only the strings containing fruits, but I'm not sure how to use a list with re.compile. Any suggestions? (I'm not set on using re.compile, but I think regular expressions would be a good way to do this.)

miku
user808545

5 Answers

55

You need to turn your fruit list into the string apple|banana|peach|plum|pineapple|kiwi so that it is a valid regex. The following should do this for you:

import re

fruit_list = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
fruit = re.compile('|'.join(fruit_list))

As ridgerunner pointed out in comments, you will probably want to add word boundaries to the regex, otherwise the regex will match on words like plump since they have a fruit as a substring.

fruit = re.compile(r'\b(?:%s)\b' % '|'.join(fruit_list))

Lastly, if the strings in fruit_list could contain special characters, you will probably want to use re.escape.

'|'.join(map(re.escape, fruit_list))
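Putting the three steps together (joining, word boundaries, escaping), a minimal sketch might look like the following. The helper name `build_keyword_pattern` and the sample `list_of_strings` are assumptions for illustration and are not part of the original question.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one regex matching any keyword as a whole word (hypothetical helper)."""
    # Escape each keyword so characters like '.' or '+' are matched literally.
    alternation = '|'.join(map(re.escape, keywords))
    # \b on both sides keeps 'plum' from matching inside 'plump'.
    return re.compile(r'\b(?:%s)\b' % alternation)

fruit_list = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
fruit = build_keyword_pattern(fruit_list)

# Assumed sample data.
list_of_strings = ['a ripe banana', 'a plump cushion', 'pineapple juice', 'yellow car']

# A pattern searches one string at a time, so filter with a comprehension.
matching = [s for s in list_of_strings if fruit.search(s)]
print(matching)  # ['a ripe banana', 'pineapple juice']
```

Note that `re.search` does not accept a list of strings, which is why the filtering is done per string. The `\b` anchors assume each keyword starts and ends with a word character; keywords that begin or end with punctuation would probably need a different boundary check.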
wjandrea
Andrew Clark
  • 6
    +1 But I would add word boundaries like so: `fruit = re.compile('\\b(?:' + '|'.join(fruit_list) + ')\\b')` – ridgerunner Jul 19 '11 at 17:06
  • @ridgerunner - Good point! In fact the way it is written now 'pineapple' in the string would always match 'apple', adding word boundaries to my answer. – Andrew Clark Jul 19 '11 at 17:09
  • 2
    Depending on what your list of strings is, you may need to escape them: fruit = re.compile(r'\b(?:%s)\b' % '|'.join([re.escape(x) for x in fruit_list])) – havlock Jun 23 '17 at 15:32
  • Can you provide the same with Python 3.x – Rahul Agarwal Aug 27 '18 at 08:49
  • By using .search() is it possible to know which item was matched? – George May 22 '19 at 13:58
  • @RahulAgarwal It works in Python 3. Did you get an error or something? – wjandrea Aug 10 '23 at 20:41
  • @George For static strings like this, you can get [`.group(0)` from the resulting `Match` object](//docs.python.org/3/library/re.html#re.Match.group). – wjandrea Aug 10 '23 at 20:44
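As a rough illustration of the last comment, the matched keyword can be read from the `Match` object, and `findall` returns every match rather than only the first. The sample `sentence` is an assumption.

```python
import re

fruit_list = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
fruit = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, fruit_list)))

# Assumed sample sentence.
sentence = 'green apple and a ripe kiwi'

first = fruit.search(sentence)
# search() stops at the first match; group(0) is the text that matched.
first_hit = first.group(0) if first else None
# findall() returns every non-overlapping match as a list of strings.
all_hits = fruit.findall(sentence)

print(first_hit)  # apple
print(all_hits)   # ['apple', 'kiwi']
```

Because the pattern uses a non-capturing group `(?:...)`, `findall` returns the full matched text instead of group contents.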
7

As you want exact matches, no real need for regex imo...

fruits = ['apple', 'cherry']
sentences = ['green apple', 'yellow car', 'red cherry']
for s in sentences:
    if any(f in s for f in fruits):
        print(s, 'contains a fruit!')
# green apple contains a fruit!
# red cherry contains a fruit!

EDIT: If you need access to the strings that matched:

from itertools import compress

fruits = ['apple', 'banana', 'cherry']
s = 'green apple and red cherry'

list(compress(fruits, (f in s for f in fruits)))
# ['apple', 'cherry']
mhyfritz
  • In this scenario, regex is more efficient than doing several separate substring tests. – Andrew Clark Jul 19 '11 at 16:24
  • @Andrew: depends on the number of fruits and sentences, and even so we are talking 2x in a matter of milliseconds. –  Jul 19 '11 at 16:46
  • @hop - I am pretty confident regex will be faster regardless of number of fruits or sentences. With regex you also have access to the fruit that was matched. – Andrew Clark Jul 19 '11 at 17:27
  • @Andrew: Re efficiency: noted. Re access to matches: that's easy, check my update. – mhyfritz Jul 19 '11 at 18:32
  • 1
    @Andrew: I will not dispute that regexes are faster, but the non-regex solution might be sufficient on small data sets and easier to understand, especially if you have trouble with regex anyway. –  Jul 19 '11 at 20:26
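The efficiency question raised in these comments is easiest to settle by measuring on your own data. Below is a minimal `timeit` sketch; the workload size and the function names `with_regex` and `with_substrings` are assumptions, and the timings will vary by machine and input, so no result is claimed here.

```python
import re
import timeit

fruits = ['apple', 'banana', 'cherry']
# Assumed workload: the three sample sentences repeated many times.
sentences = ['green apple', 'yellow car', 'red cherry'] * 1000

# Plain alternation without word boundaries, so both approaches
# perform the same kind of substring matching.
pattern = re.compile('|'.join(map(re.escape, fruits)))

def with_regex():
    return [s for s in sentences if pattern.search(s)]

def with_substrings():
    return [s for s in sentences if any(f in s for f in fruits)]

# The comparison is only meaningful if both select the same sentences.
assert with_regex() == with_substrings()

print('regex     :', timeit.timeit(with_regex, number=20))
print('substrings:', timeit.timeit(with_substrings, number=20))
```

With only a handful of keywords the gap may be small either way; the picture could change as the keyword list grows.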
2

Python 3.x update:

import re

fruit_list = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
fruit = re.compile(r'\b(?:{0})\b'.format('|'.join(fruit_list)))
Rahul Agarwal
2

You can create a single regular expression that matches when any of the terms is found:

>>> s, t = "A kiwi, please.", "Strawberry anyone?"
>>> import re
>>> pattern = re.compile('apple|banana|peach|plum|pineapple|kiwi', re.IGNORECASE)
>>> pattern.search(s)
<_sre.SRE_Match object at 0x10046d4a8>
>>> pattern.search(t) # won't find anything
miku
2

Code:

import re

fruits = ['apple', 'banana', 'peach', 'plum', 'pineapple', 'kiwi']
fruit_re = [re.compile(fruit) for fruit in fruits]
fruit_test = lambda x: any(pattern.search(x) for pattern in fruit_re)

Example usage:

fruits_veggies = ['this is an apple', 'this is a tomato']
print([fruit_test(s) for s in fruits_veggies])  # [True, False]

Edit: I realized Andrew's solution is better. You could improve fruit_test by using Andrew's compiled regular expression (called andrew_re here):

fruit_test = lambda x: andrew_re.search(x) is not None
GeneralBecos