Fastest way to check if a string contains a string from a list

Question

Which is the fastest way to search if a string contains another string based on a list?

This one works fine, but is too slow for me when the string is large and the list is long.

test_string = "Hello! This is a test. I love to eat apples."

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")

the fruits list should be smaller than the test_string so maybe you should iterate through the test_string not the fruits list — Mulham Jarjanazi, Oct 04 '19 at 14:24
Possible duplicate of [Python - Fastest way to check if a string contains specific characters in any of the items in a list](https://stackoverflow.com/questions/14411633/python-fastest-way-to-check-if-a-string-contains-specific-characters-in-any-of) — Georgy, Oct 04 '19 at 14:29

yatu · Accepted Answer · 2019-10-04T15:07:20.620

9

For this I'd suggest firstly tokenize the string with RegexpTokenizer to remove all special characters and then use sets to find the intersection:

from nltk.tokenize import RegexpTokenizer
test_string = "Hello! This is a test. I love to eat apples."

tokenizer = RegexpTokenizer(r'\w+')
test_set = set(tokenizer.tokenize(test_string))
# {'Hello', 'I', 'This', 'a', 'apples', 'eat', 'is', 'love', 'test', 'to'}

Having tokenized the string and constructed a set find the set.intersection:

set(['apples', 'oranges', 'bananas']) & test_set
# {'apples'}

edited Oct 04 '19 at 15:07

answered Oct 04 '19 at 14:22

yatu

86,083
12
84
139

2

I wouldn't say this is so simple, specially when it comes to optimizing performance. OP is having issues with performance here, and using sets to compute the intersection will be a much better option than performing `n` times (`n` being the length of `fruits`) a check for membership with the original string. And second there's the complication of the punctuation signs, which has to be dealt with in the case of splitting into separate words to construct a set. Feel free to post a better performing answer for such a problem @Tushortz – yatu Oct 04 '19 at 14:42
Maybe in this case the overkill is in requiring nltk and RegExpTokenizer - since the regexp is simple, regular stdlib's `re.findall` can do the job (although I don't know how it compares with nltk tokenizer speed wise). Other than that, the set intersection thing is certainly an excelent approach, and likely the optimal one. – jsbueno Oct 04 '19 at 17:52

score 2 · Answer 2 · answered Oct 04 '19 at 15:31

2

Yes. you can decrease your iterations like this :

print(any(fruit in frozenset(test_string.replace('.',' ').lower().split()) for fruit in fruits))

answered Oct 04 '19 at 15:31

Sunil Khatri

379
2
8

jfaccioni · Answer 3 · 2019-10-04T14:30:55.857

Sets are probably your best bet for speed when using the in operator.

For building a set containing only words, we need to:

1) remove the punctuation from the string;

2) split the string in whitespaces.

For removing punctuation, this answer probably has the fastest solution (using str.makestrans and string.punctuation).

Here's an example using your test case:

import string

test_string = "Hello! This is a test. I love to eat apples."
test_string_no_punctuation = test_string.translate(str.maketrans('', '', string.punctuation))
word_set = set(test_string_no_punctuation.split())

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in word_set:
        print(fruit+" contains in the string")

You might want to wrap the verbose operations of removing punctuations + splitting the string into a function:

def word_set(input_string):
    return set(input_string.translate(str.maketrans('', '', string.punctuation)).split())

score 1 · Answer 4 · answered Oct 04 '19 at 14:30

the text is usually bigger than the list of words you are searching for.


for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")

this is really inefficient because you are actually looping over the whole text for each fruit in the fruits list, it may not be a problem for short sentences but if you were searching long texts this process would take so much longer.

a better way is to iterate through the text one time and catch all words that are in the fruits list along the way.

score 0 · Answer 5 · answered Oct 04 '19 at 14:40

If you're only interested if a word is present:

>>> words = set(test_string.replace('.',' ').lower().split())
>>> any(fruit in words for fruit in fruits)
True

You could of course loop over each fruit to check which ones can be found in the fruitcake. So you'd change if fruit in test_string to if fruit in words in your loop example.

Dani Mesejo · Answer 6 · 2019-10-04T15:26:21.703

0

You could do something like this:

import re

fruits = ['apples', 'oranges', 'bananas']
test_string = "Hello! This is a test. I love to eat apples."

basket = set(fruits)
words = re.compile('\w+')

for match in words.finditer(test_string):
    fruit = match.group()
    if fruit in basket:
        print(fruit + " contains in the string")

Output

apples contains in the string

edited Oct 04 '19 at 15:26

answered Oct 04 '19 at 15:19

Dani Mesejo

61,499
6
49
76

Fastest way to check if a string contains a string from a list

6 Answers6