9

Which is the fastest way to search if a string contains another string based on a list?

This one works fine, but is too slow for me when the string is large and the list is long.

test_string = "Hello! This is a test. I love to eat apples."

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")
Kristoffer
  • 1,984
  • 3
  • 14
  • 29
  • the fruits list should be smaller than the test_string so maybe you should iterate through the test_string not the fruits list – Mulham Jarjanazi Oct 04 '19 at 14:24
  • 1
    Possible duplicate of [Python - Fastest way to check if a string contains specific characters in any of the items in a list](https://stackoverflow.com/questions/14411633/python-fastest-way-to-check-if-a-string-contains-specific-characters-in-any-of) – Georgy Oct 04 '19 at 14:29

6 Answers6

9

For this I'd suggest firstly tokenize the string with RegexpTokenizer to remove all special characters and then use sets to find the intersection:

from nltk.tokenize import RegexpTokenizer
test_string = "Hello! This is a test. I love to eat apples."

tokenizer = RegexpTokenizer(r'\w+')
test_set = set(tokenizer.tokenize(test_string))
# {'Hello', 'I', 'This', 'a', 'apples', 'eat', 'is', 'love', 'test', 'to'}

Having tokenized the string and constructed a set find the set.intersection:

set(['apples', 'oranges', 'bananas']) & test_set
# {'apples'}
yatu
  • 86,083
  • 12
  • 84
  • 139
  • 2
    I wouldn't say this is so simple, specially when it comes to optimizing performance. OP is having issues with performance here, and using sets to compute the intersection will be a much better option than performing `n` times (`n` being the length of `fruits`) a check for membership with the original string. And second there's the complication of the punctuation signs, which has to be dealt with in the case of splitting into separate words to construct a set. Feel free to post a better performing answer for such a problem @Tushortz – yatu Oct 04 '19 at 14:42
  • Maybe in this case the overkill is in requiring nltk and RegExpTokenizer - since the regexp is simple, regular stdlib's `re.findall` can do the job (although I don't know how it compares with nltk tokenizer speed wise). Other than that, the set intersection thing is certainly an excelent approach, and likely the optimal one. – jsbueno Oct 04 '19 at 17:52
2

Yes. you can decrease your iterations like this :

print(any(fruit in frozenset(test_string.replace('.',' ').lower().split()) for fruit in fruits))
Sunil Khatri
  • 379
  • 2
  • 8
1

Sets are probably your best bet for speed when using the in operator.

For building a set containing only words, we need to:

1) remove the punctuation from the string;

2) split the string in whitespaces.

For removing punctuation, this answer probably has the fastest solution (using str.makestrans and string.punctuation).

Here's an example using your test case:

import string

test_string = "Hello! This is a test. I love to eat apples."
test_string_no_punctuation = test_string.translate(str.maketrans('', '', string.punctuation))
word_set = set(test_string_no_punctuation.split())

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in word_set:
        print(fruit+" contains in the string")

You might want to wrap the verbose operations of removing punctuations + splitting the string into a function:

def word_set(input_string):
    return set(input_string.translate(str.maketrans('', '', string.punctuation)).split())
jfaccioni
  • 7,099
  • 1
  • 9
  • 25
1

the text is usually bigger than the list of words you are searching for.


for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")

this is really inefficient because you are actually looping over the whole text for each fruit in the fruits list, it may not be a problem for short sentences but if you were searching long texts this process would take so much longer.

a better way is to iterate through the text one time and catch all words that are in the fruits list along the way.

0

If you're only interested if a word is present:

>>> words = set(test_string.replace('.',' ').lower().split())
>>> any(fruit in words for fruit in fruits)
True

You could of course loop over each fruit to check which ones can be found in the fruitcake. So you'd change if fruit in test_string to if fruit in words in your loop example.

Jonas Byström
  • 25,316
  • 23
  • 100
  • 147
0

You could do something like this:

import re

fruits = ['apples', 'oranges', 'bananas']
test_string = "Hello! This is a test. I love to eat apples."

basket = set(fruits)
words = re.compile('\w+')

for match in words.finditer(test_string):
    fruit = match.group()
    if fruit in basket:
        print(fruit + " contains in the string")

Output

apples contains in the string
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76