13

Given a list ["one", "two", "three"], how to determine if each word exist in a specified string?

The word list is pretty short (in my case less than 20 words), but the strings to be searched is pretty huge (400,000 strings for each run)

My current implementation uses re to look for matches but I'm not sure if it's the best way.

import re
word_list = ["one", "two", "three"]
regex_string = "(?<=\W)(%s)(?=\W)" % "|".join(word_list)

finder = re.compile(regex_string)
string_to_be_searched = "one two three"

results = finder.findall(" %s " % string_to_be_searched)
result_set = set(results)
for word in word_list:
    if word in result_set:
        print("%s in string" % word)

Problems in my solution:

  1. It will search until the end of the string, although the words may appear in the first half of the string
  2. In order to overcome the limitation of lookahead assertion (I don't know how to express "the character before current match should be non-word characters, or the start of the string"), I added extra space before and after the string I need to be searched.
  3. Other performance issue introduced by the lookahead assertion?

Possible simpler implementation:

  1. just loop through the word list and do a if word in string_to_be_searched. But it can not deal with "threesome" if you are looking for "three"
  2. Use one regular expression search for one word. Still I'm not sure about the performance, and the potential of searching string multiple times.

UPDATE:

I've accepted Aaron Hall's answer https://stackoverflow.com/a/21718896/683321 because according to Peter Gibson's benchmark https://stackoverflow.com/a/21742190/683321 this simple version has the best performance. If you are interested in this problem, you can read all the answers and get a better view.

Actually I forgot to mention another constraint in my original problem. The word can be a phrase, for example: word_list = ["one day", "second day"]. Maybe I should ask another question.

Community
  • 1
  • 1
yegle
  • 5,795
  • 6
  • 39
  • 61
  • why not just split the word in the string_to_be_searched and put them in the dict ,and iterate words in the search list to determine – michaeltang Feb 12 '14 at 04:12
  • @michaeltang this would be great if you had to search that string_to_be_searched a lot, but constructing a dictionary to do an O(1) lookup once isn't amazing.... – Adam Smith Feb 12 '14 at 04:22
  • I believe my regular expression solution (http://stackoverflow.com/questions/21718345/python-how-to-determine-if-a-list-of-words-exist-in-a-string/21719831#21719831) would work for your additional constraint: it's 4 times slower, even if it is the 2nd fastest, but the fastest solution would not work for that. It's probably not a good idea to recycle your question with one additional constraint, but I could be wrong there. – Russia Must Remove Putin Feb 13 '14 at 01:16

10 Answers10

17

This function was found by Peter Gibson (below) to be the most performant of the answers here. It is good for datasets one may hold in memory (because it creates a list of words from the string to be searched and then a set of those words):

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

Usage:

my_word_list = ['one', 'two', 'three']
a_string = 'one two three'
if words_in_string(my_word_list, a_string):
    print('One or more words found!')

Which prints One or words found! to stdout.

It does return the actual words found:

for word in words_in_string(my_word_list, a_string):
    print(word)

Prints out:

three
two
one

For data so large you can't hold it in memory, the solution given in this answer would be very performant.

Community
  • 1
  • 1
Russia Must Remove Putin
  • 374,368
  • 89
  • 403
  • 331
  • Slick, but it needs to indicate each word from a_list found in a_string, not just that a single case was. –  Feb 12 '14 at 05:13
  • @JohnPirie I wasn't sure exactly what the requestor was asking for, but what you say it needs, it does! :D – Russia Must Remove Putin Feb 12 '14 at 05:19
  • 1
    I found this to be the fastest solution in my testing (see my new post) and the simplicity is certainly appealing - well done – Peter Gibson Feb 12 '14 at 23:35
  • @PeterGibson Thank you! I didn't have a chance to benchmark, did you have an opinion of my generator approach? I suppose it's not fully implemented, though, but to be fair, if the string is infinitely long and one word is never found, the search would never complete: http://stackoverflow.com/questions/21718345/python-how-to-determine-if-a-list-of-words-exist-in-a-string/21719831#21719831 – Russia Must Remove Putin Feb 12 '14 at 23:41
  • 1
    Yes, it was slower than this, but still one of the faster solutions. Check out the results http://stackoverflow.com/a/21742190/66349 – Peter Gibson Feb 12 '14 at 23:42
  • So you're saying I took first and second place over 5 other entries? Sweet. Thanks again! :D – Russia Must Remove Putin Feb 12 '14 at 23:45
  • I suspect this solution would degrade with string length, but for strings 400,000 chars long it worked great – Peter Gibson Feb 12 '14 at 23:46
  • Agreed, that was my leading comment. I expect my other solution would not degrade with length, though, which is why I posted it. – Russia Must Remove Putin Feb 12 '14 at 23:48
  • it is showing degraded performance when using with ProcessPoolExecutor or ThreadPoolExecutor. sample lines from google 5gram (i have 100gb file) 1. word1 w2 w3 w4 w5 2008 2 2 2. NP no world trees fruits 1975 1 1 test done on 119288 lines above code took 0.25 seconds aprox, parallel took 89.56 seconds futures = [] my_word_list = ['banking', 'members', 'based', 'hardness'] with ProcessPoolExecutor() as pe: for line in lines: ff = pe.submit(words_in_string,my_word_list, line) futures.append(ff) results = [f.result() for f in futures] – bommina Jul 08 '20 at 13:30
6

To satisfy my own curiosity, I've timed the posted solutions. Here are the results:

TESTING: words_in_str_peter_gibson          0.207071995735
TESTING: words_in_str_devnull               0.55300579071
TESTING: words_in_str_perreal               0.159866499901
TESTING: words_in_str_mie                   Test #1 invalid result: None
TESTING: words_in_str_adsmith               0.11831510067
TESTING: words_in_str_gnibbler              0.175446796417
TESTING: words_in_string_aaron_hall         0.0834425926208
TESTING: words_in_string_aaron_hall2        0.0266295194626
TESTING: words_in_str_john_pirie            <does not complete>

Interestingly @AaronHall's solution

def words_in_string(word_list, a_string):
    return set(a_list).intersection(a_string.split())

which is the fastest, is also one of the shortest! Note it doesn't handle punctuation next to words, but it's not clear from the question whether that is a requirement. This solution was also suggested by @MIE and @user3.

I didn't look very long at why two of the solutions did not work. Apologies if this is my mistake. Here is the code for the tests, comments & corrections are welcome

from __future__ import print_function
import re
import string
import random
words = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']

def random_words(length):
    letters = ''.join(set(string.ascii_lowercase) - set(''.join(words))) + ' '
    return ''.join(random.choice(letters) for i in range(int(length)))

LENGTH = 400000
RANDOM_STR = random_words(LENGTH/100) * 100
TESTS = (
    (RANDOM_STR + ' one two three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three four five six seven eight nine ten', (
        ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten'],
        set(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']),
        True,
        [True] * 10,
        {'one': True, 'two': True, 'three': True, 'four': True, 'five': True, 'six': True,
            'seven': True, 'eight': True, 'nine': True, 'ten':True}
        )),

    ('one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR, (
        [],
        set(),
        False,
        [False] * 10,
        {'one': False, 'two': False, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    (RANDOM_STR + ' one two three ' + RANDOM_STR, (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' three', (
        ['one', 'two', 'three'],
        set(['one', 'two', 'three']),
        False,
        [True] * 3 + [False] * 7,
        {'one': True, 'two': True, 'three': True, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    ('one ' + RANDOM_STR + ' two ' + RANDOM_STR + ' threesome', (
        ['one', 'two'],
        set(['one', 'two']),
        False,
        [True] * 2 + [False] * 8,
        {'one': True, 'two': True, 'three': False, 'four': False, 'five': False, 'six': False,
            'seven': False, 'eight': False, 'nine': False, 'ten':False}
        )),

    )

def words_in_str_peter_gibson(words, s):
    words = words[:]
    found = []
    for match in re.finditer('\w+', s):
        word = match.group()
        if word in words:
            found.append(word)
            words.remove(word)
            if len(words) == 0: break
    return found

def words_in_str_devnull(word_list, inp_str1):
    return dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)


def words_in_str_perreal(wl, s):
    i, swl, strwords = 0, sorted(wl), sorted(s.split())
    for w in swl:
        while strwords[i] < w:  
            i += 1
            if i >= len(strwords): return False
        if w != strwords[i]: return False
    return True

def words_in_str_mie(search_list, string):
    lower_string=string.lower()
    if ' ' in lower_string:
        result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
        substr=lower_string[:lower_string.find(' ')]
        if substr in search_list and substr not in result:
            result+=substr
        substr=lower_string[lower_string.rfind(' ')+1:]
        if substr in search_list and substr not in result:
            result+=substr
    else:
        if lower_string in search_list:
            result=[lower_string]

def words_in_str_john_pirie(word_list, to_be_searched):
    for word in word_list:
        found = False
        while not found:
            offset = 0
            # Regex is expensive; use find
            index = to_be_searched.find(word, offset)
            if index < 0:
                # Not found
                break
            if index > 0 and to_be_searched[index - 1] != " ":
                # Found, but substring of a larger word; search rest of string beyond
                offset = index + len(word)
                continue
            if index + len(word) < len(to_be_searched) \
                    and to_be_searched[index + len(word)] != " ":
                # Found, but substring of larger word; search rest of string beyond
                offset = index + len(word)
                continue
            # Found exact word match
            found = True    
    return found

def words_in_str_gnibbler(words, string_to_be_searched):
    word_set = set(words)
    found = []
    for match in re.finditer(r"\w+", string_to_be_searched):
        w = match.group()
        if w in word_set:
             word_set.remove(w)
             found.append(w)
    return found

def words_in_str_adsmith(search_list, big_long_string):
    counter = 0
    for word in big_long_string.split(" "):
        if word in search_list: counter += 1
        if counter == len(search_list): return True
    return False

def words_in_string_aaron_hall(word_list, a_string):
    def words_in_string(word_list, a_string):
        '''return iterator of words in string as they are found'''
        word_set = set(word_list)
        pattern = r'\b({0})\b'.format('|'.join(word_list))
        for found_word in re.finditer(pattern, a_string):
            word = found_word.group(0)
            if word in word_set:
                word_set.discard(word)
                yield word
                if not word_set:
                    raise StopIteration
    return list(words_in_string(word_list, a_string))

def words_in_string_aaron_hall2(word_list, a_string):
    return set(word_list).intersection(a_string.split())

ALGORITHMS = (
        words_in_str_peter_gibson,
        words_in_str_devnull,
        words_in_str_perreal,
        words_in_str_mie,
        words_in_str_adsmith,
        words_in_str_gnibbler,
        words_in_string_aaron_hall,
        words_in_string_aaron_hall2,
        words_in_str_john_pirie,
        )

def test(alg):
    for i, (s, possible_results) in enumerate(TESTS):
        result = alg(words, s)
        assert result in possible_results, \
            'Test #%d invalid result: %s ' % (i+1, repr(result))

COUNT = 10
if __name__ == '__main__':
    import timeit
    for alg in ALGORITHMS:
        print('TESTING:', alg.__name__, end='\t\t')
        try:
            print(timeit.timeit(lambda: test(alg), number=COUNT)/COUNT)
        except Exception as e:
            print(e)
Peter Gibson
  • 19,086
  • 7
  • 60
  • 64
  • Amazing fact, thank you for the test and comparison. I'm getting similar result to yours. – yegle Feb 13 '14 at 00:38
2

Easy way:

filter(lambda x:x in string,search_list)

if you want the search to ignore character's case you can do this:

lower_string=string.lower()
filter(lambda x:x.lower() in lower_string,search_list)

if you want to ignore words that are part of bigger word such as three in threesome:

lower_string=string.lower()
result=[]
if ' ' in lower_string:
    result=filter(lambda x:' '+x.lower()+' ' in lower_string,search_list)
    substr=lower_string[:lower_string.find(' ')]
    if substr in search_list and substr not in result:
        result+=[substr]
    substr=lower_string[lower_string.rfind(' ')+1:]
    if substr in search_list and substr not in result:
        result+=[substr]
else:
    if lower_string in search_list:
        result=[lower_string]


If performance is needed:
arr=string.split(' ')
result=list(set(arr).intersection(set(search_list)))

EDIT: this method was the fastest in an example that searches for 1,000 words in a string containing 400,000 words but if we increased the string to be 4,000,000 the previous method is faster.


if string is too long you should do low level search and avoid converting it to list:
def safe_remove(arr,elem):
    try:
        arr.remove(elem)
    except:
        pass

not_found=search_list[:]
i=string.find(' ')
j=string.find(' ',i+1)
safe_remove(not_found,string[:i])
while j!=-1:
    safe_remove(not_found,string[i+1:j])
    i,j=j,string.find(' ',j+1)
safe_remove(not_found,string[i+1:])

not_found list contains words that are not found, you can get the found list easily, one way is list(set(search_list)-set(not_found))

EDIT: the last method appears to be the slowest.

MIE
  • 444
  • 2
  • 9
  • 1
    it can not deal with "threesome" if you are looking for "three" ? – michaeltang Feb 12 '14 at 04:11
  • I've timed each of the posted solutions, but I couldn't get yours to complete all of the tests - it returns None for one of the tests. If you would care to take a look and fix it (or tell me what's wrong with my end) I'll update the results. Cheers. stackoverflow.com/a/21742190/66349 – Peter Gibson Feb 13 '14 at 01:03
  • @PeterGibson first method edited also the first is faster given that the string is more than four million words – MIE Feb 13 '14 at 04:53
1
def words_in_str(s, wl):
    i, swl, strwords = 0, sorted(wl), sorted(s.split())
    for w in swl:
        while strwords[i] < w:  
            i += 1
            if i >= len(strwords): return False
        if w != strwords[i]: return False
    return True
perreal
  • 94,503
  • 21
  • 155
  • 181
  • This seems promising...Maybe replace the `string.split` with one of the generator version at http://stackoverflow.com/questions/3862010/is-there-a-generator-version-of-string-split-in-python – yegle Feb 12 '14 at 04:28
  • @yegle, but then it will be hard to do a sorted generator version? – perreal Feb 12 '14 at 04:32
1

You can try this:

list(set(s.split()).intersection(set(w)))

It return only matched words from your word list. If no words matched, it would return empty list.

venpa
  • 4,268
  • 21
  • 23
0

If your string is long and your search list is short, do this:

def search_string(big_long_string,search_list)
    counter = 0
    for word in big_long_string.split(" "):
        if word in search_list: counter += 1
        if counter == len(search_list): return True
    return False
Adam Smith
  • 52,157
  • 12
  • 73
  • 112
0

You could make use of word boundaries:

>>> import re
>>> word_list = ["one", "two", "three"]
>>> inp_str = "This line not only contains one and two, but also three"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print "Found all words in the list"
...
Found all words in the list
>>> inp_str = "This line not only contains one and two, but also threesome"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print "Found all words in the list"
...
>>> inp_str = "This line not only contains one and two, but also four"
>>> if all(re.search(r'\b{}\b'.format(re.escape(word)), inp_str) for word in word_list):
...   print "Found all words in the list"
...
>>>

EDIT: As indicated in your comment, you seem to be looking for a dictionary instead:

>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str1))) for word in word_list)
{'three': True, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str2))) for word in word_list)
{'three': False, 'two': True, 'one': True}
>>> dict((word, bool(re.search(r'\b{}\b'.format(re.escape(word)), inp_str3))) for word in word_list)
{'three': False, 'two': True, 'one': True}
devnull
  • 118,548
  • 33
  • 236
  • 227
  • +1 but using `str` as a variable name is a bad idea. – thefourtheye Feb 12 '14 at 04:18
  • It would be interesting to compare this against a single regex with the search terms 'OR'd together using `|` as in the question – Peter Gibson Feb 12 '14 at 04:21
  • @PeterGibson It will not match all the words, even if one word matches it will return the match. – thefourtheye Feb 12 '14 at 04:27
  • 1
    I'm not actually looking for a single `bool` value, instead I'm looking for a dict mapping `word` to `bool`. Besides, I may need to run some test and see the performance of running `re.search` multiple times and run `re.findall` once. – yegle Feb 12 '14 at 04:30
  • @thefourtheye yes but it will possibly search completely through the input string multiple times before finding a match - I suspect it is more efficient to only iterate once through the input string (just a hunch though) – Peter Gibson Feb 12 '14 at 04:33
  • @PeterGibson mmmm. I doubt that. RE engine might have been implemented as a state machine. So, it doesn't have to do multiple iterations. – thefourtheye Feb 12 '14 at 05:03
  • @yegle Updated to return a dict as well. – devnull Feb 12 '14 at 05:04
  • @thefourtheye if the target words are near the end of the string, then the use of `all` means that the multiple iterations is happening outside of the `re` module – Peter Gibson Feb 12 '14 at 05:27
0

If the order isn't too important, you can use this approach

word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"

for w in string_to_be_searched.split():
    if w in word_set:
         print("%s in string" % w)
         word_set.remove(w)

The .split() creates a list, which may be a problem for your 400k word string. But if you have enough RAM, you are done.

It's of course possible to modify the for loop to avoid creating the whole list. re.finditer or a generator using str.find are the obvious choices

import re
word_set = {"one", "two", "three"}
string_to_be_searched = "one two three"

for match in re.finditer(r"\w+", string_to_be_searched):
    w = match.group()
    if w in word_set:
         print("%s in string" % w)
         word_set.remove(w)
John La Rooy
  • 295,403
  • 53
  • 369
  • 502
0

Given your comment

I'm not actually looking for a single bool value, instead I'm looking for a dict mapping word to bool. Besides, I may need to run some test and see the performance of running re.search multiple times and run re.findall once. – yegle

I would propose the following

import re
words = ['one', 'two', 'three']

def words_in_str(words, s):
    words = words[:]
    found = []
    for match in re.finditer('\w+', s):
        word = match.group()
        if word in words:
            found.append(word)
            words.remove(word)
            if len(words) == 0: break
    return found

assert words_in_str(words, 'three two one') == ['three', 'two', 'one']
assert words_in_str(words, 'one two. threesome') == ['one', 'two']
assert words_in_str(words, 'nothing of interest here one1') == []

This returns a list of words found in order, but you could easily modify it to return a dict{word:bool} as you desire.

Advantages:

  • stops searching through input string when all words are found
  • removes a word form candidates once it is found
Peter Gibson
  • 19,086
  • 7
  • 60
  • 64
0

Here's a simple generator that would be better for big strings, or a file, as I adapt it in the section below.

Note that this should be very fast, but it will continue for as long as the string continues without hitting all the words. This came in second on Peter Gibson's benchmarking: Python: how to determine if a list of words exist in a string

For a faster solution for shorter strings, see my other answer here: Python: how to determine if a list of words exist in a string


Original Answer

import re

def words_in_string(word_list, a_string):
    '''return iterator of words in string as they are found'''
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
            if not word_set: # then we've found all words
                # break out of generator, closing file
                raise StopIteration 

It goes through the string yielding the words as it finds them, abandoning the search after it finds all the words, or if it reaches the end of the string.

Usage:

word_list = ['word', 'foo', 'bar']
a_string = 'A very pleasant word to you.'
for word in words_in_string(word_list, a_string):
    print word

word

EDIT: adaptation to use with a large file:

Thanks to Peter Gibson for finding this the second fastest approach. I'm quite proud of the solution. Since the best use-case for this is to go through a huge text stream, let me adapt the above function here to handle a file. Do note that if words are broken on newlines this will not catch them, but neither would any of the other methods here.

import re

def words_in_file(word_list, a_file_path):
    '''
    return a memory friendly iterator of words as they are found
    in a file.
    '''
    word_set = set(word_list)
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    with open(a_file_path, 'rU') as a_file:
        for line in a_file:
            for found_word in re.finditer(pattern, line):
                word = found_word.group(0)
                if word in word_set:
                    word_set.discard(word)
                    yield word
                    if not word_set: # then we've found all words
                        # break out of generator, closing file
                        raise StopIteration

To demonstrate, let's write some data:

file_path = '/temp/temp/foo.txt'
with open(file_path, 'w') as f:
    f.write('this\nis\nimportant\ndata')

and usage:

word_list = ['this', 'is', 'important']
iterator = words_in_file(word_list, file_path)

we now have an iterator, and if we consume it with a list:

list(iterator)

it returns:

['this', 'is', 'important']
Community
  • 1
  • 1
Russia Must Remove Putin
  • 374,368
  • 89
  • 403
  • 331