19

Given these 3 lists of data and a list of keywords:

good_data1 = ['hello, world', 'hey, world']
good_data2 = ['hey, man', 'whats up']
bad_data = ['hi, earth', 'sup, planet']
keywords = ['world', 'he']

I'm trying to write a simple function to check if any of the keywords exist as a substring of any word in the data lists. It should return True for the good_data lists and False for bad_data.

I know how to do this in what seems to be an inefficient way:

def checkData(data):
  for s in data:
    for k in keywords:
      if k in s:
        return True
  return False
Jason Coon

5 Answers

38

Are you looking for

any( k in s for k in keywords )

It's more compact, but might be less efficient.
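For instance, dropped into the original function it might look like this (a sketch only; keywords is assumed to be the module-level list from the question):

def checkData(data):
  # Return True as soon as any keyword appears as a substring of an item.
  for s in data:
    if any(k in s for k in keywords):
      return True
  return False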

S.Lott
    While it is more compact, it is much less readable. Very Perlish IMHO. – Shane C. Mason Apr 14 '09 at 21:09
  • +1 this is very pythonic and clean for small lists of keywords – Jarret Hardie Apr 14 '09 at 23:25
  • -1: It's compact but this does not actually answer the question (he asked for substring match, not an exact match). This works though: any(k in s for s in data for k in keywords) – mhawke Apr 15 '09 at 00:10
  • @mhawke: I believe the intent is to place the code within the "for s in data" loop - note that the variable he's searching is named "s" as in the inner loop, not "data", the list. Put there, it is a substring search. IMHO using an outer loop is more readable than a double-looping comprehension. – Brian Apr 15 '09 at 10:07
  • Thanks S.Lott! I'm using this answer in combo with gnud's: 's' is the joined string. I'm accepting gnud's as the answer since he thought of joining the array, and also a much lower rank. – Jason Coon Apr 15 '09 at 12:22
  • jcoon - what was your final implementation? Could you post it? – Frederick Feb 05 '15 at 13:51
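Based on the comments above, the combination of this answer with gnud's join (below) was presumably along these lines (a reconstruction, not the asker's posted code):

def check_data(data):
    # Join the list once so each keyword is scanned over a single string,
    # and let any() stop at the first keyword that matches.
    s = "\n".join(data)
    return any(k in s for k in keywords)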
16

In your example, with so few items, it doesn't really matter. But if you have a list of several thousand items, this might help.

Since you don't care which element in the list contains the keyword, you can scan the whole list once (as one string) instead of one item at a time. For that you need a join character that you know won't occur in any keyword, in order to avoid false positives. I use a newline in this example.

def check_data(data):
    s = "\n".join(data)
    for k in keywords:
        if k in s:
            return True

    return False

In my completely unscientific test, my version checked a list of 5000 items 100000 times in about 30 seconds. I stopped your version after 3 minutes -- got tired of waiting to post =)
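For reference, a rough harness along these lines can reproduce that kind of comparison (a sketch using timeit; the list sizes and repeat counts here are only examples, not the exact figures from the test above):

import timeit

keywords = ['world', 'he']
data = ['hi, earth', 'sup, planet'] * 2500   # 5000 non-matching items (worst case)

def check_original(data):
    for s in data:
        for k in keywords:
            if k in s:
                return True
    return False

def check_joined(data):
    s = "\n".join(data)
    for k in keywords:
        if k in s:
            return True
    return False

# number=1000 keeps the run short; scale it up for more stable figures.
print(timeit.timeit(lambda: check_original(data), number=1000))
print(timeit.timeit(lambda: check_joined(data), number=1000))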

gnud
  • Nice solution, I was thinking of exactly that. It should be pretty fast compared to regexes, because the __contains__ check is done with a modified Boyer-Moore and thus should skip nicely over the separator chars. – Torsten Marek Apr 14 '09 at 21:47
  • Nice idea, and it does have a *very* significant effect on large lists - probably, as Torsten said, because python can reuse the search tables it creates instead of throwing them away and recreating them for every list item. I'll update my timing figures. – Brian Apr 14 '09 at 21:57
  • Joining the array is a great idea! I decided to use this in combo with the S.Lott's suggestion on the any() function. Thanks! – Jason Coon Apr 15 '09 at 12:23
  • That was brilliant, really smart, nice job bro. – AliBZ Jul 22 '13 at 23:18
4

If you have many keywords, you might want to try a suffix tree [1]. Insert all the words from the three data lists, storing which list each word comes from in its terminating node. Then you can perform queries on the tree for each keyword really, really fast.

Warning: suffix trees are very complicated to implement!

[1] http://en.wikipedia.org/wiki/Suffix_tree
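For a rough sense of the underlying idea without implementing a full suffix tree, a sorted list of suffixes with binary search gives the same kind of "preprocess the data once, then query every keyword quickly" behaviour (a simple stand-in, not a suffix tree; a real suffix tree builds and queries this far more efficiently):

import bisect

def build_suffixes(data):
    # Collect every suffix of every item; sorting lets us binary-search
    # for a keyword as the prefix of some suffix (i.e. a substring of the data).
    return sorted(s[i:] for s in data for i in range(len(s)))

def contains_keyword(suffixes, keywords):
    for k in keywords:
        i = bisect.bisect_left(suffixes, k)
        if i < len(suffixes) and suffixes[i].startswith(k):
            return True
    return False

suffixes = build_suffixes(good_data1)
print(contains_keyword(suffixes, keywords))   # True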

moinudin
  • Though note that regexes will do effectively the same with a finite state machine. I suspect it would perform similarly, except slower by a constant factor due to being coded in python vs C. – Brian Apr 14 '09 at 22:21
  • Most regex libraries are *not* implemented using fsm's. This is true of any regex library that allows for backreferences. Otherwise you would be perfectly correct. Good point nonetheless. – moinudin Apr 14 '09 at 22:35
  • @marcog. Good point, and it looks like I'm wrong anyway. I just looked at re.compile("abc|abd|abb|acb",re.DEBUG), and there's very little sharing of branches going on, which indicates that it's probably backtracking the whole way back to the start of each word anyway. – Brian Apr 14 '09 at 23:12
2

You may be able to improve matters by building your list of keywords as a regular expression.

This may allow them to be tested in parallel, but it will very much depend on what the keywords are (e.g. some work may be reused when testing for "hello" and "hell", rather than searching every phrase from the start for each word).

You could do this by executing:

import re
keyword_re = re.compile("|".join(map(re.escape, keywords)))

Then:

>>> bool(keyword_re.search('hello, world'))
True
>>> bool(keyword_re.search('hi, earth'))
False

(It will actually return a match object if found, and None if not found; this might be useful if you need to know which keyword matched.)

However, how much (if anything) this gains you will depend on the keywords. If you only have one or two, keep your current approach. If you have a large list, it may be worth trying and profiling both to see which performs better.
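Dropped into a function like the original check, either regex variant might look like this (a sketch; the joined variant corresponds to the "regex (join)" row in the timings below):

def check_data_regex(data):
    # Test each item against the combined pattern, stopping at the first hit.
    return any(keyword_re.search(s) for s in data)

def check_data_regex_join(data):
    # Join first, then run the combined pattern once over the whole list.
    return keyword_re.search("\n".join(data)) is not None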

[Edit] For reference, here's how the approaches do for your example:

               good1   good2  good3   bad1   bad2
original     : 0.206   0.233  0.229   0.390   63.879
gnud (join)  : 0.257   0.347  4.600   0.281    6.706
regex        : 0.766   1.018  0.397   0.764  124.351
regex (join) : 0.345   0.337  3.305   0.481   48.666

Obviously for this case, your approach performs far better than the regex one. Whether this will always be the case depends a lot on the number and complexity of keywords, and the input data that will be checked. For large numbers of keywords, and lengthy lists or rarely matching phrases, regexes may work better, but do get timing information, and perhaps try even simpler optimisations (like moving the most common words to the front of your keyword list) first. Sometimes the simplest approach really is the best.

[Edit2] Updated the table with gnud's solution, and a similar approach before applying the regexes. I also added 2 new tests:

good_data3 = good_data2 * 500  # 1000 items, the first of which matches.
bad_data2 = bad_data * 500     # 1000 items, none of which matches.

These show the various strengths and weaknesses. Joining does do worse when a match would be found immediately (there is an up-front cost to joining the list that is always paid, and an immediate match is the best possible case for the linear search); however, for non-matching lists it performs better - much better when there are a large number of items in the list.

Brian
0

I think this is pretty efficient and clear, though you could use map() to avoid the many nests. I agree with ross on the dictionary idea for larger lists.
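If the map() suggestion appeals, it might look roughly like this (a sketch only; functionally the same as the nested loops in the question):

def check_data(data):
    # map() the per-item test over the data instead of nesting explicit loops.
    has_keyword = lambda s: any(k in s for k in keywords)
    return any(map(has_keyword, data))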

Shane C. Mason