31

I have a list of possible substrings, e.g. ['cat', 'fish', 'dog']. In practice, the list contains hundreds of entries.

I'm processing a string, and I want to find the index of the first appearance of any of these substrings.

To clarify, for '012cat' the result is 3, and for '0123dog789cat' the result is 4.

I also need to know which substring was found (e.g. its index in the substring list or the text itself), or at least the length of the substring matched.

There are obvious brute-force ways to achieve this; I wondered whether there's an elegant Python/regex solution.

Roee Adler
  • Is the list of substrings constant? I'm asking because regex-type solutions usually involve some precomputation of the regular expression (resp. the list of substrings in your case). Would that precomputation be amortized over many searches? – Accipitridae May 09 '09 at 07:30

6 Answers

36

I would assume a regex is better than checking each substring individually because, conceptually, the regular expression is modeled as a DFA, so as the input is consumed all matches are tested at the same time (resulting in a single scan of the input string).

So, here is an example:

import re

def work():
    to_find = re.compile("cat|fish|dog")
    search_str = "blah fish cat dog haha"
    match_obj = to_find.search(search_str)
    the_index = match_obj.start()  # produces 5, the index of "fish"
    which_word_matched = match_obj.group()  # "fish"
    # Note: if there is no match, match_obj is None

UPDATE: Some care should be taken when combining words into a single pattern of alternative words. The following code builds a regex, but escapes any regex special characters and sorts the words so that longer words get a chance to match before any shorter prefixes of the same word:

def wordlist_to_regex(words):
    escaped = map(re.escape, words)
    combined = '|'.join(sorted(escaped, key=len, reverse=True))
    return re.compile(combined)

>>> r = wordlist_to_regex(['cat', 'fish', 'dog', 'atom', 'comp.lang.python', r'north\south'])  # a word list consistent with the spans below
>>> r.search('smash atomic particles').span()
(6, 10)
>>> r.search('visit usenet:comp.lang.python today').span()
(13, 29)
>>> r.search('a north\south division').span()
(2, 13)
>>> r.search('012cat').span()
(3, 6)
>>> r.search('0123dog789cat').span()
(4, 7)

END UPDATE

It should be noted that you will want to form the regex (i.e., call re.compile()) as rarely as possible. The best case is that you know your searches ahead of time (or compute them once or infrequently) and can then save the result of re.compile somewhere. My example is just a simple nonsense function so you can see the usage of the regex. There are some more regex docs here:

http://docs.python.org/library/re.html
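
For instance, a minimal compile-once sketch (the module-level names here are illustrative, not from the answer; wordlist_to_regex is the helper defined above):

FIND_WORDS = ['cat', 'fish', 'dog']      # e.g. regenerated once a day, per the question
FIND_RE = wordlist_to_regex(FIND_WORDS)  # compiled once, reused for every search

def first_hit(text):
    # Returns (index, matched word), or None if nothing matched.
    m = FIND_RE.search(text)
    if m is None:
        return None
    return (m.start(), m.group())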

Hope this helps.

UPDATE: I am unsure about how Python implements regular expressions, but to answer Rax's question about whether or not there are limitations of re.compile() (for example, how many words you can try to "|" together to match at once) and the amount of time compile takes: neither of these seems to be an issue. I tried out this code, which is good enough to convince me. (I could have made this better by adding timing and reporting results, as well as throwing the list of words into a set to ensure there are no duplicates... but both of these improvements seem like overkill.) This code ran basically instantaneously, and convinced me that I am able to search for 2000 words (of size 10), and that any of them will match appropriately. Here is the code:

import random
import re
import string
import sys

def main(args):
    words = []
    letters_and_digits = string.letters + string.digits
    for i in range(2000):
        # build one random 10-character word
        chars = [random.choice(letters_and_digits) for j in range(10)]
        words.append("".join(chars))
    search_for = re.compile("|".join(words))
    first, middle, last = words[0], words[len(words) // 2], words[-1]
    search_string = "%s, %s, %s" % (last, middle, first)
    match_obj = search_for.search(search_string)
    if match_obj is None:
        print "Ahhhg"
        return
    index = match_obj.start()
    which = match_obj.group()
    if index != 0:
        print "ahhhg"
        return
    if words[-1] != which:
        print "ahhg"
        return

    print "success!!! Generated 2000 random words, compiled re, and was able to perform matches."

if __name__ == "__main__":
    main(sys.argv)

UPDATE: It should be noted that the order of things ORed together in the regex matters. Have a look at the following test inspired by TZOTZIOY:

>>> search_str = "01catdog"
>>> test1 = re.compile("cat|catdog")
>>> match1 = test1.search(search_str)
>>> match1.group()
'cat'
>>> match1.start()
2
>>> test2 = re.compile("catdog|cat")  # reverse order
>>> match2 = test2.search(search_str)
>>> match2.group()
'catdog'
>>> match2.start()
2

This suggests the order matters :-/. I am not sure what this means for Rax's application, but at least the behavior is known.
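
For what it's worth, the wordlist_to_regex helper above sidesteps this by sorting longer words first; a quick check reusing it on the same search_str:

>>> test3 = wordlist_to_regex(["cat", "catdog"])  # compiles to catdog|cat
>>> test3.search(search_str).group()
'catdog'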

UPDATE: I posted this question about the implementation of regular expressions in Python, which will hopefully give us some insight into the issues found here.

Tom
  • This surely works, but I have a question - isn't there a limitation on the size of the regex definition? If I have 1000 substrings, will it still work? Is there any significant performance degradation relative to the number of words (i.e. that's more than linear in the size of the list)? Regarding your other clarifications, my list of substrings is being update only once a day or so, I think it's no issue to generate the regex definition and call "compile" at this frequency. Many thanks – Roee Adler May 09 '09 at 07:55
  • @rax did you see my new solution? I basically fixed everything about it and submitted it 20 seconds after this one. – Unknown May 09 '09 at 08:09
  • @rax: Hopefully the example code I added helps convince you the re module will be fine :-). – Tom May 09 '09 at 08:49
  • I would suggest the "|".join(words) too, but there's the issue that searching for ["cacat", "cat"] in "this cacat is a cat" returns [("cacat", 5), ("cat", 16)] instead of [("cacat", 5), ("cat", 7)], although this could be the wanted behaviour. – tzot May 09 '09 at 10:10
  • Line 7 in your first snippet should be 'match_obj.start()' – Nick Presta May 09 '09 at 17:39
  • Many so-called regular expression syntaxes are not actually "regular". That is, they are actually more powerful than true regular expressions and therefore cannot be represented as a DFA. An example of this which shows up in Python, Perl and even grep is back-references. Take the Python regex r"(a+)b\1". This matches some number of a's, a b, and then the same number of a's as before. This is non-regular. RE engines that support backreferences actually use an NFA. Some RE engines are smart enough to switch to using DFAs for regexes that are actually regular, but I don't think Python does this. – Laurence Gonsalves May 09 '09 at 18:53
  • I hate it when I make unneeded edits and then someone else with the same answer gets bumped up and gets 10 votes while i get 0. – Unknown May 09 '09 at 19:46
  • @Laurence: Good insight. I am curious to post somewhere about the implementation of REs in Python. I don't see why an NFA is used though. NFAs and DFAs are equivalent. You can use Thompson's subset construction to convert an NFA to a DFA. Did you mean you need to use a PDA so that a stack can keep track of how many a's you have seen? I'm not even sure about this because I am not completely sure about the syntax... but I am sure NFAs and DFAs are equivalent. – Tom May 09 '09 at 20:48
  • @Unknown: see my comment on your post. – Tom May 09 '09 at 20:52
  • @TZOTZIOY: I am going to add an update mentioning something I tried with this... let me know if you agree. – Tom May 09 '09 at 21:21
  • @Tom, quote from Rax the author: "I'm processing a string, and what I'm looking for is to find the index of first appearance of any of these substrings." If his point is to break on the "FIRST APPEARANCE" then how am I wrong? – Unknown May 12 '09 at 04:41
  • @Unknown: I replied on your post. – Tom May 12 '09 at 06:27
  • @Tom, no you are wrong. It searches through every single sentence until it finds the first match. It does not only search one sentence if no match was found there. – Unknown May 12 '09 at 06:37
  • @Unknown: no... I replied on your post... this is getting ridiculous. – Tom May 12 '09 at 07:32
  • @Tom, no you just have a very narrow viewpoint. Imagine the list as a list of sentences in a paragraph. It still makes sense to be able to find the first match in a list of strings and doesn't make my answer any less valid. – Unknown May 12 '09 at 07:37
  • @Unknown: wasn't saying your answer was invalid... in fact, I AM THE ONLY PERSON TO +1 IT... please don't put words in my mouth. – Tom May 12 '09 at 07:59
4
import re

subs = ['cat', 'fish', 'dog']
sentences = ['0123dog789cat']

pattern = re.compile("|".join(subs))

def search():
    for sentence in sentences:
        result = pattern.search(sentence)
        if result is not None:
            return (result.group(), result.start())

# search() -> ('dog', 4)
Unknown
  • I think he only has 1 "sentence" – Paolo Bergantino May 09 '09 at 07:30
  • Thanks, but this is not what I'm looking for. First, it does not find the first occurrence (in the second sentence it will return the occurrence of "cat", i.e. 10, instead of "dog"'s, i.e. 4). There are obvious solutions but it's very very brute force (iterate until the last substring and constantly maintain the first occurrence). I'm under the impression that Python must have some library function for this... – Roee Adler May 09 '09 at 07:30
  • I don't like when my answers get "sniped" either... but I didn't mean to steal your thunder. +1 because your solution is technically correct. Two comments: it does not discuss the scalability concerns that Rax had, and I don't like the "return" statement, as it would prematurely exit if you had more sentences in sentences. Other than that, it's short and to the point, and warrants some reputation. – Tom May 09 '09 at 20:51
  • @Tom, "I don't like the "return" statement, as it would prematurely exit if you had more sentences in sentences." But I thought Rax wanted to find the first match? – Unknown May 09 '09 at 22:57
  • @Unknown: the reason for my comment was that ***if*** you were to add more sentences to the sentences list, your code would short-circuit because it would only check the first sentence. ie - you shouldn't have used lists for subs and sentences if you weren't going to write code that generalized for larger lists. – Tom May 12 '09 at 02:58
  • Sorry, not just check the first sentence, but only check up to the first sentence that had a match (in this case, the first sentence). – Tom May 12 '09 at 02:59
  • @Unknown: (responding to your comment on my post): For the third time: all I meant was that you made a LIST (sentences) and your code produces the first match IN THE FIRST POSSIBLE SENTENCE.. I am merely saying that it would have been nice if your answer aggregated the first match IN EACH sentence since you wrote it to operate on lists. OR you could have just not used lists. Instead, you picked a way that is mildly confusing if someone were to go generalize it to performing a search on multiple sentences. ie - "Search for cat|fish|dog in each of 10 sentences in the list sentences". Make sense? – Tom May 12 '09 at 06:25
  • @Unknown: (responding to yet another comment on my post): No, I'm not wrong. You don't understand what I am saying because you think I am just trying to criticize you or something. I can't help but laugh at this back and forth. When I have more time, I will try to post as another answer so that I can show you code - then what I am saying will be crystal clear to you (I hope). Don't take what I was saying as an attack - it was a general suggestion about making code you write on posts more clear. – Tom May 12 '09 at 07:32
3

I just want to point out the time difference between DisplacedAussie's answer and Tom's answer. Both were fast when used once, so you shouldn't have any noticeable wait for either, but when you time them:

import random
import re
import string

words = []
letters_and_digits = string.letters + string.digits
for i in range(2000):
    chars = [random.choice(letters_and_digits) for j in range(10)]
    words.append("".join(chars))
search_for = re.compile("|".join(words))
first, middle, last = words[0], words[len(words) // 2], words[-1]
search_string = "%s, %s, %s" % (last, middle, first)

def _search(search_for, search_string):
    match_obj = search_for.search(search_string)
    # Note: if there is no match, match_obj is None
    if match_obj is not None:
        return (match_obj.start(), match_obj.group())

def _map(search_for, search_string):
    # Recover the word list from the compiled pattern, then scan per word.
    words_to_find = search_for.pattern.split("|")
    found = map(lambda x: (search_string.index(x), x), filter(lambda x: x in search_string, words_to_find))
    if found:
        return min(found, key=lambda x: x[0])


if __name__ == '__main__':
    from timeit import Timer


    t = Timer("_search(search_for, search_string)", "from __main__ import _search, search_for, search_string")
    print _search(search_for, search_string)
    print t.timeit()

    t = Timer("_map(search_for, search_string)", "from __main__ import _map, search_for, search_string")
    print _map(search_for, search_string)
    print t.timeit()

Outputs:

(0, '841EzpjttV')
14.3660159111
(0, '841EzpjttV')
# I couldn't wait this long

I would go with Tom's answer, for both readability and speed.

Nick Presta
  • Thanks Nick! In fairness to DisplacedAussie, you could help him out (a little bit) by removing the call to split("|") and just giving him a list to start with. To be more comprehensive, you ought to add the brute-force approach: for word in search_for:, index = search_string.index(word), if index < smallest_index:, # record the new smallest idx and the word that matched (sorry, can't write code in comments; see the sketch just below these comments). Then wait for and post all the times. This is a good thing to consider; I wish there were a special meta-post for things like this, since neither comments nor answer posts are good places. – Tom May 09 '09 at 22:44
  • +1 for actually doing benchmarks in a question about efficiency! – dbr May 10 '09 at 02:36
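
Written out, the brute-force loop Tom sketches in the comment above looks something like this (a hedged version; the helper name is made up, and .find is used instead of .index so a word that is absent doesn't raise ValueError):

def first_match_brute_force(search_string, words):
    # Scan once per word, keeping the smallest index seen so far.
    smallest_index, which_word = None, None
    for word in words:
        index = search_string.find(word)  # -1 if the word is absent
        if index != -1 and (smallest_index is None or index < smallest_index):
            smallest_index, which_word = index, word
    return (smallest_index, which_word)

# first_match_brute_force('0123dog789cat', ['cat', 'fish', 'dog'])  -> (4, 'dog')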
2

This is a vague, theoretical answer with no code provided, but I hope it can point you in the right direction.

First, you will need a more efficient lookup for your substring list. I would recommend some sort of tree structure (essentially a trie). Start with a root, then add an 'a' node if any substrings start with 'a', add a 'b' node if any substrings start with 'b', and so on. For each of these nodes, keep adding subnodes.

For example, if you have a substring with the word "ant", you should have a root node, a child node 'a', a grandchild node 'n', and a great grandchild node 't'.

Nodes should be easy enough to make.

class Node(object):
    def __init__(self, name):
        self.name = name
        self.children = []  # per-instance; a class-level list would be shared by every node

where name is a character.

Iterate through your strings letter by letter. Keep track of which letter you're on. At each letter, try to use the next few letters to traverse the tree. If you're successful, your letter number will be the position of the substring, and your traversal order will indicate the substring that was found.
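
A minimal sketch of that idea, using nested dicts as the tree (build_trie and find_first are illustrative names, not from this answer):

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = word  # sentinel: a complete substring ends here
    return root

def find_first(text, root):
    # Try to walk the trie starting at every position in the text.
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            if '$' in node:
                return (i, node['$'])
            if ch not in node:
                break
            node = node[ch]
        else:
            if '$' in node:
                return (i, node['$'])
    return None

# find_first('0123dog789cat', build_trie(['cat', 'fish', 'dog']))  -> (4, 'dog')

Like the cat|catdog regex example above, this returns the shortest substring when two of them start at the same position.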

Clarifying edit: DFAs should be much faster than this method, and so I should endorse Tom's answer. I'm only keeping this answer up in case your substring list changes often, in which case using a tree might be faster.

Wesley
  • Thanks, I completely understand the theory and practice of string indexing and searching, and can implement it myself, but I would expect Python to have a vehicle for this exact thing. I understand there's none? – Roee Adler May 09 '09 at 08:06
  • I don't know of such functionality built into Python, so I can't say whether it does or doesn't exist. As such, I'm afraid this answer doesn't help you in the least. The closest answer I see here is Tom's. – Wesley May 09 '09 at 22:34
0

First of all, I would suggest sorting the initial list in ascending order (of length), because scanning for a shorter substring is faster than scanning for a longer one.

Anonymous
  • Are you sure this makes a difference? If I were implementing the regex myself (as a DFA), the length would not matter. Every substring would be searched for at the same time. I am now curious as to how python implements regexes... – Tom May 09 '09 at 08:04
0

How about this one?

>>> substrings = ['cat', 'fish', 'dog']
>>> _string = '0123dog789cat'
>>> found = map(lambda x: (_string.index(x), x), filter(lambda x: x in _string, substrings))
>>> found
[(10, 'cat'), (4, 'dog')]
>>> if found:
...     min(found, key=lambda x: x[0])
...
(4, 'dog')

Obviously, you could return something other than a tuple.

This works by:

  • Filtering the list of substrings down to those that are in the string
  • Building a list of tuples containing the index of the substring, and the substring
  • If a substring has been found, find the minimum value based on the index
DisplacedAussie
  • This seems to be a terribly inefficient answer. It will surely scan the string multiple times. Even a brute-force approach where you manually use the string index() method for each string you are searching for (keeping track of the minimum on the fly) is better than this. map() can be a powerful function, but this is not an example of such a case. – Tom May 09 '09 at 21:02