
I have two sets of strings (A and B), and I want to know all pairs of strings a in A and b in B where a is a substring of b.

The first step of coding this was the following:

for a in A:
    for b in B:
        if a in b:
            print (a,b)

However, I wanted to know: is there a more efficient way to do this with regular expressions (e.g. instead of checking `if a in b:`, checking whether the regexp `'.*' + a + '.*'` matches `b`)? I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for each `a`. Also, using a list comprehension for the inner `for b in B:` loop will likely give a pretty big speedup (and a nested list comprehension may be even better).

I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A and B sets, and I don't want it to run all week).

Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!


Edit:

Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:

    import collections
    prefix_table = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i in xrange(len(b) - 9):
            j = i + 10
            prefix_table[b[i:j]].add(k)

    for a in A:
        if len(a) >= 10:
            for k in prefix_table[a[:10]]:
                # check if a is in B[k]
                # (a matching 10-mer prefix is necessary, but not sufficient)
                if a in B[k]:
                    print (a, B[k])
        else:
            for k in xrange(len(B)):
                # a is too short to use the table; check if
                # a is in any b
                if a in B[k]:
                    print (a, B[k])
  • Can you roughly say 1. how many strings are in `A`, 2. how many strings are in `B`, 3. how many hits do you get on average for each string in `A`? – Sven Marnach Nov 27 '11 at 21:01
  • `len(A)` is around `70E3`, `len(B)` is around `10E3`. The average number of hits for each string in `A` is around 5 (matches are quite sparse). – user Nov 27 '11 at 21:08
  • Is your question "Can regular expressions make this faster?" or "Is there a way to make this faster with a better algorithm?". Please consider clarifying the scope of your question. – ninjagecko Nov 27 '11 at 21:08
  • What is the distribution of lengths of A-strings and B-strings? – ninjagecko Nov 27 '11 at 21:09
  • The `A` strings are generally between 6 and 30 characters long and the `B` strings are between 100 and 1000 characters long. – user Nov 28 '11 at 18:23
  • @ninjagecko I'm trying to find something more efficient than the "Occam's razor" code from the question, but something that's simple enough to code. It's the classic python problem-- I just need to use this around a hundred times, but on my machine, the naive way will take all week... – user Nov 28 '11 at 18:28

4 Answers


Of course you can easily write this as a list comprehension:

[(a, b) for a in A for b in B if a in b]

This might slightly speed up the loop, but don't expect too much. I doubt using regular expressions will help in any way with this one.

Edit: Here are some timings:

import itertools
import timeit
import re
import collections

with open("/usr/share/dict/british-english") as f:
    A = [s.strip() for s in itertools.islice(f, 28000, 30000)]
    B = [s.strip() for s in itertools.islice(f, 23000, 25000)]

def f():
    result = []
    for a in A:
        for b in B:
            if a in b:
                result.append((a, b))
    return result

def g():
    return [(a, b) for a in A for b in B if a in b]

def h():
    res = [re.compile(re.escape(a)) for a in A]
    return [(a, b) for a in res for b in B if a.search(b)]

def ninjagecko():
    d = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i, j in itertools.combinations(range(len(b) + 1), 2):
            d[b[i:j]].add(k)
    return [(a, B[k]) for a in A for k in d[a]]

print "Nested loop", timeit.repeat(f, number=1)
print "List comprehension", timeit.repeat(g, number=1)
print "Regular expressions", timeit.repeat(h, number=1)
print "ninjagecko", timeit.repeat(ninjagecko, number=1)

Results:

Nested loop [0.3641810417175293, 0.36279606819152832, 0.36295199394226074]
List comprehension [0.362030029296875, 0.36148500442504883, 0.36158299446105957]
Regular expressions [1.6498990058898926, 1.6494300365447998, 1.6480278968811035]
ninjagecko [0.06402897834777832, 0.063711881637573242, 0.06389307975769043]

Edit 2: Added a variant of the algorithm suggested by ninjagecko to the timings. You can see it is much better than all the brute-force approaches.

Edit 3: Used sets instead of lists to eliminate the duplicates. (I did not update the timings -- they remained essentially unchanged.)

Sven Marnach
  • note that `ninjagecko()` produces a few duplicates, which you can see if you take bigger slices of the file (let's say slices of length 20000 for A and B) and compare the `len()` of the outcomes of `g()` and `ninjagecko()`, so you might want to take the `set()`. – tobigue Nov 27 '11 at 23:29
  • @eowl: Good point. (It doesn't change much about the results, though.) – Sven Marnach Nov 27 '11 at 23:34
  • Your implementation of @ninjagecko's method is tricky to read, but amazingly compact! What a great way to do this. – user Nov 28 '11 at 18:34
  • @Oliver: With the length distributions given above, it is enough to cache all substrings of lengths 6 to 30, which will be considerably less than all the substrings. – Sven Marnach Nov 28 '11 at 19:07

Let's assume your words are bounded at a reasonable size (let's say 10 letters). Do the following to achieve linear(!) time complexity, that is, O(A+B):

  • Initialize a hashtable or trie
  • For each string b in B:
    • For every substring of that string
      • Add the substring to the hashtable/trie (a string of at most 10 letters has at most 55 substrings, so this is no worse than 55*O(B) = O(B)), with metadata of which string it belonged to
  • For each string a in A:
    • Do an O(1) query to your hashtable/trie to find all B-strings it is in, yield those

(As of writing this answer, no response yet if OP's "words" are bounded. If they are unbounded, this solution still applies, but there is a dependency of O(maxwordsize^2), though actually it's nicer in practice since not all words are the same size, so it might be as nice as O(averagewordsize^2) with the right distribution. For example if all the words were of size 20, the problem size would grow by a factor of 4 more than if they were of size 10. But if sufficiently few words were increased from size 10->20, then the complexity wouldn't change much.)
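
For illustration, here is a rough Python sketch of this indexing idea, restricted to the 6-to-30-character range of the A-strings reported in the comments; the length bounds and the function name `substring_index_pairs` are assumptions for the sketch, not part of the original answer:

import collections

def substring_index_pairs(A, B, min_len=6, max_len=30):
    # Map every substring of every b whose length is within the bounds
    # to the set of B-indices that contain it.
    index = collections.defaultdict(set)
    for k, b in enumerate(B):
        for i in range(len(b)):
            for j in range(i + min_len, min(i + max_len, len(b)) + 1):
                index[b[i:j]].add(k)
    # Each a then needs only a single dictionary lookup.
    return [(a, B[k]) for a in A
            if min_len <= len(a) <= max_len
            for k in index.get(a, ())]

Capping the substring length keeps the table manageable for long B-strings, at the cost of needing the naive scan as a fallback for any A-string outside the range.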

edit: https://stackoverflow.com/q/8289199/711085 is actually a theoretically better answer. I was looking at the linked Wikipedia page before that answer was posted, and was thinking "linear in the string size is not what you want", and only later realized it's exactly what you want. Your intuition to build a regexp (Aword1|Aword2|Aword3|...) is correct since the finite-automaton which is generated behind the scenes will perform matching quickly IF it supports simultaneous overlapping matches, which not all regexp engines might. Ultimately what you should use depends on if you plan to reuse the As or Bs, or if this is just a one-time thing. The above technique is much easier to implement but only works if your words are bounded (and introduces a DoS vulnerability if you don't reject words above a certain size limit), but may be what you are looking for if you don't want the Aho-Corasick string matching finite automaton or similar, or it is unavailable as a library.
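
To make the alternation idea concrete, here is a hedged sketch using Python's built-in `re` module (the helper name is made up). As noted above, `re` yields only the leftmost, non-overlapping match at each scan position, so overlapping or shadowed A-words can be missed; an Aho-Corasick automaton does not have that limitation:

import re

def alternation_pairs(A, B):
    # One big alternation (Aword1|Aword2|...), longest words first so a
    # word is not shadowed by one of its own prefixes.
    pattern = re.compile("|".join(re.escape(a)
                                  for a in sorted(A, key=len, reverse=True)))
    pairs = set()
    for b in B:
        # finditer scans b once, but only reports non-overlapping matches,
        # so some (a, b) pairs can still be missed.
        for m in pattern.finditer(b):
            pairs.add((m.group(0), b))
    return pairs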

ninjagecko
  • While this is what the OP explicitly didn't ask for, it's the way to go anyway. It is quite easy to implement in Python. – Sven Marnach Nov 27 '11 at 21:26
  • I beg to disagree with your edit. While the linked answer is theoretically better, it is also much more complex to implement, while what you suggested are 5 lines of Python code -- see the update to my answer. – Sven Marnach Nov 27 '11 at 21:43
  • It seems deceptive to say that this algorithm is linear once we assume that the strings in B are bounded, because the figure of 55 is still derived from the length of B. The *algorithm* is `O(averagewordsize^2)`, and fixing the average word size in advance doesn't change that. – ben w Nov 27 '11 at 21:44
  • @SvenMarnach: I did mention in my edit that the above technique is easier to implement and may be preferable for that reason. – ninjagecko Nov 27 '11 at 21:45
  • @ben: I understand what you are trying to say, but what I said is completely straightforward and not deceptive at all. An algorithm takes a problem (which includes assumptions; a restricted problem is still a problem) and gives a solution plus running time. I was saying that it was likely the OP's problem is actually this restricted problem, and additionally unlikely to be the `O(averagewordsize^2)` problem. Is an `O(X)` algorithm which applies to only positive integers not really `O(X)` because the best-known algorithm which works for negative integers too is `O(X^2)`? No, it's still `O(X)`. – ninjagecko Nov 27 '11 at 21:54
  • +1 Your "word hashing" idea is great. At first I thought it wouldn't be trivial to code, but the above implementation does it justice! – user Nov 28 '11 at 18:32

A very fast way to search for a lot of strings is to make use of a finite automaton (so you were not that far off with your regexp guess), namely the Aho-Corasick string matching machine, which is used in tools like grep, virus scanners and the like.

First it compiles the strings you want to search for (in your case the words in A) into a finite-state automaton with a failure function (see the paper from '75 if you are interested in the details). This automaton then reads the input string(s) and outputs all search strings that are found (you will probably want to modify it a bit so that it also outputs the string in which the search string was found).

This method has the advantage that it searches all search strings at the same time and thus needs to look at every character of the input string(s) only once (linear complexity)!

There are implementations of the Aho-Corasick pattern matcher on PyPI, but I haven't tested them, so I can't say anything about the performance, usability or correctness of these implementations.


EDIT: I tried this implementation of the Aho-Corasick automaton and it is indeed the fastest of the suggested methods so far, and also easy to use:

import pyahocorasick

def aho(A, B):
    t = pyahocorasick.Trie()
    for a in A:
        t.add_word(a, a)
    t.make_automaton()
    return [(s,b) for b in B for (i,res) in t.iter(b) for s in res]

One thing I observed, though, when testing this implementation with @SvenMarnach's script, was that it yielded slightly fewer results than the other methods, and I am not sure why. I wrote a mail to the creator; maybe he will figure it out.

tobigue

There are specialized index structures for this; see for example http://en.wikipedia.org/wiki/Suffix_tree

You'd build a suffix-tree or something similar for B, then use A to query it.
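
As a rough illustration of that query pattern, here is a sketch using a sorted list of suffixes (a poor man's suffix array rather than a real suffix tree); the function names are made up, and storing every suffix explicitly is memory-hungry for long B-strings:

import bisect

def build_suffix_index(B):
    # (suffix, index of the originating string) for every suffix of every b,
    # sorted lexicographically so prefix queries become binary searches.
    entries = sorted((b[i:], k) for k, b in enumerate(B) for i in range(len(b)))
    keys = [s for s, _ in entries]
    return keys, entries

def find_pairs(A, B):
    keys, entries = build_suffix_index(B)
    pairs = set()
    for a in A:
        # a is a substring of b exactly when a is a prefix of some suffix of b;
        # all such suffixes form one contiguous block in the sorted order.
        i = bisect.bisect_left(keys, a)
        while i < len(keys) and keys[i].startswith(a):
            pairs.add((a, B[entries[i][1]]))
            i += 1
    return pairs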

Has QUIT--Anony-Mousse