I have two sets of strings (A
and B
), and I want to know all pairs of strings a in A
and b in B
where a
is a substring of b
.
The first step of coding this was the following:
for a in A:
for b in B:
if a in b:
print (a,b)
However, I wanted to know-- is there a more efficient way to do this with regular expressions (e.g. instead of checking if a in b:
, check if the regexp '.*' + a + '.*':
matches 'b'. I thought that maybe using something like this would let me cache the Knuth-Morris-Pratt failure function for all a
. Also, using a list comprehension for the inner for b in B:
loop will likely give a pretty big speedup (and a nested list comprehension may be even better).
I'm not very interested in making a giant leap in the asymptotic runtime of the algorithm (e.g. using a suffix tree or anything else complex and clever). I'm more concerned with the constant (I just need to do this for several pairs of A
and B
sets, and I don't want it to run all week).
Do you know any tricks or have any generic advice to do this more quickly? Thanks a lot for any insight you can share!
Edit:
Using the advice of @ninjagecko and @Sven Marnach, I built a quick prefix table of 10-mers:
import collections
prefix_table = collections.defaultdict(set)
for k, b in enumerate(B):
for i in xrange(len(prot_seq)-10):
j = i+10+1
prefix_table[b[i:j]].add(k)
for a in A:
if len(a) >= 10:
for k in prefix_table[a[:10]]:
# check if a is in b
# (missing_edges is necessary, but not sufficient)
if a in B[k]:
print (a,b)
else:
for k in xrange(len(prots_and_seqs)):
# a is too small to use the table; check if
# a is in any b
if a in B[k]:
print (a, b)