increase speed of substring searching in a list

Question

I am quite new to Python and I am struggling to increase the speed of one piece of code.

I have a dictionary containing 500k DNA sequences. As a key, I have the identifier of the sequence, while as a value I have the corresponding DNA sequence. These sequences are of variable length (it is just a string containing CTACTA...) that could has 200 to 60k nucleotides. I need to remove DNA sequences that are substrings of larger sequences.

I wrote this:

def remove_subs():

    #Create a list of values based on reversed lenght
    LISTA=sorted(list(x for x in finaldic.values()), key=len, reverse=True)

    LISTA2=[]

    for a in range(len(LISTA)):
        #run the same list but in opposite direction 
        for b in range(len(sorted(LISTA,key=len))):
            if len(LISTA[b])<len(LISTA[a]):
                if LISTA[a].find(LISTA[b])!=-1 or Bio.Seq.reverse_complement(LISTA[a]).find(LISTA[b])!=-1 and LISTA[b]!=LISTA[a]:
                    LISTA2.append(LISTA[a])

I am trying to identify those substring sequences by running in two for loops, a list containing only the DNA sequences (ordered by length), in opposite directions using the built-in .find

This code works perfectly but takes ages to run such amount of information. I am quite sure that exists some faster option.

Can you help?

have you tried the obvious google search terms? I'm pretty sure there is related material *for* DNA sequences. — Karoly Horvath, Nov 08 '13 at 19:43
Working with DNA sequences is a big research field in computer science. There is *a lot* material for it. — poke, Nov 08 '13 at 19:44

score 1 · Accepted Answer · edited May 23 '17 at 10:25

From an algorithmic perspective, you likely should look at suffix trees. First, you build a generalized suffix tree from the strings you want to look in, which has an O(n) time complexity to construct (where n = number of characters in all strings to search through). Then, you can query that tree and as it if a substring is contained within it, which has a O(m) time complexity, where m is length of the substring. This, essentially, is as fast as it possibly could be.

Stack overflow question describing a few suffix tree libraries:

python: library for generalized suffix trees

Unfortunately, the examples here are not terribly mature codebases... There are C libraries which are significantly more focused on optimization and so on. Nonetheless, something like this suffix tree algorithm should be a simple drop-in replacement for your code:

import SubstringDict
d = SubstringDict.SubstringDict()
d['foobar'] = 1  
d['barfoo'] = 2
d['forget'] = 3
d['arfbag'] = 4

print(d['a'])
# [1, 2, 4]
print(d['arf'])
# [2, 4]
print (d['oo'])
# [1, 2]
print(d['food'])
# []

Searching and matching strings is a pretty large and active area in bioinformatics, and there is an entire body of literature on this problem.

score 0 · Answer 2 · edited May 23 '17 at 12:34

Just to clean this up so it's a bit more understandable:

def remove_subs():
    list_a = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    matches = []
    for first in list_a:
        for second in (sorted(list_a, key=len)):
            if first in second or first in Bio.Seq.reverse_complement(second):
                matches.append(first)
                break

You might see a speed-up just by using break.

This can be made smaller by using:

def remove_subs():
    list_a = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    matches = []
    for s in list_a:
        if any(substring in s for substring in list_a):
            matches.append(s)

Also, use this topic as a reference for algorithms.

score 0 · Answer 3 · answered Nov 08 '13 at 19:48

Here are some fixes that might increase your speed. At the least it will make your code more idiomatic to python.

def remove_subs():

    #Create a list of values based on reversed lenght
    list_a=sorted((x for x in finaldic.values()), key=len, reverse=True)

    list_a_2=[]

    for a in list_a:
        #run the same list but in opposite direction 
        for b in sorted(list_a,key=len):
            if len(b)<len(a):
                if b in a or b in Bio.Seq.reverse_complement(a) and b!=a:
                    list_a_2.append(a)

Two main changes: 1) Instead of using the .find method, I am using python's in operator to search. 2) Instead of indexing your lists, just loop over them directly.

You could probably get away with the if len(b) < len(a) condition, since b will never be in a if that is not true.

score 0 · Answer 4 · answered Nov 08 '13 at 21:01

0

I have an idea that could help, what about hashing the sequences? If the length of the smallest sequence is 200, then I would do a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash) with window size being 200. I would then use the hash as a key into a dictionary, which would hold a list of sequence identifiers. Then if there is a list of size > 1, it's a candidate for substring (there can be collisions), and you can use find.

answered Nov 08 '13 at 21:01

David Marek

548
2
9

That makes me think... You could try a bloom filter, if you don't need 100% accurate results. Bloom filters are set implementations that never give a false negative, but can infrequently (and adjustably) give a false positive. https://pypi.python.org/pypi/drs-bloom-filter – dstromberg Nov 08 '13 at 21:30

score 0 · Answer 5 · answered Nov 08 '13 at 21:27

Not having any test data or self-contained code, it's hard to test, but I'll point out that sorting inside a loop is rarely a good idea. This should bring the running time down to O(n^2) from O(n^3 * logn) :

def remove_subs():
    list_a_backward = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    list_a_forward = list_a_backward
    list_a_forward.reverse()

    matches = []
    for first in list_a_backward:
        for second in list_a_forward:
            if first in second or first in Bio.Seq.reverse_complement(second):
                matches.append(first)
                break

You could try Pypy too, since you appear to be running pure python. Failing that, numba or Cython might help.

increase speed of substring searching in a list

5 Answers5