
I have a huge number of records containing sequences ('ATCGTGTGCATCAGTTTCGA...'), up to 500 characters. I also have a list of smaller sequences, usually 10-20 characters. I would like to use the Levenshtein distance in order to find these smaller sequences inside the records allowing for small changes or indels (L_distance <=2).

The problem is that I also want to get the start position of each of these smaller sequences, and `Levenshtein.distance` only compares two whole strings; it doesn't search for one string inside the other.

>>> import Levenshtein
>>> s1 = raw_input('first word: ')
first word: ATCGTAATACGATCGTACGACATCGCGGCCCTAGC
>>> s2 = raw_input('second word: ')
second word: TACGAT
>>> Levenshtein.distance(s1,s2)
29

In this example I would like to obtain the position (7) and the distance (0 in this case).
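For the exact-match case (distance 0), the standard library's `difflib` can already recover the position, as one commenter notes below; this is only a partial workaround, since it doesn't bound the Levenshtein distance:

```python
import difflib

s1 = 'ATCGTAATACGATCGTACGACATCGCGGCCCTAGC'
s2 = 'TACGAT'

# find_longest_match returns the longest block common to both strings;
# when the short sequence occurs exactly, .a is its start position in s1
m = difflib.SequenceMatcher(None, s1, s2).find_longest_match(0, len(s1), 0, len(s2))
print(m.a, m.size)  # -> 7 6
```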

Is there an easy way to solve this problem, or do I have to break the larger sequences into smaller ones and then run the Levenshtein distance for all of them? That might take too much time.

Thanks.

UPDATE: naive implementation generating all substrings, after first looking for an exact match.

def find_tag(pattern, text, errors):
    m = len(pattern)
    match = None  # returned as None when no window is within `errors`
    min_distance = errors + 1
    for i in range(len(text) - m + 1):
        distance = Levenshtein.distance(text[i:i+m], pattern)
        print text[i:i+m], distance  # to see all candidate windows
        if distance < min_distance:
            match = [i, distance]
            min_distance = distance
    return match

#Real example. In this case just looking for one pattern, but we have about 50.
import re, Levenshtein

text = 'GACTAGCACTGTAGGGATAACAATTTCACACAGGTGGACAATTACATTGAAAATCACAGATTGGTCACACACACATTGGACATACATAGAAACACACACACATACATTAGATACGAACATAGAAACACACATTAGACGCGTACATAGACACAAACACATTGACAGGCAGTTCAGATGATGACGCCCGACTGATACTCGCGTAGTCGTGGGAGGCAAGGCACACAGGGGATAGG' #Example of a record
pattern = 'TGCACTGTAGGGATAACAAT' #distance 1
errors = 2 #max errors allowed

match = re.search(pattern, text)

if match:
    print [match.start(), 0]  # first we look for an exact match
else:
    print find_tag(pattern, text, errors)
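If the C `Levenshtein` extension isn't available, the same sliding-window scan works with a plain dynamic-programming edit distance. This is a stdlib-only sketch; `levenshtein` here is a stand-in for `Levenshtein.distance`:

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance, O(len(a) * len(b)),
    # keeping only the previous row of the DP table
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

Sliding this over `text` inside `find_tag` should give the same `[position, distance]` pairs as the C extension, just more slowly.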
biojl
  • possible duplicate of [Checking fuzzy/approximate substring existing in a longer string, in Python?](http://stackoverflow.com/questions/17740833/checking-fuzzy-approximate-substring-existing-in-a-longer-string-in-python) – flup Nov 01 '13 at 10:50
  • I already read that one and it's not fully answered there. How do you get the position? The're just creating substrings as I pointed out in the original question. – biojl Nov 01 '13 at 11:14
  • get_matching_blocks docs say: _Return list of triples describing matching subsequences. Each triple is of the form (i, j, n), and means that a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in i and j._ Won't that give you the position? – flup Nov 01 '13 at 11:22
  • Do you need this to be efficient? Would just iterating over sub-strings of each sequence and calculating the L-distance be too slow? – taleinat Nov 01 '13 at 11:30
  • I still don't see how can I extract the position from that one-liner. Yes I need an efficient code since I have millions of records and the possible substrings to look for are about 50-60. – biojl Nov 01 '13 at 11:33
  • Did you look into https://pypi.python.org/pypi/regex/ ? It is mentioned in one of the answers to the other question. – Janne Karila Nov 01 '13 at 13:25
  • The other question mentioned does not contain the answer I'm looking for. Actually the code there doesn't even fully answer the question posted there. I edited the question with a solution that works for me but It's still not perfect as you can read in my comments to Tailenat answer below. – biojl Nov 11 '13 at 11:00

1 Answer


Assuming the maximum Levenshtein distance allowed is small, this can be done in a single pass while keeping a list of candidates for fuzzy matches.

Here is an example implementation I just worked up. It's not thoroughly tested, documented or optimized. But at least it works for simple examples (see below). I've tried to avoid having it return several matches due to skipping characters at the edges of the sub-sequence, but as I've said, I haven't thoroughly tested this.

If you're interested I'll be happy to clean this up, write some tests, do basic optimization and make this available as an open-source library.

from collections import namedtuple

Candidate = namedtuple('Candidate', ['start', 'subseq_index', 'dist'])
Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches(subsequence, sequence, max_l_dist=0):
    prev_char = None
    candidates = []
    for index, char in enumerate(sequence):
        for skip in range(min(max_l_dist+1, len(subsequence))):
            candidates.append(Candidate(index, skip, skip))
            if subsequence[skip] == prev_char:
                break
        new_candidates = []
        for cand in candidates:
            if char == subsequence[cand.subseq_index]:
                if cand.subseq_index + 1 == len(subsequence):
                    yield Match(cand.start, index + 1, cand.dist)
                else:
                    new_candidates.append(cand._replace(
                        subseq_index=cand.subseq_index + 1,
                    ))
            else:
                if cand.dist == max_l_dist or cand.subseq_index == 0:
                    continue
                # add a candidate skipping a sequence char
                new_candidates.append(cand._replace(dist=cand.dist + 1))
                # try skipping subsequence chars
                for n_skipped in range(1, max_l_dist - cand.dist + 1):
                    if cand.subseq_index + n_skipped == len(subsequence):
                        yield Match(cand.start, index + 1, cand.dist + n_skipped)
                        break
                    elif subsequence[cand.subseq_index + n_skipped] == char:
                        # add a candidate skipping n_skipped subsequence chars
                        new_candidates.append(cand._replace(
                            dist=cand.dist + n_skipped,
                            subseq_index=cand.subseq_index + n_skipped,
                        ))
                        break
        candidates = new_candidates
        prev_char = char

Now:

>>> list(find_near_matches('bde', 'abcdefg', 0))
[]
>>> list(find_near_matches('bde', 'abcdefg', 1))
[Match(start=1, end=5, dist=1), Match(start=3, end=5, dist=1)]
>>> list(find_near_matches('cde', 'abcdefg', 0))
[Match(start=2, end=5, dist=0)]
>>> list(find_near_matches('cde', 'abcdefg', 1))
[Match(start=2, end=5, dist=0)]
>>> match = _[0]
>>> 'abcdefg'[match.start:match.end]
'cde'
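The duplicate edge matches (e.g. the two overlapping results for `'bde'` above) can be post-filtered by grouping overlapping matches and keeping the lowest-distance one in each group. `best_matches` below is a hypothetical helper sketching that idea, not part of the answer's code:

```python
from collections import namedtuple

# same fields as the answer's Match namedtuple
Match = namedtuple('Match', ['start', 'end', 'dist'])

def best_matches(matches):
    # group matches that overlap the previously kept match, then keep the
    # lowest-distance one per group (ties broken toward the longer, earlier match)
    matches = sorted(matches, key=lambda m: m.start)
    groups = []
    for m in matches:
        if groups and m.start < groups[-1][-1].end:
            groups[-1].append(m)
        else:
            groups.append([m])
    return [min(g, key=lambda m: (m.dist, -(m.end - m.start), m.start))
            for g in groups]
```

For the `'bde'` example this keeps only `Match(start=1, end=5, dist=1)`, the match covering the whole pattern.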

EDIT:

Following this question, I'm writing a Python library for searching for nearly matching sub-sequences: fuzzysearch. It is still very much a work in progress.

For now, try the `find_near_matches_with_ngrams()` function! It should perform particularly well in your use-case.

taleinat
  • Hi @taleinat, this is great! Usually we expect to find the small sequences in the records without any modifications (distance = 0) or very few (1 or 2). I'm updating the question with the naive code I generated. I would like know how much faster is your solution. And it will be great to have a library, there's more people interested in this algorithm. – biojl Nov 01 '13 at 14:15
  • `>>> list(find_near_matches(pattern,text,1))` returns `[Match(start=3, end=24, dist=1), Match(start=5, end=24, dist=1)]`. In this case the real one is the first one, as you expect to find the whole pattern, not a piece. – biojl Nov 04 '13 at 09:38
  • 1
    Well, that's an example of issues with the edge of the subsequence. What if the entire subsequence is there except the first letter? The second match is just as valid as the first if your only criterion is Levenshtein distance. In any case, I'm still working on this, and will add filtering out of such duplicate matches, returning only the "best" one. – taleinat Nov 05 '13 at 13:40
  • You're right. I've noticed that my solution (and yours) fails to localize cases where an insertion/deletion is longer than 1 character. I'm putting `--` where the indel should be, i.e. `ATCTATCGTTTACGG ATCTAT--TTTACGGCCGTAGCTAGCTGACTGATCGTA` – biojl Nov 05 '13 at 14:30
  • It finds the pattern, but not in a distance of 2. For the same solution your code assigns a score of 6 and mine of 4. I'm still trying to understand what's happening. – biojl Nov 05 '13 at 14:39
  • Any progress on this @taleinat – biojl Nov 18 '13 at 17:12
  • 1
    As mentioned in my edit of the answer, I've published a Python module called `fuzzysearch` with relevant functionality. – taleinat Mar 12 '14 at 11:13
  • @taleinat What is the running time of the algorithm? Looks like it's polynomial. – Vitaly May 08 '14 at 05:37
  • Based on a quick analysis, the running time depends linearly on the length of the searched string and quadratically on the maximum allowed Levenshtein distance and the sub-sequence length. If the actual running time is important to you, take a look at my `fuzzysearch` library, which includes a much improved implementation which is considerably faster. – taleinat Jul 10 '14 at 10:35