1

I have 2 files with around 100,000,000 lines that need to be compared to each other. As stated in the title I want to compare each line from the files to each other. I have the code below which works absolutely fine, however I wish to adapt it so that if a mismatch occurs during a long match then it's accepted with a mismatch level of 5%.

Below is the function I use for matching the lines of the files.

ret1 = []
merging = {} 
def slide_merge(seq1, seq2):
    for i in xrange(min(len(seq1), len(seq2))):
        if seq1[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq2[i] == 'N':
            ret1.append(seq1[i])
            print (''.join(ret1))
        elif seq1[i] != seq2[i]:
            break
        else:
            ret1.append(seq1[i])
            print (''.join(ret1))
    print ("strings share a longest common prefix of length:", len(ret1), "out of:", len(seq1))
    ret1len = len(ret1)
    merging[''.join(ret1)] = ret1len # Adds details to dictionary
    return merging

The below code is how the above function is used within the code and how I get the longest match.

while len(rc1u) >= 50: # So matches of 8 are included
    slide_merge(rc1u, rc2wr)      ### rc1u all cut up here so of no further use
    rc1u = rc1u[1:]
merging
max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
highest

Incase it matters I am using HTSeq to input the files which are genetic sequencing.

So the question is how could I adapt this code or make another code which compares 2 strings and identifies the longest matching sequence from the start whilst allowing for 5% mismatches to occur so for example:

string1 = AAAAATTTTTCCCCCGGGGGTTTTT
string2 = AAAAATTTTTCCCCCGGGGATTTTT

The code should see that the 2 strings match entirely apart from 1 character but as that is less than 5% of the total the matched region should be stated as: matched 25

Michelle
  • 2,830
  • 26
  • 33
Tom
  • 469
  • 4
  • 7
  • 16
  • what is the question ? – Sikorski Jun 13 '14 at 11:57
  • Sorry will edit to make clearer. – Tom Jun 13 '14 at 11:59
  • @Sikorski FWIW, I don't think it's unclear what the question is (unless "how do I modify this to allow up to a 5% difference still count as a match" _isn't_ the question). – Michelle Jun 13 '14 at 12:00
  • @Michelle that is the question but incase anyone else doesn't get it I have added an example and restated the question at the bottom. – Tom Jun 13 '14 at 12:03
  • 1
    You probably want the edit distance between two strings. See http://stackoverflow.com/questions/2460177/edit-distance-in-python – HugoRune Jun 13 '14 at 12:06
  • Isn't this going to take `O(n²)` time? For `n = 1E8`, you're going to have troubles. – Veedrac Jun 13 '14 at 12:26

1 Answers1

1

You can compute the Levenshtein distance between those words and then find the percentage of 'mismatch between those words'.

An example of implemenation is provided here.

Let's say that the function computing the distance between two strings is called dis_lev, you could evaluate the percentage this way:

from __future__ import division

distance = dis_lev(string1, string2)
mismatch_ratio = distance / len(string1)
if mismatch_ratio > 0.05:
    raise MyAwesomeException("Hey ! These things do not match at all !")

For example, using your example and the iterative implementation available in the links I've provided:

>>> distance = dis_lev("AAAAATTTTTCCCCCGGGGGTTTTT", "AAAAATTTTTCCCCCGGGGATTTTT")
>>> distance
1
>>> mismatch_ratio = distance / len("AAAAATTTTTCCCCCGGGGGTTTTT")
0.04 

Edit: Depending on your case, you could use another metrics, some are listed here

Ketouem
  • 3,820
  • 1
  • 19
  • 29
  • 1
    I seriously doubt you can compute the Levenshtein distance between files of `100M` in reasonable time. I think a heuristic approach will be more appropriate... – Willem Van Onsem Jun 13 '14 at 12:29
  • @CommuSoft how would a heuristic approach be applied? – Tom Jun 13 '14 at 15:53
  • 1
    @Tom you may be interested by the following http://stackoverflow.com/questions/49263/approximate-string-matching-algorithms people discussed algorithm that relates to biological data. – Ketouem Jun 13 '14 at 16:09