I have 2 files with around 100,000,000 lines each that need to be compared to each other. As stated in the title, I want to compare the lines of the two files to each other. The code below works absolutely fine; however, I wish to adapt it so that a mismatch occurring during a long match is accepted, up to a mismatch level of 5%.
Below is the function I use for matching the lines of the files.
    ret1 = []
    merging = {}
    def slide_merge(seq1, seq2):
        for i in xrange(min(len(seq1), len(seq2))):
            if seq1[i] == 'N':
                ret1.append(seq1[i])
                print (''.join(ret1))
            elif seq2[i] == 'N':
                ret1.append(seq1[i])
                print (''.join(ret1))
            elif seq1[i] != seq2[i]:
                break
            else:
                ret1.append(seq1[i])
                print (''.join(ret1))
        print ("strings share a longest common prefix of length:", len(ret1), "out of:", len(seq1))
        ret1len = len(ret1)
        merging[''.join(ret1)] = ret1len # Adds details to dictionary
        return merging
The code below shows how the function above is called and how I get the longest match.
    while len(rc1u) >= 50: # So matches of 8 are included
        slide_merge(rc1u, rc2wr) ### rc1u all cut up here so of no further use
        rc1u = rc1u[1:]
    merging
    max(merging.iteritems(), key=operator.itemgetter(1))[0]
    highest = max(merging.iteritems(), key=operator.itemgetter(1))[0]
    highest
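For comparison, the same sliding search could be written without the module-level `ret1` list and `merging` dictionary. This is only a minimal sketch, assuming the goal is the longest N-tolerant exact match over all left-trimmed versions of the first read. The names `common_prefix` and `best_sliding_match` and the `min_len` parameter are hypothetical and not part of the code above, and it is written to run on Python 2 or 3.

```python
def common_prefix(seq1, seq2, wildcard='N'):
    """Return the leading part of seq1 that matches seq2.

    A position counts as a match when the characters are equal or
    when either one is the wildcard character.
    """
    length = 0
    for a, b in zip(seq1, seq2):
        if a != wildcard and b != wildcard and a != b:
            break  # first real mismatch ends the prefix
        length += 1
    return seq1[:length]


def best_sliding_match(read1, read2, min_len=50):
    """Trim read1 from the left one base at a time and keep the longest match."""
    best = ''
    while len(read1) >= min_len:
        candidate = common_prefix(read1, read2)
        if len(candidate) > len(best):
            best = candidate
        read1 = read1[1:]  # slide by one base
    return best
```

Keeping the state local means each comparison starts from an empty prefix, so results from one pair of reads cannot leak into the next.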
In case it matters, I am using HTSeq to read the input files, which contain genetic sequencing data.
So the question is: how could I adapt this code, or write new code, that compares 2 strings and identifies the longest matching sequence from the start whilst allowing up to 5% mismatches? For example:

    string1 = "AAAAATTTTTCCCCCGGGGGTTTTT"
    string2 = "AAAAATTTTTCCCCCGGGGATTTTT"

The code should see that the 2 strings match entirely apart from 1 character. As 1 mismatch in 25 characters is 4%, which is less than 5% of the total, the matched region should be reported as: matched 25
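One possible reading of this requirement can be sketched as follows. It assumes that "5% mismatches" means the number of mismatches within the matched prefix may not exceed 5% of that prefix's length, that `N` in either string always counts as a match (as in the function above), and that a reported match must end on a matching position. The function name `tolerant_prefix_length` and its parameters are hypothetical.

```python
def tolerant_prefix_length(seq1, seq2, max_mismatch_rate=0.05, wildcard='N'):
    """Length of the longest prefix whose mismatch fraction stays within the limit.

    Assumptions (not from the original code): the limit applies to the
    prefix length itself, and the prefix must end on a matching position.
    """
    n = min(len(seq1), len(seq2))
    mismatches = 0
    best = 0
    for i in range(n):
        a, b = seq1[i], seq2[i]
        if a == wildcard or b == wildcard or a == b:
            # Accept this prefix if the mismatches seen so far are within budget.
            if mismatches <= max_mismatch_rate * (i + 1):
                best = i + 1
        else:
            mismatches += 1
            # No prefix of length <= n can absorb this many mismatches: stop early.
            if mismatches > max_mismatch_rate * n:
                break
    return best


string1 = "AAAAATTTTTCCCCCGGGGGTTTTT"
string2 = "AAAAATTTTTCCCCCGGGGATTTTT"
print("matched", tolerant_prefix_length(string1, string2))  # matched 25
```

The early exit matters for large inputs: once the mismatch count exceeds 5% of the shorter string, no longer prefix can qualify, so the rest of the line is skipped.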