Oh, you're just counting how MANY lines have a match with a Hamming distance of 2 or less? That can be done much quicker.

    total_count = 0
    for line in f:
        # no need for s = f.readline() here, since that's what `line` is in this loop
        line = line.strip()  # just in case
        for ll in l:
            if hamming(line, ll) <= 2:
                total_count += 1
                break  # skip the rest of the ll in l loop
    # and then you don't need any processing afterwards either.
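
For reference, here is a minimal self-contained sketch of that loop. Your own `hamming` isn't shown, so the `hamming` helper below is a stand-in I'm assuming, and `count_matches` is just a name I picked. `any` stops at the first match, which is the same thing the `break` does.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def count_matches(lines, candidates, max_distance=2):
    """Count the lines within max_distance of at least one candidate."""
    total_count = 0
    for line in lines:
        line = line.strip()
        # any() short-circuits on the first hit, like the break in the loop
        if any(hamming(line, ll) <= max_distance for ll in candidates):
            total_count += 1
    return total_count
```

You would call it as `count_matches(f, l)`, since iterating over an open file yields its lines.
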
Note that most of your run time will be spent on the line:

    if hamming(line, ll) <= 2:

So any way you can improve that algorithm will GREATLY improve your overall script speed. Boud's answer extols the virtues of `jellyfish`'s `hamming_distance` function, but without any personal experience I can't recommend it myself. However, his advice to use a faster implementation of Hamming distance is sound!
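
One cheap improvement that needs no extra library, offered as a sketch rather than a benchmarked result: since you only care whether the distance is at most 2, you can stop comparing as soon as a third mismatch shows up. The name `within_distance` is mine, not part of any library.

```python
def within_distance(a, b, limit=2):
    """Return True if equal-length strings a and b differ in at most `limit` positions."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    mismatches = 0
    for c1, c2 in zip(a, b):
        if c1 != c2:
            mismatches += 1
            if mismatches > limit:
                return False  # too many differences already, stop early
    return True
```

You would then write `if within_distance(line, ll):` in place of `if hamming(line, ll) <= 2:`. On very short strings the saving is probably small, so time it against your current version before committing to it.
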
Peter DeGlopper suggests blowing the `l` list into six different sets of "two or less Hamming distance" matches. Assuming 4-character strings, as the patterns in the comment below do, two strings are within a Hamming distance of 2 exactly when they agree in at least two positions. So you keep one set per pair of positions, holding every pair of characters that appears at those positions in `l`. This might look like:

    # hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
    hamming_sets = [set(), set(), set(), set(), set(), set()]
    for ll in l:
        # this should take the lion's share of time in your program
        hamming_sets[0].add(ll[0] + ll[1])
        hamming_sets[1].add(ll[0] + ll[2])
        hamming_sets[2].add(ll[0] + ll[3])
        hamming_sets[3].add(ll[1] + ll[2])
        hamming_sets[4].add(ll[1] + ll[3])
        hamming_sets[5].add(ll[2] + ll[3])

    total_count = 0
    for line in f:
        # and this should be fast, even if `f` is large
        line = line.strip()
        if (line[0]+line[1] in hamming_sets[0] or
                line[0]+line[2] in hamming_sets[1] or
                line[0]+line[3] in hamming_sets[2] or
                line[1]+line[2] in hamming_sets[3] or
                line[1]+line[3] in hamming_sets[4] or
                line[2]+line[3] in hamming_sets[5]):
            total_count += 1
You could possibly gain readability by making `hamming_sets` a dictionary of `transform_function: set_of_results` key-value pairs.

    hamming_sets = {lambda s: s[0]+s[1]: set(),
                    lambda s: s[0]+s[2]: set(),
                    lambda s: s[0]+s[3]: set(),
                    lambda s: s[1]+s[2]: set(),
                    lambda s: s[1]+s[3]: set(),
                    lambda s: s[2]+s[3]: set()}
    for func, set_ in hamming_sets.items():
        for ll in l:
            set_.add(func(ll))

    total_count = 0
    for line in f:
        line = line.strip()
        if any(func(line) in set_ for func, set_ in hamming_sets.items()):
            total_count += 1
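
If writing out the six position pairs by hand feels error-prone, `itertools.combinations` can generate them. The following is only a sketch under my own assumptions: every string has exactly `length` characters, and the names `build_position_sets` and `count_with_sets` are made up for illustration. It relies on the fact that two strings of length `length` are within `max_distance` of each other exactly when they agree on some group of `length - max_distance` positions.

```python
from itertools import combinations


def build_position_sets(candidates, length=4, max_distance=2):
    """Map each group of positions to the character tuples seen there in candidates."""
    keep = length - max_distance  # number of positions that must agree
    position_sets = {
        positions: set() for positions in combinations(range(length), keep)
    }
    for ll in candidates:
        for positions, seen in position_sets.items():
            seen.add(tuple(ll[i] for i in positions))
    return position_sets


def count_with_sets(lines, position_sets):
    """Count lines that agree with some candidate on at least one group of positions."""
    total_count = 0
    for line in lines:
        line = line.strip()
        if any(tuple(line[i] for i in positions) in seen
               for positions, seen in position_sets.items()):
            total_count += 1
    return total_count
```

With the defaults this builds the same six sets as above, keyed by `(0, 1)`, `(0, 2)` and so on, and you would call `count_with_sets(f, build_position_sets(l))`. Note that a blank or short line would raise an `IndexError` here, so filter those out first if your file can contain them.
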