
I have a list of sequences l (many thousands of sequences): l = [ABCD,AABA,...]. I also have a file f with many 4-letter sequences (around a million of them). For every sequence in f I want to find the closest string in l and, if that closest match is within a Hamming distance of at most 2, increment the counter good_count. I wrote the following code for this, but it is very slow. I was wondering if it can be done faster.

def hamming(s1, s2):
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

f = open("input.txt", "r")

l = [ABCD,AABA,...]

good_count = 0
for s in f:
    x = f.readline()
    dist_array = []
    for ll in l:
        dist = hamming(x, ll)
        dist_array.append(dist)
    min_dist = min(dist_array)
    if min_dist <= 2:
        good_count += 1
print good_count

It works fast if l and f are small but takes too long for large l and f. Please suggest a quicker solution.

  • I'd suggest you parse that file first and save elements in a list, then apply `filter` and then return the size of the list. – Tony Tannous Jan 17 '17 at 20:17
  • A little advice: you can calculate the min during iteration, so you don't need another loop for `min(dist_array)`. – ᴀʀᴍᴀɴ Jan 17 '17 at 20:18
  • Another piece of advice - if the distance is 0, you can break the loop. – khachik Jan 17 '17 at 20:22
  • There are only ~457k possible 4 letter sequences. You may also want to consider eliminating the duplicates before you check each one. (Assuming we're limited to the 26 English letters) – MrAlexBailey Jan 17 '17 at 20:26
  • @khachik in fact if the distance is <= 2 you can break the loop, since we never use `ll` in anything! – Adam Smith Jan 17 '17 at 20:26
  • What is that `x = f.readline()` for? Do you want to skip every other line (those stored in `s` but never used)? – tobias_k Jan 17 '17 at 20:32
  • If all that's needed is the count, maybe use a data structure that can be put into a `set` to represent the elements of `l`? Basically blow out each pattern into eg `AB??`, `A?C?`, `A??D`, `?BC?`, `?B?D`, `??CD`, and store all the valid "if it matches this it is at most 2 away from a valid sequence" representations in a set. Then expand each line into all 6 of its corresponding elements and check for set membership. Six set membership tests should be significantly faster than iterating through a list of many thousands of sequences even on a big set. – Peter DeGlopper Jan 17 '17 at 20:34
  • @PeterDeGlopper do you mind if I add that to my answer? It's a great suggestion for large `f` datasets – Adam Smith Jan 17 '17 at 20:37
  • If you do need the closest match for something (not shown in the above code) you could try doing a set of all exact patterns, one of all distance one patterns, and all distance two patterns. That's still only 11 things to check for, but at some point the sets start to take noticeable memory. – Peter DeGlopper Jan 17 '17 at 20:37
  • @AdamSmith - feel free. – Peter DeGlopper Jan 17 '17 at 20:37

2 Answers


Use existing libraries, for instance jellyfish:

from jellyfish import hamming_distance

This gives you a C implementation of the Hamming distance.
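
For example, a minimal sketch of dropping it into the question's loop (not part of the original answer; the file handling and the contents of l follow the question, and it assumes jellyfish is installed):

from jellyfish import hamming_distance

l = ["ABCD", "AABA"]  # ... your full list of reference sequences
good_count = 0

with open("input.txt") as f:
    for line in f:
        seq = line.strip()
        # count the line if any reference sequence is within Hamming distance 2
        if any(hamming_distance(seq, ll) <= 2 for ll in l):
            good_count += 1

print(good_count)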

Zeugma

Oh, you're just counting how MANY have matches with a Hamming distance <= 2? That can be done much quicker.

total_count = 0

for line in f:
    # skip the x = f.readline() from the question; `line` already holds the current line
    line = line.strip()  # just in case
    for ll in l:
        if hamming(line, ll) <= 2:
            total_count += 1
            break  # skip the rest of the ll in l loop
    # and then you don't need any processing afterwards either.

Note that most of your code time will be spent on the line:

        if hamming(line, ll) <= 2:

So any way you can improve that algorithm will GREATLY improve your overall script speed. The other answer extols the virtues of jellyfish's hamming_distance function; without any personal experience I can't recommend it myself, but the advice to use a faster implementation of Hamming distance is sound!
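
One cheap pure-Python tweak along those lines (a sketch, not part of the original answer): since we only care whether the distance is <= 2, the check can bail out as soon as a third mismatch is found instead of summing over every position.

def within_hamming_2(s1, s2):
    # early-exit variant of hamming(): returns True/False instead of the full
    # distance, stopping as soon as more than 2 mismatches have been seen
    mismatches = 0
    for ch1, ch2 in zip(s1, s2):
        if ch1 != ch2:
            mismatches += 1
            if mismatches > 2:
                return False
    return True

For 4-character strings the saving per call is modest, but it avoids building and summing a generator for every comparison.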


Peter DeGlopper suggests blowing the l list out into six sets of "at most Hamming distance 2" matches: one set per pair of character positions, each holding the two characters that some string in l has at those positions. If a line from f agrees with a string in l on any one of the six position pairs, the two strings match in at least two of their four positions, so their Hamming distance is at most 2. This might look like:

# hamming_sets is [ {AB??}, {A?C?}, {A??D}, {?BC?}, {?B?D}, {??CD} ]
hamming_sets = [set(), set(), set(), set(), set(), set()]

for ll in l:
    # this should take the lion's share of time in your program
    hamming_sets[0].add(ll[0] + ll[1])
    hamming_sets[1].add(ll[0] + ll[2])
    hamming_sets[2].add(ll[0] + ll[3])
    hamming_sets[3].add(ll[1] + ll[2])
    hamming_sets[4].add(ll[1] + ll[3])
    hamming_sets[5].add(ll[2] + ll[3])

total_count = 0

for line in f:
    # and this should be fast, even if `f` is large
    line = line.strip()
    if line[0]+line[1] in hamming_sets[0] or \
       line[0]+line[2] in hamming_sets[1] or \
       line[0]+line[3] in hamming_sets[2] or \
       line[1]+line[2] in hamming_sets[3] or \
       line[1]+line[3] in hamming_sets[4] or \
       line[2]+line[3] in hamming_sets[5]:
        total_count += 1

You could possibly gain readability by making hamming_sets a dictionary of transform_function: set_of_results key-value pairs.

hamming_sets = {lambda s: s[0]+s[1]: set(),
                lambda s: s[0]+s[2]: set(),
                lambda s: s[0]+s[3]: set(),
                lambda s: s[1]+s[2]: set(),
                lambda s: s[1]+s[3]: set(),
                lambda s: s[2]+s[3]: set()}

for func, set_ in hamming_sets.items():
    for ll in l:
        set_.add(func(ll))

total_count = 0

for line in f:
    line = line.strip()
    if any(func(line) in set_ for func, set_ in hamming_sets.items()):
        total_count += 1
Peter DeGlopper
Adam Smith
  • I want to count the instances when the closest match has Hamming distance of 2. I don't want to over count. For example a string in f can have matches in l with Hamming distance 0,1,2,3,4. I want to take the closest match (in this case Hamming distance 0) and increment the counter once – Ssank Jan 17 '17 at 20:29
  • Which is exactly what this does -- the `break` in the `if` block stops all further iteration. This is known as "early exit" – Adam Smith Jan 17 '17 at 20:30
  • N.B. @Ssank that your code doesn't *as an end result* care about finding the closest match, it only cares if you've found a match with distance <= 2. You'll continue through the whole `l` list until you've found the closest match, but the result doesn't depend on it. If your intended result DOES depend on finding the closest match in all cases, you should amend your example code to match. Otherwise, this early exit will provide on average a much faster run time – Adam Smith Jan 17 '17 at 20:32
  • I think the example code to implement the `hamming_sets` needs to maintain spacing with a placeholder like the `?` in the example - just recording `AB` for pattern `ABCD` would mean that line `CDAB` would look like a match. – Peter DeGlopper Jan 17 '17 at 20:48
  • @PeterDeGlopper it shouldn't, since we're pulling the same characters out of the test line. The last edit might make that more clear – Adam Smith Jan 17 '17 at 20:50
  • @PeterDeGlopper probably not, since it's Python. A two-character array would be smaller than a four-character array in lower-level languages, but I'm sure Python abstracts them to take the same amount of memory regardless, and that's all you're really saving! – Adam Smith Jan 17 '17 at 20:53
  • By the by, this is the first time I've had to wonder to myself: "Can I key a dictionary by a function?" Turns out: yes! – Adam Smith Jan 17 '17 at 20:59
  • My implementation shows `sys.getsizeof("AB")` as 39, `sys.getsizeof("AB??")` as 41. So the object overhead clearly dominates, but the extra characters make a small difference. Not likely to matter in this case, and different implementations might behave differently. – Peter DeGlopper Jan 17 '17 at 21:00