0

How can I optimize this edit distance code, i.e. finding the number of bits changed between two values? e.g. word1 = '010000001000011111101000001001000110001' and word2 = '010000001000011111101000001011111111111'.

When I try to run it on Hadoop it takes ages to complete.

How can I reduce the for loops and comparisons?

#!/usr/bin/python

import os, re, string, sys
from numpy import zeros

def calculateDistance(word1, word2):
    x = zeros((len(word1)+1, len(word2)+1))
    for i in range(0, len(word1)+1):
        x[i, 0] = i
    for i in range(0, len(word2)+1):
        x[0, i] = i
    for j in range(1, len(word2)+1):
        for i in range(1, len(word1)+1):
            if word1[i-1] == word2[j-1]:
                x[i, j] = x[i-1, j-1]
            else:
                minimum = x[i-1, j] + 1
                if minimum > x[i, j-1] + 1:
                    minimum = x[i, j-1] + 1
                if minimum > x[i-1, j-1] + 1:
                    minimum = x[i-1, j-1] + 1
                x[i, j] = minimum
    return x[len(word1), len(word2)]
    Perhaps you could explain your algorithm a bit more? I don't see why you would have nested loops at all; don't you just want to compare each pair of corresponding bits? – Karl Knechtel Aug 12 '11 at 06:23
  • @Karl: depends on the actual edit distance metric the OP is using (Hamming, Levenshtein, Jaro-Winkler, other?). – Michael Foukarakis Aug 12 '11 at 06:59

3 Answers

4

I looked for a bit counting algorithm online, and I found this page, which has several good algorithms. My favorite there is a one-line function which claims to work for Python 2.6 / 3.0:

return sum( b == '1' for b in bin(word1 ^ word2)[2:] )

I don't have Python, so I can't test, but if this one doesn't work, try one of the others. The key is to count the number of 1's in the bitwise XOR of your two words, because there will be a 1 for each difference.
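
For instance, a small sketch of that idea, assuming the inputs are bit strings like the ones in the question (converted to integers with int(s, 2) before XORing):

```python
# Hamming distance via XOR: convert the bit strings to integers,
# XOR them, and count the 1 bits in the result.
word1 = '010000001000011111101000001001000110001'
word2 = '010000001000011111101000001011111111111'

diff = int(word1, 2) ^ int(word2, 2)   # a 1 bit wherever the words differ
print(bin(diff).count('1'))            # prints 7
```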

You are calculating the Hamming distance, right?

EDIT: I'm trying to understand your algorithm, and the way you're manipulating the inputs, it looks like they are actually arrays, and not just binary numbers. So I would expect that your code should look more like:

return sum( a != b for a, b in zip(word1, word2) )

EDIT2: I've figured out what your code does, and it's not the Hamming distance at all! It's actually the Levenshtein distance, which counts how many insertions, deletions, or substitutions are needed to turn one string into another (the Hamming distance only counts substitutions, and so is only defined for strings of equal length). Looking at the Wikipedia page, your algorithm is more or less a straight port of the pseudocode there. As it points out, the time and space complexity of comparing strings of length m and n is O(mn), which is pretty bad. It also suggests a few optimizations depending on your needs, but I don't know what you use this function for, so I can't say which would be best. If the Hamming distance is good enough for you, the code above should suffice (time complexity O(n)), but note that the two metrics give different results on some pairs of strings, even of equal length: '0101010101' and '1010101010' have Hamming distance 10 (flip every bit) but Levenshtein distance 2 (remove the first 0 and append it at the end).
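
To see the difference concretely, here is a sketch comparing the two metrics on that pair (the levenshtein helper below is a minimal DP implementation written for this illustration, not code from the question):

```python
def hamming(a, b):
    # Only defined for equal-length strings: count differing positions.
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    # Minimal O(len(a)*len(b)) dynamic-programming implementation,
    # keeping only the previous row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(hamming('0101010101', '1010101010'))      # prints 10
print(levenshtein('0101010101', '1010101010'))  # prints 2
```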

Mario Carneiro
  • +1, your second answer looks good to me, and gives the same result (7) with considerably less fuss! (N.B. the inputs are just strings, which do behave like arrays). – Scott Griffiths Aug 12 '11 at 07:15
  • Your solutions will only work correctly for words of equal size. – Michael Foukarakis Aug 12 '11 at 07:21
  • @Michael I am aware of that, but that's not so much my fault as Hamming's - if you use the Hamming distance definition, you must also make the assumption that the sizes of the words are equal. Initially, the OP's text suggested that he wanted the Hamming distance (and his example uses two equal-sized words), but his code is a Levenshtein distance calculator, which need not make such an assumption. – Mario Carneiro Aug 12 '11 at 08:48
  • That's true, I guess lack of caffeine failed me. :-) – Michael Foukarakis Aug 12 '11 at 08:49
3

Since you haven't specified which edit distance you're using, I'm going to go out on a limb and assume it's the Levenshtein distance. In that case, you can shave off some operations here and there:

def levenshtein(a, b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n, m)) space.
        # Not really important to the algorithm anyway.
        a, b = b, a
        n, m = m, n

    # Only two rows of the DP table are kept at any time.
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]

edit: also, you make no mention of your dataset. Depending on its characteristics, the implementation could be changed to take advantage of it.

Michael Foukarakis
0

Your algorithm seems to do a lot of work. It compares every bit to all bits in the opposite bit vector, meaning you get an algorithmic complexity of O(m*n). That is unnecessary if you are computing Hamming distance, so I assume you're not.

Your loop builds an x[i,j] matrix looking like this:

   0  1  0  0  0  0  0  0  1  0  0 ... (word1)
0  0  1  0  0  0  0  0  0  1
1  1  0  1  1  1  1  1  1  0
0  0  1  0  1  1  1  1  1  1
0  0  1  1  0  1  1  1  1  2
0  0  1  1  1  0  1  1  1  2
0  0  1  1  1  1  0  1  1  2
1
1
...
(example word2)

This may be useful for detecting certain types of edits, but without knowing what edit distance algorithm you are trying to implement, I really can't tell you how to optimize it.

Jens Roland