4

I'm a python n00b and I'd like some suggestions on how to improve the algorithm to improve the performance of this method to compute the Jaro-Winkler distance of two names.

def winklerCompareP(str1, str2):
"""Return approximate string comparator measure (between 0.0 and 1.0)

USAGE:
  score = winkler(str1, str2)

ARGUMENTS:
  str1  The first string
  str2  The second string

DESCRIPTION:
  As described in 'An Application of the Fellegi-Sunter Model of
  Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler
  and Yves Thibaudeau.

  Based on the 'jaro' string comparator, but modifies it according to whether
  the first few characters are the same or not.
"""

# Quick check if the strings are the same - - - - - - - - - - - - - - - - - -
#
jaro_winkler_marker_char = chr(1)
if (str1 == str2):
    return 1.0

len1 = len(str1)
len2 = len(str2)
halflen = max(len1,len2) / 2 - 1

ass1  = ''  # Characters assigned in str1
ass2  = '' # Characters assigned in str2
#ass1 = ''
#ass2 = ''
workstr1 = str1
workstr2 = str2

common1 = 0    # Number of common characters
common2 = 0

#print "'len1', str1[i], start, end, index, ass1, workstr2, common1"
# Analyse the first string    - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len1):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len2)
    index = workstr2.find(str1[i],start,end)
    #print 'len1', str1[i], start, end, index, ass1, workstr2, common1
    if (index > -1):    # Found common character
        common1 += 1
        #ass1 += str1[i]
        ass1 = ass1 + str1[i]
        workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:]
#print "str1 analyse result", ass1, common1

#print "str1 analyse result", ass1, common1
# Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - -
#
for i in range(len2):
    start = max(0,i-halflen)
    end   = min(i+halflen+1,len1)
    index = workstr1.find(str2[i],start,end)
    #print 'len2', str2[i], start, end, index, ass1, workstr1, common2
    if (index > -1):    # Found common character
        common2 += 1
        #ass2 += str2[i]
        ass2 = ass2 + str2[i]
        workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:]

if (common1 != common2):
    print('Winkler: Wrong common values for strings "%s" and "%s"' % \
                (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \
                ', common should be the same.')
    common1 = float(common1+common2) / 2.0    ##### This is just a fix #####

if (common1 == 0):
    return 0.0

# Compute number of transpositions    - - - - - - - - - - - - - - - - - - - - -
#
transposition = 0
for i in range(len(ass1)):
    if (ass1[i] != ass2[i]):
        transposition += 1
transposition = transposition / 2.0

# Now compute how many characters are common at beginning - - - - - - - - - -
#
minlen = min(len1,len2)
for same in range(minlen+1):
    if (str1[:same] != str2[:same]):
        break
same -= 1
if (same > 4):
    same = 4

common1 = float(common1)
w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1)

wn = w + same*0.1 * (1.0 - w)
return wn

Example output

ZIMMERMANN  ARMIENTO    0.814583333
ZIMMERMANN  ZIMMERMANN  1
ZIMMERMANN  CANNONS         0.766666667
CANNONS AKKER           0.8
CANNONS ALDERSON    0.845833333
CANNONS ALLANBY         0.833333333
Pentium10
  • 204,586
  • 122
  • 423
  • 502
Martlark
  • 14,208
  • 13
  • 83
  • 99
  • It would be helpful if you had a few examples that we could work off of ... a few name pairs and the correct scores. Otherwise its tough to know that our edits aren't breaking the function. – JudoWill Apr 30 '10 at 02:40
  • Thanks for editing question. I'll get those examples. – Martlark Apr 30 '10 at 04:11
  • I'm not getting the same outputs that you have for your example cases (except when the strings were exactly the same), but the w's were correct for the examples I tried from the Jaro-Winkler distance Wikipedia page. – Justin Peel Apr 30 '10 at 05:05
  • my outputs are wrong, i fix soon. so soory – Martlark May 12 '10 at 07:17
  • Thanks for all the comments. In the end I rewrote it in C; much, much faster. – Martlark Mar 12 '11 at 05:29

3 Answers3

4

I focused more on optimizing to get more out of Python than on optimizing the algorithm because I don't think that there is much of an algorithmic improvement to be had here. Here are some Python optimizations that I came up with.

(1). Since you appear to be using Python 2.x, change all range()'s to xrange()'s. range() generates the full list of numbers before iterating over them while xrange generates them as needed.

(2). Make the following substitutions for max and min:

start = max(0,i-halflen)

with

start = i - halflen if i > halflen else 0

and

end = min(i+halflen+1,len2)

with

end = i+halflen+1 if i+halflen+1 < len2 else len2

in the first loop and similar ones for the second loop. There's also another min() farther down and a max() near the beginning of the function so do the same with those. Replacing the min()'s and max()'s really helped to reduce the time. These are convenient functions, but more costly than the method I've replaced them with.

(3). Use common1 instead of len(ass1). You've kept track of the length of ass1 in common1 so let's use it rather than calling a costly function to find it again.

(4). Replace the following code:

minlen = min(len1,len2)
for same in xrange(minlen+1):
    if (str1[:same] != str2[:same]):
        break
same -= 1

with

for same in xrange(minlen):
    if str1[same] != str2[same]:
        break

The reason for this is mainly that str1[:same] creates a new string every time through the loop and you will be checking parts that you've already checked. Also, there's no need to check if '' != '' and decrement same afterwards if we don't have to.

(5). Use psyco, a just-in-time compiler of sorts. Once you've downloaded it and installed it, just add the lines

import psyco
psyco.full()

at the top of the file to use it. Don't use psyco unless you do the other changes that I've mentioned. For some reason, when I ran it on your original code it actually slowed it down.

Using timeit, I found that I was getting a decrease in time of about 20% or so with the first 4 changes. However, when I add psyco along with those changes, the code is about 3x to 4x faster than the original.

If you want more speed

A fair amount of the remaining time is in the string's find() method. I decided to try replacing this with my own. For the first loop, I replaced

index = workstr2.find(str1[i],start,end)

with

index = -1
for j in xrange(start,end):
    if workstr2[j] == str1[i]:
        index = j
        break

and a similar form for the second loop. Without psyco, this slows down the code, but with psyco, it speeds it up quite a lot. With this final change the code is about 8x to 9x faster than the original.

If that isn't fast enough

Then you should probably turn to making a C module.

Good luck!

Justin Peel
  • 46,722
  • 6
  • 58
  • 80
  • These changes improved performance of my test scripts by 25% without pysco. Thanks! – Martlark May 04 '10 at 07:57
  • xrange v range does not seem to do much using python 2.6.5 – Martlark May 05 '10 at 06:04
  • Yeah, maybe so. I haven't really looked at the changes in 2.6.5. Plus, we're not using very big of ranges so it shouldn't have a big effect. The reason to use xrange is doesn't form the whole list in memory before iterating over it. Instead, it generates each element as needed. – Justin Peel May 05 '10 at 06:18
4

I imagine you could do even better if you used the PyLevenshtein module. It's C and quite fast for most use cases. It includes a jaro-winkler function that gives the same output, but on my machine it's 63 times faster.

In [1]: import jw

In [2]: jw.winklerCompareP('ZIMMERMANN', 'CANNONS')
Out[2]: 0.41428571428571426

In [3]: timeit jw.winklerCompareP('ZIMMERMANN', 'CANNONS')
10000 loops, best of 3: 28.2 us per loop

In [4]: import Levenshtein

In [5]: Levenshtein.jaro_winkler('ZIMMERMANN', 'CANNONS')
Out[5]: 0.41428571428571431

In [6]: timeit Levenshtein.jaro_winkler('ZIMMERMANN', 'CANNONS')
1000000 loops, best of 3: 442 ns per loop
chmullig
  • 13,006
  • 5
  • 35
  • 52
  • pypy 1.6 can make the original code (which was actually taken from Febrl: http://sf.net/projects/febrl) runs 10 times faster. Just give it a little more time :) – sayap Sep 03 '11 at 00:27
  • I did end up using C to make a python module that was many times faster. And fixed a logic error. Thanks for the suggestion. – Martlark Nov 29 '16 at 11:06
0

In addition to everything that Justin says, concatenating strings is expensive - python has to allocate memory for the new string then copy both strings into it.

So this is bad:

ass1 = ''
for i in range(len1):
     ...
    if (index > -1):    # Found common character
        ...
        ass1 = ass1 + str1[i]

It will probably be faster to make ass1 and ass2 lists of characters and use ass1.append(str1[i]). As far as I can see from my quick read of the code the only thing you do with ass1 and ass2 afterwards is to iterate through them character by character so they do not need to be strings. If you did need to use them as strings later then you can convert them with ''.join(ass1).

Dave Kirby
  • 25,806
  • 5
  • 67
  • 84
  • I tried to make ass1 and ass2 lists and it slowed it down which surprised me. I also tried making workstr1 and workstr2 lists which also slowed it down for me. I really thought that at least one or the other would help, but string concatenation must be quite fast or something. – Justin Peel Apr 30 '10 at 14:40
  • i did not notice any speed up when doing this as well – Martlark May 03 '10 at 06:11