3

I am trying to match a list of lastnames to a list of full names using Python 2.7 and the Levenshtein function. To reduce workload I only match if the first letters are identical (although this doesn't seem to make much of a difference performance-wise). If a match is found the matching word is removed from the full names (to make a subsequent first name matching easier). Both lists contain several ten thousand entries, so my solution is rather slow. How could I speed things up without parsing the fullnames? Here is what I have so far (I have omitted a few if-conditions for cases where the lastnames consist of several words):

import Levenshtein

listoflastnames=(['Jones', 'Sallah'])
listoffullnames=(['Henry', 'Jones', 'Junior'],['Indiana', 'Jones'])


def match_strings(lastname, listofnames):
    match=0
    matchedidx=[]
        for index, nameelement in enumerate(listofnames):        
            if lastname[0]==nameelement [0]:
                if Levenshtein.distance(nameelement, lastname)<2:
                    matchedidx.append(index)
                    match=match+1
    if match==1:
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
    return 1, newnamelist 
return 0, listofnames



for x in listoflastnames:
    for y in listoffullnames:
        match, newlistofnames=match_strings(x,y)
        if match==1:
            #go to first name match...

Any help would be appreciated!

Update: in the meantime I have used the multiprocessing module to let all of my 4 cores handle the issue instead of just one, but the matching still takes a lot of time.

MrFancypants
  • 440
  • 3
  • 11
  • ´Levenshtein.distance(g, publastnames[0]´ what is g and publastnames[0] here? – M4rtini Dec 09 '13 at 16:35
  • Sorry, that was a left-over from an older version. The Levenshtein function compares a last name and one word from a full name. I have corrected the mistake. – MrFancypants Dec 09 '13 at 16:48
  • 1
    If you're only going to perform the computations where the first letter is the same, you might want to break up the lists into dictionaries indexed by the first letter. Then you can do the comparisons only among viable candidates, instead of among everybody. Whether this will improve performance depends upon what fraction of the time is spent on this overhead as opposed to the distance calculation. – DSM Dec 09 '13 at 16:50
  • 1
    You could first split names into sublist by lengths. Two strings with `n` difference in length cannot have Levenstein difference less than `n`. –  Dec 09 '13 at 18:48

1 Answers1

1

This simplifies the for loop in the match_string function, but didn't increase the speed noticeably in my tests. The biggest loss is in the two for loops with lastnames and fullnames.

def match_strings(lastname, listofnames):
    firstCaseMatched = [name for name in listofnames if lastname[0] == name[0]]
    if len(firstCaseMatched):
        matchedidx = [index for index, ame in enumerate(firstCaseMatched) if Levenshtein.distance(lastname, name) < 2]
        match = len(matchedidx)
    else:
        match = 0
    if match == 1:
        newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
        return 1, newnamelist
    return 0, listofnames

You might have to sort the list of known last names, split them into a dict for each starting character. And then match each name in the list of names against that.

Assuming the fullnames list always has the first name as first element. You could limit the comparison to only the other elements.

Tom Zych
  • 13,329
  • 9
  • 36
  • 53
M4rtini
  • 13,186
  • 4
  • 35
  • 42
  • Thanks for all your suggestions. I have split the lastnames into a dictionary with first letters as keys as suggested. Together with multiprocessing the script is now about 20 times as fast as the original version. – MrFancypants Dec 10 '13 at 18:52
  • Btw, i assume you use Levenshtein distance since the names might be misspelled? Can you be sure the first letter in the name is correct then? – M4rtini Dec 10 '13 at 21:01
  • I can't be sure, but when I tested the script without excluding cases where the first letter matches for a subset of my data it seemed like a lot of false positives were introduced. As I'll have to remove false positives manually from the resulting set I'm willing to accept a slightly lower recall rate :) – MrFancypants Dec 10 '13 at 21:16
  • Not sure how expensive the Levenshtein algorith is. So this might not help at all. First you could try to do a direct comparison, assuming the direct comparison is significantly faster.(it's might not be for the cases where they are equal). And then do levenshtein on those that don't match. Second, if you've already matched the first character, only do levenshtein test on the rest of the name (lastname[1:]) – M4rtini Dec 10 '13 at 21:57
  • I tried both, surprisingly both ideas slow the script down very slightly according to cprofiler. Probably because the Levenshtein function is written in C. – MrFancypants Dec 12 '13 at 17:00