I am trying to match a list of lastnames to a list of full names using Python 2.7 and the Levenshtein function. To reduce workload I only match if the first letters are identical (although this doesn't seem to make much of a difference performance-wise). If a match is found the matching word is removed from the full names (to make a subsequent first name matching easier). Both lists contain several ten thousand entries, so my solution is rather slow. How could I speed things up without parsing the fullnames? Here is what I have so far (I have omitted a few if-conditions for cases where the lastnames consist of several words):
import Levenshtein
listoflastnames=(['Jones', 'Sallah'])
listoffullnames=(['Henry', 'Jones', 'Junior'],['Indiana', 'Jones'])
def match_strings(lastname, listofnames):
match=0
matchedidx=[]
for index, nameelement in enumerate(listofnames):
if lastname[0]==nameelement [0]:
if Levenshtein.distance(nameelement, lastname)<2:
matchedidx.append(index)
match=match+1
if match==1:
newnamelist = [i for j, i in enumerate(listofnames) if j not in matchedidx]
return 1, newnamelist
return 0, listofnames
for x in listoflastnames:
for y in listoffullnames:
match, newlistofnames=match_strings(x,y)
if match==1:
#go to first name match...
Any help would be appreciated!
Update: in the meantime I have used the multiprocessing module to let all of my 4 cores handle the issue instead of just one, but the matching still takes a lot of time.