I have two really long lists, and I want to find the longest common sub string for each element of the first list in the second list.
A simplified example is
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2("z_d_e_y"). Best match for me is the string with the longest common characters. In I looked at examples for Levenshtein Distance which works for small lists just fine (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and It hasn't been able to find even one match for the last 15 minutes..
I do not have a CS background, can someone point me to some better algorithm (or even better, its implementation? :) ) Thanks a ton.
Current code (copied off the link and someone else from stackoverflow):
L1= ["a_b_c","d_e_f"]
L2=["xx""xy_a","xy_b_c","z_d_e","zl_d","z_d_e_y"]
def levenshtein_distance(first, second):
"""Find the Levenshtein distance between two strings."""
if len(first) > len(second):
first, second = second, first
if len(second) == 0:
return len(first)
first_length = len(first) + 1
second_length = len(second) + 1
distance_matrix = [[0] * second_length for x in range(first_length)]
for i in range(first_length):
distance_matrix[i][0] = i
for j in range(second_length):
distance_matrix[0][j]=j
for i in xrange(1, first_length):
for j in range(1, second_length):
deletion = distance_matrix[i-1][j] + 1
insertion = distance_matrix[i][j-1] + 1
substitution = distance_matrix[i-1][j-1]
if first[i-1] != second[j-1]:
substitution += 1
distance_matrix[i][j] = min(insertion, deletion, substitution)
return distance_matrix[first_length-1][second_length-1]
for string in L1:
print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]
edit- just hit control+C and it gave me an incorrect(but close) answer after 15 minutes. Thats only for the first string and there's a lot of them left..