4

I have two really long lists, and for each element of the first list I want to find the string in the second list with the longest common substring.

A simplified example is

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]

So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2 ("z_d_e_y"). For me, the best match is the string with the longest run of common characters. I looked at examples for the Levenshtein distance, which works just fine for small lists (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and it hasn't found even one match in the last 15 minutes.

I do not have a CS background; can someone point me to a better algorithm (or, even better, its implementation? :) )? Thanks a ton.
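To make the scoring concrete, here is a rough sketch of what I mean by "longest common substring" (a straightforward dynamic program; the function names are just my own, not code I am actually running):

```python
def longest_common_substring(a, b):
    """Length of the longest string that appears contiguously in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: longest common suffix of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for s in L1:
    # prints xy_b_c, then z_d_e_y (both share a length-4 substring with s)
    print(max(L2, key=lambda c: longest_common_substring(s, c)))
```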

Current code (copied from the link and from someone else on Stack Overflow):

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

for string in L1:
    print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]

edit: I just hit Ctrl+C and it gave me an incorrect (but close) answer after 15 minutes. That's only for the first string, and there are a lot of them left.

Illusionist

2 Answers

8

Use the difflib module:

>>> from functools import partial
>>> from difflib import SequenceMatcher
>>> def func(x, y):
...     s = SequenceMatcher(None, x, y)
...     return s.find_longest_match(0, len(x), 0, len(y)).size
... 
>>> for item in L1:
...     f = partial(func, item)
...     print max(L2, key=f)
... 
xy_b_c
z_d_e_y
Ashwini Chaudhary
  • Thanks, that works for smaller lists for sure. Not sure how well it will perform on my huge lists, though. Running it now to see. – Illusionist Dec 06 '13 at 10:02
  • Much faster, thanks!! Not very fast, but I can live with it. I will mark the question as answered tomorrow if no one else replies, but I am hoping someone might have a quicker solution. – Illusionist Dec 06 '13 at 10:04
  • No idea such library exists. +1. – aIKid Dec 06 '13 at 12:19
  • 1
    It seems difflib gets buggy when working with long strings: http://stackoverflow.com/questions/20875795/python-passing-sequencematcher-in-difflib-an-autojunk-false-flag-yields-error – duhaime Jan 02 '14 at 04:06
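Regarding the long-string issue raised in the comment above: `SequenceMatcher` applies a "popular element" heuristic that can skew results on long second sequences, and it can be disabled with the `autojunk` parameter (available from Python 2.7.1 / 3.2 on; the function name below is my own). A sketch:

```python
from difflib import SequenceMatcher

def longest_match_size(x, y):
    # autojunk=False disables the popularity heuristic that can degrade
    # matching on long sequences (parameter added in Python 2.7.1 / 3.2)
    s = SequenceMatcher(None, x, y, autojunk=False)
    return s.find_longest_match(0, len(x), 0, len(y)).size

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for item in L1:
    # prints xy_b_c, then z_d_e_y, same as the answer above
    print(max(L2, key=lambda c: longest_match_size(item, c)))
```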
2

You can also take a look at the Levenshtein Python C extension module. I tested it on a random string from your example, and it appeared to be about 150 times faster than the pure-Python implementation. Combine it with min/max over the list as shown by Ashwini Chaudhary (min for distances, max for match sizes).
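A minimal sketch of that approach (the import assumes the `python-Levenshtein` package is installed; a slow pure-Python fallback is included so the snippet runs either way):

```python
try:
    from Levenshtein import distance  # C extension: pip install python-Levenshtein
except ImportError:
    def distance(a, b):
        # pure-Python fallback, much slower than the C extension
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for s in L1:
    # min, not max: we want the candidate with the smallest edit distance.
    # Note that edit distance can rank candidates differently than
    # longest-common-substring scoring does.
    print(min(L2, key=lambda c: distance(s, c)))
```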

root