4

I have two really long lists, and for each element of the first list I want to find the string in the second list with the longest common substring.

A simplified example is

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]

So I want to find the best match for "a_b_c" in L2 ("xy_b_c"), then the best match for "d_e_f" in L2 ("z_d_e_y"). For me, the best match is the string with the longest run of common characters. I looked at examples for the Levenshtein distance, which works just fine for small lists (http://www.stavros.io/posts/finding-the-levenshtein-distance-in-python/), but my list L2 has 163531 elements and it hasn't found even one match in the last 15 minutes.

I do not have a CS background; can someone point me to a better algorithm (or, even better, its implementation? :) )? Thanks a ton.
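To make the scoring concrete, here is a rough sketch of what I mean by "longest common substring" (a straightforward dynamic program; the function names are just my own, not code I am actually running):

```python
def longest_common_substring(a, b):
    """Length of the longest string that appears contiguously in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # prev[j]: longest common suffix of a[:i-1] and b[:j]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for s in L1:
    # prints xy_b_c, then z_d_e_y (both share a length-4 substring with s)
    print(max(L2, key=lambda c: longest_common_substring(s, c)))
```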

Current code (copied from the link and from someone else on Stack Overflow):

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
        distance_matrix[i][0] = i
    for j in range(second_length):
        distance_matrix[0][j] = j
    for i in range(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

for string in L1:
    print sorted(L2,key = lambda x:levenshtein_distance(x,string))[0]

edit: I just hit Ctrl+C and it gave me an incorrect (but close) answer after 15 minutes. That's only for the first string, and there are a lot of them left.

Illusionist

2 Answers

8

Use the difflib module:

>>> from functools import partial
>>> from difflib import SequenceMatcher
>>> def func(x, y):
...     s = SequenceMatcher(None, x, y)
...     return s.find_longest_match(0, len(x), 0, len(y)).size
... 
>>> for item in L1:
...     f = partial(func, item)
...     print max(L2, key=f)
... 
xy_b_c
z_d_e_y
Ashwini Chaudhary
  • Thanks, that works for smaller lists for sure. Not sure how well it will perform on my huge lists, though. Running it now to see. – Illusionist Dec 06 '13 at 10:02
  • Much faster, thanks!! Not very fast, but I can live with it. I will mark the question as answered tomorrow if no one else replies, but I am hoping someone might have a quicker solution. – Illusionist Dec 06 '13 at 10:04
  • No idea such library exists. +1. – aIKid Dec 06 '13 at 12:19
  • 1
    It seems difflib gets buggy when working with long strings: http://stackoverflow.com/questions/20875795/python-passing-sequencematcher-in-difflib-an-autojunk-false-flag-yields-error – duhaime Jan 02 '14 at 04:06
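Regarding the long-string issue raised in the comment above: `SequenceMatcher` applies a "popular element" heuristic that can skew results on long second sequences, and it can be disabled with the `autojunk` parameter (available from Python 2.7.1 / 3.2 on; the function name below is my own). A sketch:

```python
from difflib import SequenceMatcher

def longest_match_size(x, y):
    # autojunk=False disables the popularity heuristic that can degrade
    # matching on long sequences (parameter added in Python 2.7.1 / 3.2)
    s = SequenceMatcher(None, x, y, autojunk=False)
    return s.find_longest_match(0, len(x), 0, len(y)).size

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for item in L1:
    # prints xy_b_c, then z_d_e_y, same as the answer above
    print(max(L2, key=lambda c: longest_match_size(item, c)))
```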
2

You can also take a look at the Levenshtein Python C extension module. I tested it on a random string from your example, and it appeared to be about 150 times faster than the pure-Python implementation. Combine it with min/max over the list as shown by Ashwini Chaudhary (min for distances, max for match sizes).
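A minimal sketch of that approach (the import assumes the `python-Levenshtein` package is installed; a slow pure-Python fallback is included so the snippet runs either way):

```python
try:
    from Levenshtein import distance  # C extension: pip install python-Levenshtein
except ImportError:
    def distance(a, b):
        # pure-Python fallback, much slower than the C extension
        if len(a) < len(b):
            a, b = b, a
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

L1 = ["a_b_c", "d_e_f"]
L2 = ["xx", "xy_a", "xy_b_c", "z_d_e", "zl_d", "z_d_e_y"]
for s in L1:
    # min, not max: we want the candidate with the smallest edit distance.
    # Note that edit distance can rank candidates differently than
    # longest-common-substring scoring does.
    print(min(L2, key=lambda c: distance(s, c)))
```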

root