
I am working to produce a Python script that finds the (maximal) lengths of all n-word-long substrings shared by two strings, disregarding trailing punctuation. Given two strings:

"this is a sample string"

"this is also a sample string"

I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print matchStreakList

This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and has done so for all tests so far. My question is: Is there a pair of strings for which the approach above will not work? More to the point: Are there extant Python libraries or well-known approaches that can be used to find the maximum lengths of all n-word-length substrings that two strings share?

[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common subsequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier than the one mentioned in the link because I'm dealing with words bounded by whitespace.]

EDIT:

I'm starting a bounty on this question. In case it helps others, I wanted to clarify a few quick points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally-long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:

"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"

and

"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

Using the following routine:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print matches(a,b)

One gets output:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?

duhaime
  • Do you want the output to be a list of all common substrings? Is it okay for them to overlap? – Max Spencer Nov 28 '13 at 13:39
  • Try taking a look at [diff-match-patch](http://code.google.com/p/google-diff-match-patch/), it's a set of Google Code libraries that do fuzzy string matching; there may be something in there you can use. – wnnmaw Nov 28 '13 at 13:57
  • But "one two three" and "one two and two three" have two maximally long common substrings that overlap. – RemcoGerlich Nov 28 '13 at 13:57
  • So are you most interested in finding **the** longest substring, or something more complicated? I'm just thinking that if they can't overlap then it seems to make the problem a lot more complicated, because making one substring longer may make another one shorter, and you need a scoring system for deciding whether it's better to return two substrings of length 3 or one of length 2 and another of 4. – Max Spencer Nov 28 '13 at 14:02
  • I'm looking for all maximally long common substrings, whether or not they overlap. Sorry for misspeaking above. (I thought overlapping would prevent a substring from being maximally long, but @RemcoGerlich helped show me that this is not the case.) – duhaime Nov 28 '13 at 14:05
  • Is it intentional that "quill" is not in your desired output? "maximally-long common n-word-length substrings" is vague despite its wordiness ;-), but your `a` and `b` strings both end with "quill". Another: what do you want for inputs "x x" and "x x x y x x"? 3 instances of "x x"? 2 instances? Just 1? – Tim Peters Jan 04 '14 at 04:32

4 Answers


There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer large constant factors - they can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib and it doesn't look like your case either.
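
(To make those search terms concrete, here is a deliberately naive sketch of the suffix-array idea - a toy illustration added for this writeup, not anything difflib does: it builds the "suffix array" by brute-force sorting, and it reports only the single longest common word run rather than all the local maximums we're after.)

def longest_common_run(a, b):
    # a and b are lists of words. Join them with sentinel "words" that
    # can't occur in real text, so no match can cross the boundary.
    combined = a + ["\x00"] + b + ["\x01"]
    n = len(a)
    # Naive suffix array: every suffix start position, sorted by suffix.
    # Real implementations build this in linear time; this sort is the
    # brute-force stand-in.
    order = sorted(range(len(combined)), key=lambda i: combined[i:])
    best = []
    for x, y in zip(order, order[1:]):
        # only adjacent suffixes starting in *different* inputs matter
        if (x < n) == (y < n):
            continue
        # length of the common prefix of the two suffixes; the distinct
        # sentinels guarantee this loop stops before running off the end
        k = 0
        while combined[x + k] == combined[y + k]:
            k += 1
        if k > len(best):
            best = combined[x: x + k]
    return " ".join(best)

On the word lists for the two sentences discussed below, it should pick out one of the four-word runs. The worst-case linear-time versions replace the brute-force sort and the prefix scan with far more intricate machinery; this sketch only shows the data structures involved.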

Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method to return all the (locally) maximum matches it finds along the way. Notes:

  1. I'm going to use the to_words() function Raymond Hettinger gave you, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly like what you said you wanted.

  2. Nevertheless, as I noted in a comment already, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.

Here's the code:

import re
def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set() # set of all results
    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))
    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique

Then:

a = "They all are white a sheet of spotless paper " \
    "when they first are born but they are to be " \
    "scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
    "paper, when you first are born; but you are to " \
    "be scrawled and blotted by every goose's quill"

print match(to_words(a), to_words(b))

displays:

set(['all',
     'and blotted by every',
     'first are born but',
     'are to be scrawled',
     'are',
     'spotless paper when',
     'white a sheet of',
     'quill'])

EDIT - how it works

A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting the entries' meaning.

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

  1. M[i, j] = 0 if a[i] != b[j].
  2. M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).

Interpretation is also very easy in this case: there is a locally maximum non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out-of-bounds.

You can use those rules to write very simple & compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst cases) O(len(a) * len(b)) time and space.
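
To make the rules concrete, here is a tiny illustration (toy inputs chosen only for this example) that builds M the obvious way and prints it:

# toy inputs chosen only for illustration
a = "x a b".split()
b = "a b y".split()
M = [[0] * len(b) for _ in a]
for i in range(len(a)):
    for j in range(len(b)):
        if a[i] == b[j]:
            # an out-of-bounds northwest neighbor counts as 0
            M[i][j] = (M[i-1][j-1] if i and j else 0) + 1
for row in M:
    print(row)
# [0, 0, 0]
# [1, 0, 0]
# [0, 2, 0]

The 2 in the last row, with no in-bounds entry to its southeast, marks the locally maximum match "a b" of length 2.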

While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:

  • Instead of doing one pass to compute M, then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.

  • Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) are simultaneously present.

  • And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.

  • When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).

  • Finally, and partly by accident, all the preceding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.

EDIT - the dirt simple version

Here's code that implements the 2D matrix directly, with no attempts at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy:

def match(a, b):
    from collections import Counter
    M = Counter()
    # First pass: fill in the matrix. A Counter returns 0 for any
    # missing key, so out-of-bounds references silently act as zeroes.
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    # Second pass: a match ending at a[i] and b[j] is locally maximum
    # exactly when M[i, j] is non-zero but M[i+1, j+1] is zero.
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique

Of course ;-) that returns the same results as the optimized match() I posted at first.

EDIT - and another without a dict

Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value only depends on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only small constant space is needed, regardless of the inputs' lengths:

def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    for i, j in chain(((i, 0) for i in xrange(m)),
                      ((0, j) for j in xrange(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique
Tim Peters
  • This is truly outstanding. You just rocked my world. It took me a minute to figure out what was going on in `M[i, j] = M[i-1, j-1] + 1`, (mostly because I wasn't yet familiar with the Counter() method) but I've just figured it out and the whole procedure has become so clear. Thank you for this wonderful answer, and for the great explanation as well. – duhaime Jan 06 '14 at 00:43
  • :-) Glad you found it helpful! Now that you understand the matrix framework, you can go on to make good sense of programs to compute Levenshtein edit distance (and many more). Note that there's not much special about a `Counter` here; for example, `M = collections.defaultdict(int)` would do just as well. A `Counter()` and `defaultdict(int)` both return 0 when passed a key that isn't present, and that's the only special property the very simple `match()` needs. For peak speed, though, you want to stick to standard `dict`s, like the optimized `match()` uses. – Tim Peters Jan 07 '14 at 01:14
  • @TimPeters are you aware off-hand if pypy is compatible with your adapted `match()` (and perhaps `difflib`)? btw, the first `match()` above should probably `return unique` when it's done ;) – drevicko Jun 01 '14 at 20:40
  • @drevicko, don't know about pypy compatibility here. As to the first `match()`, `return unique` is its last statement - perhaps you need to scroll down to see it? – Tim Peters Jun 01 '14 at 20:46
  • Ah, indeed - mac's scroll bars don't appear until you scroll them, so I had assumed they weren't there. I'll give it a whirl with pypy and see what happens (: – drevicko Jun 02 '14 at 07:09

There are really four questions embedded in your post.

1) How do you split text into words?

There are many ways to do this depending on what you count as a word, whether you care about case, whether contractions are allowed, etc. A regular expression lets you implement your choice of word splitting rules. The one I typically use is r"[a-z'\-]+". It catches contractions like don't and allows hyphenated words like mother-in-law.
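
For instance, a quick check of that pattern (with an illustrative input, not one from the question):

import re
# illustrative input chosen for this example
print(re.findall(r"[a-z'\-]+", "don't tell my mother-in-law"))
# ["don't", 'tell', 'my', 'mother-in-law']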

2) What data structure can speed up the search for common subsequences?

Create a location map showing the positions where each word occurs. For example, in the sentence you should do what you like, the map entry for the word you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.

With a location map in hand, it is a simple matter to loop over the starting points to compare n-length subsequences.
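
A minimal sketch of building such a map for the example sentence above:

# the example sentence from the prose above
words = "you should do what you like".split()
locations = {}
for i, word in enumerate(words):
    locations.setdefault(word, []).append(i)
print(locations["you"])
# [0, 4]

(This is exactly what the starting_points() function in the working code below does.)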

3) How do I find common n-length subsequences?

Loop over all the words in one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.

4) How do I find the longest common subsequence?

The max() function finds a maximum value. It takes a key-function such as len() to determine the basis for comparison.
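
For instance, with a few of the matches from this thread:

>>> max(['all', 'white a sheet of', 'are'], key=len)
'white a sheet of'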

Here is some working code that you can customize to your own interpretation of the problem:

import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)

if __name__ == '__main__':

    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"

    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"

    w1 = to_words(t1)
    w2 = to_words(t2)

    for n in range(1,10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))
Raymond Hettinger
  • This was wonderfully helpful! Your proposal gave me a chance to learn about dictionaries, enumerate(), and get(), which I had encountered before but didn't have a chance to sit down and study until right now. Thank you for this most helpful answer! – duhaime Jan 04 '14 at 21:56

The difflib module would be a good candidate for this case; see get_matching_blocks:

import difflib

def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

first_string = "this is a sample string"
second_string = "this is also a sample string"
print matches(second_string, first_string)

demo: http://ideone.com/Ca3h8Z

DhruvPathak
  • This is helpful, but it doesn't identify all common maximally long n-word substrings, which is what I'm after. See above for details... – duhaime Jan 02 '14 at 04:03

A slight modification, matching words rather than chars, will do, I suppose:

import difflib

def match_words(first_string, second_string):
    l1 = first_string.split()
    l2 = second_string.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

Demo:

>>> print '\n'.join(map(' '.join, match_words(a, b)))
all
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
quill
alko
  • The first "are" in each string should match as well, but is not matched in this script – duhaime Jan 02 '14 at 13:41
  • @duhaime I'm not so sure: `all` is matched as well, and `are` comes after `all` in the first string but before it in the second. – alko Jan 02 '14 at 14:06
  • @duhaime this is a non-symmetric matcher. You can use match_words(b, a), where `are` is present but `all` is absent. – alko Jan 02 '14 at 14:13
  • I take it that `all` should match for the same reason that `are` should match--both are maximally long n-word units that appear in both strings. Do you disagree? – duhaime Jan 02 '14 at 14:47