I am writing a Python script that finds the lengths of all maximal runs of consecutive words (n-word-length substrings) shared by two strings, disregarding trailing punctuation. Given two strings:
"this is a sample string"
"this is also a sample string"
I want the script to identify that these strings have a sequence of 2 words in common ("this is") followed by a sequence of 3 words in common ("a sample string"). Here is my current approach:
a = "this is a sample string"
b = "this is also a sample string"
aWords = a.split()
bWords = b.split()
#create counters to keep track of position in string
currentA = 0
currentB = 0
#create counter to keep track of longest sequence of matching words
matchStreak = 0
#create a list that contains all of the matchstreaks found
matchStreakList = []
#create binary switch to control the use of while loop
continueWhileLoop = 1
for word in aWords:
    currentA += 1
    if word == bWords[currentB]:
        matchStreak += 1
        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1
        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)
    elif word != bWords[currentB]:
        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to our list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0
        while word != bWords[currentB]:
            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1
            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break
        if word == bWords[currentB]:
            matchStreak += 1
            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):
                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break
print(matchStreakList)
This script correctly outputs the (maximum) lengths of the common n-word-length substrings (2, 3), and has done so in all of my tests so far. My question is: is there a pair of strings for which the approach above will not work? More to the point: are there existing Python libraries or well-known approaches that can find the maximum lengths of all n-word-length substrings that two strings share?
[This question is distinct from the longest common substring problem, which is only a special case of what I'm looking for (as I want to find all common substrings, not just the longest common substring). This SO post suggests that methods such as 1) cluster analysis, 2) edit distance routines, and 3) longest common sequence algorithms might be suitable approaches, but I didn't find any working solutions, and my problem is perhaps slightly easier than the one mentioned in the link because I'm dealing with words bounded by whitespace.]
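For reference, the dynamic-programming table commonly used for the longest common substring problem can be run over words instead of characters and made to report every run that cannot be extended, not only the longest one. The following is a minimal sketch under that assumption, not a definitive solution: `common_word_runs` is a hypothetical name (not a library function), and it splits on whitespace without any punctuation handling.

```python
def common_word_runs(a, b):
    """Return every maximal run of consecutive words shared by a and b.

    Hypothetical helper, not from any library. Each distinct run is
    reported once, as a tuple of words, ordered by where it ends in `a`.
    """
    a_words = a.split()
    b_words = b.split()
    # prev[j] = length of the common run ending at a_words[i-2], b_words[j-1]
    prev = [0] * (len(b_words) + 1)
    runs = []
    for i in range(1, len(a_words) + 1):
        curr = [0] * (len(b_words) + 1)
        for j in range(1, len(b_words) + 1):
            if a_words[i - 1] == b_words[j - 1]:
                # Extend the run that ended one word earlier in both strings.
                curr[j] = prev[j - 1] + 1
                # Report the run only if it cannot be extended to the right.
                at_end = i == len(a_words) or j == len(b_words)
                if at_end or a_words[i] != b_words[j]:
                    run = tuple(a_words[i - curr[j]:i])
                    if run not in runs:
                        runs.append(run)
        prev = curr
    return runs


runs = common_word_runs("this is a sample string",
                        "this is also a sample string")
print([len(run) for run in runs])  # [2, 3]
```

A word that recurs elsewhere in either string can also be reported on its own, even when it is part of a longer run, so the result may need filtering depending on what should count as a distinct match.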
EDIT:
I'm starting a bounty on this question. In case it will help others, I wanted to clarify a few quick points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally-long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:
"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
and
"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is:
all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
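Since the comparison is meant to disregard trailing punctuation, the words have to be normalized before they are compared. One possible way to do that is sketched below, assuming a case-insensitive comparison and whitespace-separated words; `normalize_words` is a hypothetical helper name, and punctuation inside a word (such as the apostrophe in "goose's") is deliberately left alone.

```python
import string


def normalize_words(text):
    """Lowercase text, split on whitespace, and strip trailing punctuation.

    Hypothetical helper; only punctuation at the end of each word is removed.
    """
    words = []
    for token in text.lower().split():
        word = token.rstrip(string.punctuation)
        if word:  # skip tokens that consisted only of punctuation
            words.append(word)
    return words


print(normalize_words("You are all white, a sheet of lovely, spotless paper,"))
```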
Using the following routine:
#import required packages
import difflib
#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match
a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"
a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
print(matches(a, b))
One gets output:
['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']
In the first place, I am not sure how one could select from this list the substrings that contain only whole words. In the second place, this list does not include "are", one of the desired maximally-long common n-word-length substrings. Is there a method that will find all of the maximally long n-word-long substrings shared by these two strings ("You are all..." and "They all are...")?
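One way to restrict difflib to whole words, offered here as a sketch under assumptions, not a full answer, is to pass lists of words to `SequenceMatcher` instead of raw strings, since it accepts any sequences of hashable elements. `word_matches` is a hypothetical helper name. Because `get_matching_blocks()` returns blocks that increase monotonically in both sequences, this variant still cannot report both "all" and "are" when their order is swapped between the two strings.

```python
import difflib


def word_matches(first_string, second_string):
    """Return the blocks difflib matches when comparing the strings word by word."""
    first_words = first_string.split()
    second_words = second_string.split()
    # autojunk=False disables difflib's heuristic that ignores very frequent items
    s = difflib.SequenceMatcher(None, first_words, second_words, autojunk=False)
    return [" ".join(first_words[i:i + n])
            for i, j, n in s.get_matching_blocks() if n > 0]


print(word_matches("this is a sample string", "this is also a sample string"))
# ['this is', 'a sample string']
print(word_matches("all are white", "are all white"))
# ['all', 'white']
```

Every element of the result is made of whole words, but the second call shows the limitation: "are" is shared by both strings and is still dropped.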