
Given two documents:

Now is the winter of our discontent Made glorious summer by this sun of York; And all the clouds that lour'd upon our house In the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; Our bruised arms hung up for monuments; Our stern alarums changed to merry meetings, Our dreadful marches to delightful measures.

(Richard III)

Richard (Duke of York): [bangs his goblet thrice on the table] Silence! Silence! For the king!

King (Richard III): [stands, hunched, speaks awkwardly] Now is the summer of our sweet content, made o'ercast winter by these Tudor clouds.

(BlackAdder, Season 1, Episode 1)

What sort of algorithm could someone use to find out which 3-word sequences exist in both documents? If there are any, of course - it's not guaranteed.

In this example, "now is the" is the only 3-word sequence that appears in both documents.

Jedidja
  • Thanks :) The other option was taking some text from the really cool regex/xhtml response from here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jedidja Nov 25 '10 at 14:57

2 Answers

2

Sounds like you're looking for common substrings of length 3, except that the "characters" your "strings" are composed of are actually words rather than single characters. So "I have a cunning plan" is a string of length 5.

Suffix trees usually get mentioned at this point: http://en.wikipedia.org/wiki/Generalised_suffix_tree, but whether you need one depends on how long your texts are. A couple of nested loops comparing windows that start at each pair of word boundaries (one from each string) will get the job done eventually.
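For illustration, a minimal brute-force sketch in Python; the `tokenize` helper and its lowercase/punctuation handling are illustrative assumptions, not something the question specifies:

```python
import re

def tokenize(text):
    # Naive tokenization: lowercase, keep runs of letters/apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def common_ngrams_bruteforce(text_a, text_b, n=3):
    """Compare every n-word window of one text against every
    n-word window of the other: O(len(a) * len(b)) window comparisons."""
    a, b = tokenize(text_a), tokenize(text_b)
    matches = set()
    for i in range(len(a) - n + 1):
        for j in range(len(b) - n + 1):
            if a[i:i + n] == b[j:j + n]:
                matches.add(tuple(a[i:i + n]))
    return matches

# For the two quotes in the question this returns {('now', 'is', 'the')}.
```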

Steve Jessop
  • If I could be bothered, I'd totally find the longest common substring of Blackadder and the complete works of Shakespeare, and use that. Presumably it's some deliberate Shakespeare quotation used in an episode, but you never know. – Steve Jessop Nov 25 '10 at 15:13
  • The texts are probably < 1000 words long each. Reading through the Generalised suffix tree material looks interesting, but I may try the quick-and-dirty brute force method first. Either way, thanks for the suggestion :) – Jedidja Nov 25 '10 at 16:45
  • 1
    Suffix tree construction is linear in the number of characters assuming a constant size alphabet. Using words instead of characters means that that you'll have to take into account a large alphabet (and use an algorithm specifically designed taking large alphabets into account). The efficiency of this will probably then be something like O(S log S), instead of the expected O(S) for constant size alphabets. – Nabb Nov 26 '10 at 16:09
2

Short of brute force, the simplest solution would be to build a hash map (or hash set) with an entry for each n-word tuple in one text, and then check whether each tuple in the other text is present in it. This is linear for a constant window size (3 words), but we can do it in linear time independent of the window size.
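A minimal sketch of that idea in Python, assuming naive whitespace/punctuation tokenization; tuples are hashable, so a built-in set stands in for the hash map:

```python
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def common_ngrams_hashed(text_a, text_b, n=3):
    """Put every n-word tuple of one text into a hash set,
    then probe it with every n-word tuple of the other."""
    a, b = tokenize(text_a), tokenize(text_b)
    seen = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
    return {tuple(b[j:j + n]) for j in range(len(b) - n + 1)
            if tuple(b[j:j + n]) in seen}
```

Hashing each tuple from scratch still costs O(n) per window, which is why the total work grows with the window size; the rolling hash described next removes that factor.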

To do this efficiently, we use a rolling hash. The input characters to the rolling hash would be hashes of the words. This technique is known as the Rabin-Karp algorithm.
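One way the Rabin-Karp idea can be sketched in Python; the base and modulus constants are arbitrary illustrative choices, Python's built-in `hash()` stands in for a per-word hash, and the explicit verification on a hash hit guards against collisions:

```python
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Illustrative constants for the polynomial rolling hash; any large prime
# modulus and base will do, these particular values are just an assumption.
MOD = (1 << 61) - 1
BASE = 1_000_003

def window_hashes(words, n):
    """Yield (rolling_hash, start_index) for every n-word window,
    updating the hash in O(1) per step over per-word hashes."""
    if len(words) < n:
        return
    # hash() is randomized across runs but stable within one process,
    # which is all we need here.
    word_hashes = [hash(w) % MOD for w in words]
    top = pow(BASE, n - 1, MOD)   # weight of the word leaving the window
    h = 0
    for i in range(n):
        h = (h * BASE + word_hashes[i]) % MOD
    yield h, 0
    for i in range(n, len(words)):
        h = ((h - word_hashes[i - n] * top) * BASE + word_hashes[i]) % MOD
        yield h, i - n + 1

def common_ngrams_rabin_karp(text_a, text_b, n=3):
    a, b = tokenize(text_a), tokenize(text_b)
    # Map rolling hash -> list of window start positions in text A.
    table = {}
    for h, i in window_hashes(a, n):
        table.setdefault(h, []).append(i)
    matches = set()
    for h, j in window_hashes(b, n):
        # Verify the actual words on a hash hit, since windows can collide.
        for i in table.get(h, ()):
            if a[i:i + n] == b[j:j + n]:
                matches.add(tuple(b[j:j + n]))
    return matches
```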

Nabb