0

I have two text documents and want to get the word matches between the two documents. The words can match anywhere - for instance, word#5 of doc1 can match word#3 and word#67 of doc2; and then word#23 of doc1 can again match word#3 and word#67 of doc2 - so I want all the matches. Also, aside from one-word matches I want to similarly get consecutive multiple (2-word, 3-word ....15-word etc) word matches between the two documents. How should I approach this in Java? I have been looking at regular expressions but am still not convinced on the exact approach.

abhishek
  • 817
  • 6
  • 19
  • 33
  • simple brute force matching by running the loop for every word, then every pair of words and so on. regexes seem to offer a way to do this efficiently but haven't been able to comprehend the approach. – abhishek Oct 03 '12 at 13:40

2 Answers2

1

First, you need to split the document into bunches of n words (1 word, 2 words, 3 words, ..., n-words) - those bunches are called n-grams. Refer here.

Secondly, create a Set of n-grams from document A. Then, for each n-gram from document B, check if it is in the set.

Community
  • 1
  • 1
pbetkier
  • 1,657
  • 1
  • 13
  • 10
0

I suggest you to maintain a treeset of single words for each document, looping through the firs treeset and checking the matches versus the second should accomplish your task.

For the multiple words part use the same trick only getting two words groups, eg

word1 word2 word3 yay!

take word1 word2 and put it in the treeset, then take word2 word3 and do the same. You could use regexes to remove punctuation, so the algorithm should consist of three steps:

  1. cleaning the documents from punctuation
  2. "indexing" the words and the groups of consecutive words
  3. comparison phase

About point 1 be careful because for example those phrases are the same:

I ate, the cat didn't, I did

I ate the cat, didnt' I? did!

Community
  • 1
  • 1
Gabber
  • 5,152
  • 6
  • 35
  • 49