I have two text documents and want to get the word matches between the two documents. The words can match anywhere - for instance, word#5 of doc1 can match word#3 and word#67 of doc2; and then word#23 of doc1 can again match word#3 and word#67 of doc2 - so I want all the matches. Also, aside from one-word matches I want to similarly get consecutive multiple (2-word, 3-word ....15-word etc) word matches between the two documents. How should I approach this in Java? I have been looking at regular expressions but am still not convinced on the exact approach.
2 Answers
First, you need to split the document into bunches of n words (1 word, 2 words, 3 words, ..., n-words) - those bunches are called n-grams. Refer here.
Secondly, create a Set of n-grams from document A. Then, for each n-gram from document B, check if it is in the set.
I suggest you to maintain a treeset of single words for each document, looping through the firs treeset and checking the matches versus the second should accomplish your task.
For the multiple words part use the same trick only getting two words groups, eg
word1 word2 word3 yay!
take word1 word2
and put it in the treeset, then take word2 word3
and do the same. You could use regexes to remove punctuation, so the algorithm should consist of three steps:
- cleaning the documents from punctuation
- "indexing" the words and the groups of consecutive words
- comparison phase
About point 1 be careful because for example those phrases are the same:
I ate, the cat didn't, I did
I ate the cat, didnt' I? did!