I'm doing a Java project where I have to make a text similarity program. I want it to take 2 text documents, then compare them with each other and get the similarity of it. How similar they are to each other.
I'll later put an already database which can find the synonyms for the words and go through the text to see if one of the text document writers just changed the words to other synonyms while the text is exactly the same. Same thing with moving paragrafs up or down. Yes, as was it a plagarism program...
I want to hear from you people what kind of algoritms you would recommend.
I've found Levenstein and Cosine similarity by looking here and other places. Both of them seem to be mentioned a lot. Hamming distance is another my teacher told me about.
I got some questions related to those since I'm not really getting Wikipedia. Could someone explain those things to me?
Levenstein: This algorithm changed by sub, add and elimination the word and see how close it is to the other word in the text document. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document from one to another.
Cosine: It's measure of similarity between two vectors by measuring the cosine of the angle between them. What I don't understand here how two text can become 2 vectors and what about the words/sentence in those?
Hamming: This distance seems to work better than Levenstein but it's only on equal strings. How come it's important when 2 documents and even the sentences in those aren't two strings of equal length?
Wikipedia should make sense but it's not. I'm sorry if the questions sound too stupid but it's hanging me down and I think there's people in here who's quite capeable to explain it so even newbeginners into this field can get it.
Thanks for your time.