
I'm doing a Java project where I have to make a text-similarity program. I want it to take two text documents, compare them with each other, and report how similar they are.

Later I'll plug in an existing database that can find synonyms for words and go through the text to see if one of the document writers just swapped words for synonyms while the text is otherwise the same. The same goes for moving paragraphs up or down. Yes, essentially a plagiarism detector...

I'd like to hear from you what kind of algorithms you would recommend.

I've found Levenshtein distance and cosine similarity by looking here and in other places; both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.

I have some questions about them, since I'm not really getting the Wikipedia articles. Could someone explain these things to me?

Levenshtein: this algorithm uses substitutions, insertions, and deletions to change one word into another and measures how close the two are. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document compared to another.

Cosine: it measures the similarity between two vectors as the cosine of the angle between them. What I don't understand is how two texts can become two vectors, and what happens to the words/sentences in them.

Hamming: this distance seems to work better than Levenshtein, but it only applies to strings of equal length. How is it relevant when two documents, and even the sentences in them, aren't strings of equal length?

Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound stupid, but this is hanging me up, and I think there are people here quite capable of explaining it so that even newcomers to the field can get it.

Thanks for your time.

C. K. Young
N00programmer
  • Take a look at: http://en.wikipedia.org/wiki/Diff#Algorithm – YXD Apr 26 '11 at 17:31
  • We'll surely help you, though not from scratch. How about you give it a start and get to a point from where we can take it up? That's the way homework/school projects should work. – Omnipotent Apr 26 '11 at 17:31
  • Err... sorry, I'm not getting you. I think I did start; that's why I'm asking about the things I'm not understanding. I'm not at the program-writing part yet, I'm at the understanding part. I like to get it before using it. – N00programmer Apr 26 '11 at 17:57

3 Answers


Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.

Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
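Roughly, the idea in Java (just a sketch, with method names I made up rather than anything from a library): each document becomes a map from word to count, and the cosine is the dot product of the two count vectors divided by the product of their lengths.

import java.util.*;

// Turn a document into a "vector": a map from each word to how often it occurs.
static Map<String, Integer> termCounts(String text) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String word : text.toLowerCase().split("\\W+")) {
        if (word.length() == 0) continue;
        Integer c = counts.get(word);
        counts.put(word, c == null ? 1 : c + 1);
    }
    return counts;
}

// cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 means identical word distributions.
static double cosineSimilarity(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
        Integer other = b.get(e.getKey());
        if (other != null) dot += e.getValue() * other;
        normA += e.getValue() * e.getValue();
    }
    for (Integer v : b.values()) normB += v * v;
    if (normA == 0 || normB == 0) return 0.0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

That's the whole trick of "turning text into a vector": each distinct word is one dimension, and its count (or weight) is the coordinate along that dimension.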

I've never used Hamming distance for this purpose, so I can't say much about it.

I would add TFIDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account which words are extremely common in an entire corpus rather than just the ones that happen to be common to two particular documents.
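To give the flavor (again just a sketch, with names of my own choosing): the weight of a term in a document is its frequency in that document multiplied by the log of how rare it is across the whole corpus.

// tf = number of times the term occurs in this document
// df = number of documents in the corpus containing the term
// n  = total number of documents in the corpus
static double tfIdf(int tf, int df, int n) {
    if (tf == 0 || df == 0) return 0.0;
    return tf * Math.log((double) n / df);
}

You then build the document vectors from these weights instead of raw counts and compare them the same way (e.g. with cosine).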

One final note: for any of these to produce useful results, you pretty much need to screen out stop words before you try to compute the degree of similarity (though TFIDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemming algorithm.
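Filtering against a fixed set is enough for the stop-word part (the list below is deliberately tiny, just to show the idea, and it assumes the same java.util imports as above; I won't reproduce the stemmer here):

// A real stop-word list would be a few hundred entries long.
static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that"));

static List<String> removeStopWords(List<String> words) {
    List<String> kept = new ArrayList<String>();
    for (String w : words) {
        if (!STOP_WORDS.contains(w)) kept.add(w);
    }
    return kept;
}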

For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word, and for each word substitute a single canonical word for that meaning. I tried this on one project, and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
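In code, the inverted thesaurus is basically a map from each synonym to one canonical word, applied before you count terms (the entries below are made-up examples):

// Map every synonym to a single canonical form, so "automobile" and "auto"
// both count as "car" when the documents are compared.
static final Map<String, String> CANONICAL = new HashMap<String, String>();
static {
    CANONICAL.put("automobile", "car");
    CANONICAL.put("auto", "car");
    CANONICAL.put("vehicle", "car");
}

static String canonicalForm(String word) {
    String c = CANONICAL.get(word);
    return c == null ? word : c; // unknown words map to themselves
}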

Jerry Coffin
  • Thanks. The cosine implementation I did see, since I went through most of the topics related to similarity. It's the understanding-the-vector-from-text part where I'm lost; unique words, eh... About the inverted thesaurus, I'm going to use this lexical database called WordNet for that: http://wordnet.princeton.edu/ It's a whole massive project they have been doing. That's for much later though. Right now it's about understanding the algorithms before I start playing with them :D – N00programmer Apr 26 '11 at 18:35
  • Ahh, never mind the vector deal. I just found this nice example that explains it just the way I prefer. Here it is if anyone else wants to read it too: http://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-very-simple-graphical-way – N00programmer Apr 26 '11 at 18:40
  • 1
    This is good *except* for the fact that TFIDF isn't really a solution for calculating similarity. Its purpose is to determine how important a term is to the entire document. Thus, it can be used to calculate cosine similarity (as the answer you link to shows). – Jason Baker Jul 21 '11 at 23:22

The basic idea of comparing the similarity between two documents, which is a topic in information retrieval, is to extract a fingerprint from each and judge whether they share information based on those fingerprints.

Just a hint: the paper Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
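Very roughly, winnowing hashes every k-gram of the text and keeps the minimum hash in each sliding window as the document's fingerprints; two documents that share many fingerprints likely share text. A simplified Java sketch (parameter values and method names are my own, and it omits the positional bookkeeping the paper describes):

import java.util.*;

// Simplified winnowing: hash all k-grams, keep the minimum hash per window.
static Set<Integer> fingerprint(String text, int k, int window) {
    String s = text.toLowerCase().replaceAll("\\s+", " ");
    int n = s.length() - k + 1;
    Set<Integer> prints = new HashSet<Integer>();
    if (n < window) return prints;
    int[] hashes = new int[n];
    for (int i = 0; i < n; i++) {
        hashes[i] = s.substring(i, i + k).hashCode();
    }
    for (int start = 0; start + window <= n; start++) {
        int min = hashes[start];
        for (int j = start + 1; j < start + window; j++) {
            if (hashes[j] <= min) min = hashes[j];
        }
        prints.add(min);
    }
    return prints;
}

// Similarity as the fraction of fingerprints the two documents share.
static double resemblance(Set<Integer> a, Set<Integer> b) {
    Set<Integer> shared = new HashSet<Integer>(a);
    shared.retainAll(b);
    Set<Integer> union = new HashSet<Integer>(a);
    union.addAll(b);
    return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
}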

Summer_More_More_Tea

Consider the example on wikipedia for Levenshtein distance:

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

   1. kitten → sitten (substitution of 's' for 'k')
   2. sitten → sittin (substitution of 'i' for 'e')
   3. sittin → sitting (insertion of 'g' at the end).

Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".

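// Compare every pair of papers exactly once (j starts at i + 1, so there are
// no self-comparisons and no duplicate pairs).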
Paper[] papers = getPapers();
for(int i = 0; i < papers.length - 1; i++) {
    for(int j = i + 1; j < papers.length; j++) {
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text,second.text);
        System.out.println(first.name + "'s paper compares to " + second.name + "'s paper with a similarity score of " + dist);
    }
}

Compare those results and peg the kids with the lowest distance scores.

In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism).
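For instance, compareSimilarities could be the standard Levenshtein dynamic program applied to the two full texts. This is only a sketch, and note that it's O(m*n) in the two text lengths, so it gets slow for long papers:

// Levenshtein distance between the two texts, using two rolling rows of the
// (m+1) x (n+1) dynamic-programming table.
static int compareSimilarities(String a, String b) {
    int m = a.length(), n = b.length();
    int[] prev = new int[n + 1];
    int[] curr = new int[n + 1];
    for (int j = 0; j <= n; j++) prev[j] = j;          // distance from "" to b[0..j)
    for (int i = 1; i <= m; i++) {
        curr[0] = i;                                   // distance from a[0..i) to ""
        for (int j = 1; j <= n; j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution or match
        }
        int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[n];
}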

corsiKa
  • I see. So we save all the words in an array and then, depending on the algorithm, we compare? So it could be as in the example: `Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".` Here it could take the word "text" and compare it with every word of the sentence to see how similar they are, then jump to the next word and do the exact same thing until the last word, and then spit out the score? – N00programmer Apr 26 '11 at 18:00
  • I would try it treating the text of the paper like one giant word. – corsiKa Apr 26 '11 at 18:11
  • Hmmm... so "textfromthefirstpaper" and "textfromthesecondpaper", and compare them. But won't that give a totally different result if I took the second giant word and wrote "thesecondpaperstext", since it seems to compare string-wise? O.o So am I right that while the first one will give a more similar result, the second won't at all? – N00programmer Apr 28 '11 at 18:33