18

To compute the similarity between two documents, I create a feature vector containing the term frequencies. For the next step, however, I can't decide between cosine similarity and Hamming distance.

My question: Do you have experience with these algorithms? Which one gives you better results?

In addition: could you tell me how to code cosine similarity in PHP? For Hamming distance, I've already got the code:

function check ($terms1, $terms2) {
    // map each term of document 1 to its frequency
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    // for every token of document 2, add the frequency it has in document 1
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    // scale by the product of the two document lengths
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

I don't want to use any other algorithm; I would just like help deciding between these two.

And maybe someone can say something about how to improve these algorithms. Will you get better results if you filter out stop words or other common words?

I hope you can help me. Thanks in advance!

caw

4 Answers

18

Hamming distance should be computed between two strings of equal length, with the order of the characters taken into account.

As your documents are certainly of different lengths, and if the positions of the words do not matter, cosine similarity is better (please note that, depending on your needs, better solutions exist). :)

Here is a cosine similarity function for two arrays of words:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    // use the tokens as array keys so presence can be tested with isset()
    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y; // dot product: 1 for every token shared by both documents
        $b += $x;      // squared norm of A ($x is 0 or 1, so $x equals $x squared)
        $c += $y;      // squared norm of B
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

It is fast (using isset() instead of in_array() makes a huge difference on large arrays).

As you can see, the result does not take the "magnitude" (frequency) of each word into account.

I use it to detect multi-posted messages that are "almost" copy-pasted texts. It works well. :)
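
For illustration, here is a quick usage sketch (the preg_split() tokenization is only an assumption; use whatever tokenizer you already have):

$textA = "The quick brown fox jumps over the lazy dog";
$textB = "The quick brown fox jumped over a lazy dog";

// naive tokenization: lowercase and split on non-word characters
$tokensA = preg_split('/\W+/', strtolower($textA), -1, PREG_SPLIT_NO_EMPTY);
$tokensB = preg_split('/\W+/', strtolower($textB), -1, PREG_SPLIT_NO_EMPTY);

echo cosineSimilarity($tokensA, $tokensB); // roughly 0.82 for these near-duplicates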

The best link about string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

For further interesting readings:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/18/2298

Toto
  • Thank you very much. :) But isn't Mike's solution (selected answer) better? The code is shorter and seems to be as fast as yours. What are the differences? – caw Aug 18 '09 at 18:48
  • Mike's function is not really exact. Try `echo check(array('a', 'b', 'c'), array('a', 'b', 'c'));` It should return 1 (cos(0)) but his function returns 0.33. :( – Toto Aug 18 '09 at 19:20
  • Is your function really correct? It gives 0.71 for [1, 1, 1] and [1, 1, 0]. But http://www.miislita.com/searchito/binary-similarity-calculator.html gives 0.82?! Is it still necessary to divide the similarity value by the document length? – caw Sep 19 '09 at 23:12
  • This tool is for *binary string* comparison. "My" function is for "documents of words". The result will not be the same. :) – Toto Sep 21 '09 at 15:57
  • Ok, thanks. I was looking for some tool to compare since I want to be sure this time I have the correct function ;) And I don't need to divide the value by the length of a document since the length doesn't play a role in cosine similarity, right? – caw Sep 21 '09 at 16:28
9

Unless I'm mistaken, I think you've got an algorithm that lies halfway between the two. For Hamming distance, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += 1;
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

(Note that you're only adding 1 for each matched element in the token vectors.)

And for cosine similarity, use:

function check ($terms1, $terms2) {
    $counts1 = array_count_values($terms1);
    $counts2 = array_count_values($terms2);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term] * $counts2[$term];
    }
    return $totalScore / (count($terms1) * count($terms2));
}

(Note that you're adding the product of the token counts between the two documents.)

The main difference between the two is that cosine similarity will yield a stronger indicator when the two documents share the same word multiple times, while Hamming distance doesn't care how often the individual tokens come up.
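
To make that difference concrete, here is a small illustration of the two kinds of overlap (the sample documents and the helper function are mine, not part of the code above):

function termOverlap($countsA, $countsB) {
    // frequency-aware overlap: sum of the products of matching term counts
    $sum = 0;
    foreach ($countsA as $term => $freq) {
        if (isset($countsB[$term])) $sum += $freq * $countsB[$term];
    }
    return $sum;
}

$countsA = array_count_values(array('php', 'php', 'php', 'code')); // php => 3, code => 1
$countsB = array_count_values(array('php', 'php', 'tips'));        // php => 2, tips => 1

echo termOverlap($countsA, $countsB);                // 6: the repeated word is rewarded (cosine-style)
echo count(array_intersect_key($countsA, $countsB)); // 1: only notes that "php" is shared (Hamming-style)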

Edit: just noticed your query about removing function words etc. I do advise this if you're going to use cosine similarity - as function words are quite frequent (in English, at least), you might skew a result by not filtering them out. If you use Hamming distance, the effect will not be quite as great, but it could still be appreciable in some cases. Also, if you have access to a lemmatizer, it will reduce the misses when one document contains "galaxies" and the other contains "galaxy", for instance.
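
A minimal sketch of such a filter, assuming a tiny hand-picked stop-word list (a real list would be much longer):

$stopWords = array('the', 'a', 'an', 'of', 'and', 'or', 'to', 'in', 'is');
$terms = array('the', 'galaxy', 'is', 'a', 'spiral', 'galaxy');

// keep only the tokens that are not in the stop-word list
$filtered = array_values(array_diff($terms, $stopWords));
print_r($filtered); // galaxy, spiral, galaxy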

Whichever way you go, good luck!

Mike
  • Thank you very much! So if I'm using a combination of both algorithms, does it also combine their advantages? Then it would be better than either algorithm on its own, right? :) Or would it be better to use one of your code examples? Your last sentence is quite interesting. So cosine similarity would be better for my purpose, right? Since it expresses a stronger relation between two texts if one word appears often, doesn't it? – caw Jun 03 '09 at 16:51
  • @marco92w: i think cosine similarity is best in this case - also see my recent edit about function words. your intuition was dead on there. – Mike Jun 03 '09 at 16:56
  • Thx, the edit is informative, too. Last question: :) What's the difference between cosine similarity and my algorithm (code in question)? Which one is better? – caw Jun 03 '09 at 17:11
  • Your code, as it stands, favors document one - you are ignoring the token count from document two. Ideally, a similarity algorithm should be symmetric. If you apply the algorithm from A to B, you should get the same result as B to A. If you tried that with your code, you'd get different results. Now, if you have a "preferred" document, your code might just be what you're looking for! – Mike Jun 03 '09 at 17:15
  • My algorithm doesn't favor or prefer any text. See this example: http://paste.bradleygill.com/index.php?paste_id=9911 So there must be another difference!? – caw Jun 03 '09 at 18:13
  • Try it with text1 = "the" and text2 = "the the". You'll get a factor of 2 difference. – Mike Jun 03 '09 at 18:52
  • Sorry, I don't know what you mean. Your example always gives back 500. So what's the problem? – caw Jun 04 '09 at 14:01
  • Maybe I'm misunderstanding your code - is $counts1[$term] a hash that returns the token count for a given term? – Mike Jun 04 '09 at 17:16
  • $counts1 contains all the words of a text as the keys and the number of their occurrences as the values. This becomes clear if you see what array_count_values() does. So maybe you're really misunderstanding my code!? So what's better? My code (halfway between the two algorithms) or the cosine similarity? – caw Jun 04 '09 at 18:28
  • No, that's exactly what I thought. But look at your code - what happens when "the" occurs once in text 1, but twice in text 2? The $totalScore variable would be different. Are you uniquing your $terms2 list? That's the only place left for confusion on my part. I was assuming the $terms2 list only had unique entries. – Mike Jun 04 '09 at 19:03
  • No, neither $terms1 nor $terms2 are unique. They contain duplicate values so that the new array (after array_count_values) contain different numbers as values. For example: array("the"=>1) and array("the"=>2) – caw Jun 04 '09 at 20:31
  • oh! then we're all set - just unique the $terms(1|2) arrays after you call array_count_values, and the code i have as written will do the trick. for your purposes, i still think cosine similarity is a better measure if you're just doing text comparison. do you mind if i ask what domain you're tackling? – Mike Jun 05 '09 at 15:13
  • Yes, then your code for cosine similarity works. Thx! :) But is it better than my "halfway between the two algorithms" code? What are the differences? – caw Jun 06 '09 at 13:31
  • There is something strange in this cosine similarity function. Shouldn't the result be 1 in this case: `echo check(array('a', 'b', 'c'), array('a', 'b', 'c'));`? Instead I get 0.333, which btw is the same result as `echo check(array('a', 'b', 'c'), array('a', 'b'));` – Toto Aug 17 '09 at 17:03
  • Toto is correct. The vector norm calculations are incorrect for both distance functions. – outis Aug 17 '09 at 21:34
  • I wrote a little function to calculate the cosine similarity (Bottom of this page). Hope this helps. :) – Toto Aug 17 '09 at 21:53
5

I apologize for ignoring the fact that you said that you didn't want to use any other algorithms, but seriously, Levenshtein distance and Damerau-Levenshtein distance are way more freakin' useful than Hamming distance. Here's a D-L distance implementation in PHP, and if you don't like PHP's native levenshtein() function, which I think you won't because it has a length limit, here's a non-length-limited version:

function levenshtein_distance($text1, $text2) {
    $len1 = strlen($text1);
    $len2 = strlen($text2);
    $distance = array();

    // the distance from the empty prefix is the number of characters so far
    for ($i = 0; $i <= $len1; $i++) $distance[$i][0] = $i;
    for ($j = 0; $j <= $len2; $j++) $distance[0][$j] = $j;

    // each cell is the cheapest of deletion, insertion or substitution
    for ($i = 1; $i <= $len1; $i++) {
        for ($j = 1; $j <= $len2; $j++) {
            $distance[$i][$j] = min(
                $distance[$i - 1][$j] + 1,                                      // deletion
                $distance[$i][$j - 1] + 1,                                      // insertion
                $distance[$i - 1][$j - 1] + ($text1[$i - 1] != $text2[$j - 1])  // substitution
            );
        }
    }
    return $distance[$len1][$len2];
}
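
For example, a quick call just to show the usage:

echo levenshtein_distance("kitten", "sitting"); // 3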
chaos
  • Thanks. I think you misunderstood something. I DON'T WANT to use only Hamming distance. I would like to apply it to the feature vector of the text, not the text itself. So I would say that it's more useful than levenshtein, isn't it? ;) But thank you for the code, I'm sure that it's useful for lots of users for other purposes. – caw Jun 03 '09 at 16:48
  • Oops. I did fail to absorb the feature vector part. Never mind. :) Since you like the code, I'll leave the answer undeleted. I hope the downvoters will have mercy. :) – chaos Jun 03 '09 at 16:50
  • Yes, they have. There are more upvoters than downvoters. ;) – caw Jun 03 '09 at 17:11
  • Levenshtein should be used to calculate the edit distance, so it depends on the need. Take "ANNA FRED" and "FRED ANNA": Levenshtein will give a high number, but cosine similarity (on the words) will say they are 100% similar. Similar or not? It depends on your needs. – Toto Aug 17 '09 at 22:12
2

Here is my corrected version of the cosine similarity function posted by Toto:

function cosineSimilarity($tokensA, $tokensB)
{
    $a = $b = $c = 0;
    $uniqueTokensA = $uniqueTokensB = array();

    $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));

    foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
    foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;

    foreach ($uniqueMergedTokens as $token) {
        $x = isset($uniqueTokensA[$token]) ? 1 : 0;
        $y = isset($uniqueTokensB[$token]) ? 1 : 0;
        $a += $x * $y;
        $b += pow($x,2);
        $c += pow($y,2);
    }
    return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}
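
With this version, the test case discussed in the comments above behaves as expected:

echo cosineSimilarity(array('a', 'b', 'c'), array('a', 'b', 'c')); // 1
echo cosineSimilarity(array('a', 'b', 'c'), array('a', 'b'));      // 2 / sqrt(6), roughly 0.82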
BenMorel
  • $x (and $y) will always be 1 (the token exists) or 0 (the token does not exist). In this case, pow($x, 2) will always return $x. Therefore I removed it to save CPU. :) – Toto Aug 13 '10 at 02:41
  • So your versions are both correct, Lorenzo and Toto? They both work? – caw Aug 26 '10 at 12:58