I am trying to use a Java class to measure the cosine similarity between two documents of different lengths. The class that performs the calculation is as follows:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;

public class CosineSimilarityy {
    public Double calculateCosineSimilarity(HashMap<String, Double> firstFeatures, HashMap<String, Double> secondFeatures) {
        Double similarity = 0.0;
        Double sum = 0.0; // the numerator of the cosine similarity
        Double fnorm = 0.0; // the first part of the denominator of the cosine similarity
        Double snorm = 0.0; // the second part of the denominator of the cosine similarity
        Set<String> fkeys = firstFeatures.keySet();
        Iterator<String> fit = fkeys.iterator();
        while (fit.hasNext()) {
            String featurename = fit.next();
            boolean containKey = secondFeatures.containsKey(featurename);
            if (containKey) {
                sum = sum + firstFeatures.get(featurename) * secondFeatures.get(featurename);
            }
        }
        fnorm = calculateNorm(firstFeatures);
        snorm = calculateNorm(secondFeatures);
        similarity = sum / (fnorm * snorm);
        return similarity;
    }

    /**
     * Calculates the Euclidean norm of one feature vector.
     *
     * @param feature the feature vector of one cluster
     * @return the square root of the sum of the squared feature values
     */
    public Double calculateNorm(HashMap<String, Double> feature) {
        Double norm = 0.0;
        Set<String> keys = feature.keySet();
        Iterator<String> it = keys.iterator();
        while (it.hasNext()) {
            String featurename = it.next();
            norm = norm + Math.pow(feature.get(featurename), 2);
        }
        return Math.sqrt(norm);
    }
}

Then I construct an instance of this class, create two HashMaps, and put one document into each of them. When I run the calculation on identical documents, the result is 1.0, which is right. But if there is even a slight difference between them, the result is always zero. What am I missing?

public static void main(String[] args) {
    // TODO code application logic here

    CosineSimilarityy test = new CosineSimilarityy();
    HashMap<String, Double> hash = new HashMap<>();
    HashMap<String, Double> hash2 = new HashMap<>();
    hash.put("i am a book", 1.0);
    hash2.put("you are a book", 2.0);
    double result;
    result = test.calculateCosineSimilarity(hash, hash2);
    System.out.println(" this is the result: " + result);
}

The original code is taken from here.

Tom
lonesome
  • You are inputting two different features into your function, this will always result in zero similarity. – Thomas Jungblut Mar 21 '15 at 14:28
  • @ThomasJungblut but then why does it result in 1 when they are the same? Plus, the function needs two `HashMap`s. So if I am doing it wrong, how can it be fixed? – lonesome Mar 21 '15 at 14:32
  • `but then why does it result in 1 when they are the same?` Well, you want to compute the similarity; if they are the same, it will be 1. – Thomas Jungblut Mar 21 '15 at 14:33
  • @ThomasJungblut just now you said they are two different features and get zero for that. However, from what I understand of cosine similarity, it should give a `real number` result between zero and one. Am I wrong? – lonesome Mar 21 '15 at 14:35
  • Yes, you are correct, but `"i am a book"` and `"you are a book"` are completely different features, so they result in zero similarity. – Thomas Jungblut Mar 21 '15 at 14:47
  • @ThomasJungblut why? They have `a book` as the similar part. I even checked it with `i was a book`, but the result is still zero. – lonesome Mar 21 '15 at 14:48
  • Then you must supply `a` and `book` as a similar feature. `"i", "am", "a", "book"` is a different representation than `"I am a book"`. How should the method know that you mean to split by the words? – Thomas Jungblut Mar 21 '15 at 14:49
  • got it. just like what `nio` mentioned. – lonesome Mar 21 '15 at 14:50
  • @ThomasJungblut yup, now I got 0.75 similarity. Thanks for the hint. – lonesome Mar 21 '15 at 14:52

1 Answer

First, I think `"i am a book"` is being treated as a single feature. To compare the strings, you first have to split each one into words, using whitespace as the separator. Next, populate the HashMaps with the words extracted from each string. You can then test whether your algorithm works correctly.

How do i split a string with any whitespace chars as delimiters?

Cosine similarity (Wikipedia)
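
The splitting step described in this answer could look roughly like the sketch below. The class `WordFeatures` and its methods are hypothetical names that are not part of the original code, and using the raw word count as the `Double` value is an assumption (a simple term frequency). The sketch does no case folding or punctuation handling, and `cosine` only repeats the formula from the question so that the example compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original class: it turns a sentence
// into word counts so that shared words become shared features.
class WordFeatures {

    // Split on runs of whitespace and count how often each word occurs.
    // Assumption: the feature value is the raw count of the word.
    static HashMap<String, Double> termFrequencies(String text) {
        HashMap<String, Double> features = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                features.merge(word, 1.0, Double::sum);
            }
        }
        return features;
    }

    // Dot product over the shared keys divided by the product of the norms,
    // i.e. the same formula as calculateCosineSimilarity in the question.
    static double cosine(Map<String, Double> first, Map<String, Double> second) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : first.entrySet()) {
            Double other = second.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double norms = norm(first) * norm(second);
        // Assumption: an empty document is given similarity 0 instead of NaN.
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    // Euclidean norm of a feature vector.
    static double norm(Map<String, Double> features) {
        double sumOfSquares = 0.0;
        for (double value : features.values()) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> first = termFrequencies("i am a book");
        Map<String, Double> second = termFrequencies("you are a book");
        Map<String, Double> third = termFrequencies("i was a book");
        System.out.println(cosine(first, second)); // prints 0.5
        System.out.println(cosine(first, third));  // prints 0.75
    }
}
```

With whole sentences as keys, the two maps share no key, so the numerator is 0. After splitting, `a` and `book` are shared, which gives 2 / (2 * 2) = 0.5 for the pair in the question. The maps returned by `termFrequencies` can also be passed directly to `calculateCosineSimilarity`.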

nio
  • Do you mean I must first break each `string` into words, put them into `HashMaps`, and then compute the similarity between these two `HashMaps`? – lonesome Mar 21 '15 at 14:37
  • Yes, I do mean that; I'll update the answer. What do you use as the double value for the features? There's a hint here that it should be the term frequency over the whole document: https://en.wikipedia.org/wiki/Cosine_similarity – nio Mar 21 '15 at 14:39
  • To be honest, I am somewhat confused about the `Double` part of the HashMap that the class uses. In fact, I found the cosine similarity that I want [here](http://stackoverflow.com/questions/15173225/how-to-calculate-cosine-similarity-given-2-sentence-strings-python), but it is in Python, not Java. So I searched for a Java version and got the code that I posted here. But it seems to have issues, and I doubt whether it would work for strings of different lengths. Right? – lonesome Mar 21 '15 at 14:43
  • I guess you are right: I should split the whole strings into words and then `put` them into `HashMaps`, and then it works fine even for different lengths. As a last question, is it wise to treat synonyms as the same feature? I mean, for example, should the similarity between `i am a cook` and `i am a chef` be 1? – lonesome Mar 21 '15 at 14:58
  • Nice idea. If you do it with synonyms, you'll probably discover more related content, and more unrelated content too, like funny translations from Google Translate. – nio Mar 21 '15 at 15:50
  • You have to try it to know; you'll have to find some good test data. – nio Mar 22 '15 at 10:40
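
One way to experiment with the synonym idea from the comments above is to map every word to a canonical form before counting it. This is only a sketch: `SynonymFeatures` and the `CANONICAL` table are hypothetical, and the single hand-written entry stands in for whatever synonym source you would actually use.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: words are replaced by a canonical synonym before
// they are counted, so "chef" and "cook" end up as the same feature.
class SynonymFeatures {

    // Assumption: a tiny hand-written table, for illustration only.
    static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("chef", "cook");
    }

    // Split on whitespace, canonicalize each word, and count occurrences.
    static HashMap<String, Double> termFrequencies(String text) {
        HashMap<String, Double> features = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            String key = CANONICAL.getOrDefault(word, word);
            features.merge(key, 1.0, Double::sum);
        }
        return features;
    }

    public static void main(String[] args) {
        Map<String, Double> cook = termFrequencies("i am a cook");
        Map<String, Double> chef = termFrequencies("i am a chef");
        System.out.println(cook.equals(chef)); // prints true
    }
}
```

Because the two feature maps come out identical, the class from the question would score them as 1.0, just as it does for identical inputs. Whether that is desirable depends on your data, as the comments above point out.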