Using cosine similarity for two text files

Question

I tried to compute the similarity of two text files using cosine similarity. It works when I pass the texts in directly as strings, but I want to get the result after reading the text files from my computer.

// Calculates the cosine similarity between two texts / documents etc. (each word separated by a space)

import java.util.Hashtable;
import java.util.LinkedList;

public class Cosine_Similarity
{
    public class values
    {
        int val1;
        int val2;
        values(int v1, int v2)
        {
            this.val1=v1;
            this.val2=v2;
        }

        public void Update_VAl(int v1, int v2)
        {
            this.val1=v1;
            this.val2=v2;
        }
    }//end of class values

    public double Cosine_Similarity_Score(String Text1, String Text2)
    {
        double sim_score=0.0000000;
        //1. Identify distinct words from both documents
        String [] word_seq_text1 = Text1.split(" ");
        String [] word_seq_text2 = Text2.split(" ");
        Hashtable<String, values> word_freq_vector = new Hashtable<String, Cosine_Similarity.values>();
        LinkedList<String> Distinct_words_text_1_2 = new LinkedList<String>();

        //prepare word frequency vector by using Text1
        for(int i=0;i<word_seq_text1.length;i++)
        {
            String tmp_wd = word_seq_text1[i].trim();
            if(tmp_wd.length()>0)
            {
                if(word_freq_vector.containsKey(tmp_wd))
                {
                    values vals1 = word_freq_vector.get(tmp_wd);
                    int freq1 = vals1.val1+1;
                    int freq2 = vals1.val2;
                    vals1.Update_VAl(freq1, freq2);
                    word_freq_vector.put(tmp_wd, vals1);
                }
                else
                {
                    values vals1 = new values(1, 0);
                    word_freq_vector.put(tmp_wd, vals1);
                    Distinct_words_text_1_2.add(tmp_wd);
                }
            }
        }

        //prepare word frequency vector by using Text2
        for(int i=0;i<word_seq_text2.length;i++)
        {
            String tmp_wd = word_seq_text2[i].trim();
            if(tmp_wd.length()>0)
            {
                if(word_freq_vector.containsKey(tmp_wd))
                {
                    values vals1 = word_freq_vector.get(tmp_wd);
                    int freq1 = vals1.val1;
                    int freq2 = vals1.val2+1;
                    vals1.Update_VAl(freq1, freq2);
                    word_freq_vector.put(tmp_wd, vals1);
                }
                else
                {
                    values vals1 = new values(0, 1);
                    word_freq_vector.put(tmp_wd, vals1);
                    Distinct_words_text_1_2.add(tmp_wd);
                }
            }
        }

        //calculate the cosine similarity score.
        double VectAB = 0.0000000;
        double VectA_Sq = 0.0000000;
        double VectB_Sq = 0.0000000;

        for(int i=0;i<Distinct_words_text_1_2.size();i++)
        {
            values vals12 = word_freq_vector.get(Distinct_words_text_1_2.get(i));

            double freq1 = (double)vals12.val1;
            double freq2 = (double)vals12.val2;
            System.out.println(Distinct_words_text_1_2.get(i)+"#"+freq1+"#"+freq2);

            VectAB=VectAB+(freq1*freq2);

            VectA_Sq = VectA_Sq + freq1*freq1;
            VectB_Sq = VectB_Sq + freq2*freq2;
        }

        System.out.println("VectAB "+VectAB+" VectA_Sq "+VectA_Sq+" VectB_Sq "+VectB_Sq);
        sim_score = ((VectAB)/(Math.sqrt(VectA_Sq)*Math.sqrt(VectB_Sq)));

        return(sim_score);
    }

    public static void main(String[] args)
    {
        Cosine_Similarity cs1 = new Cosine_Similarity();

        System.out.println("[Word # VectorA # VectorB]");
        double sim_score = cs1.Cosine_Similarity_Score("this is text file one", "this is text file two");
        System.out.println("Cosine similarity score = "+sim_score);
    }
}

There is no clear question, what is your issue? — Joakim Danielson, Jan 29 '19 at 18:56

David Pérez Cabrera · Accepted Answer · 2019-01-30T08:24:05.780

Your code compares two strings, not two files, so to compare two files you only need to convert each one into a string first. To do this, read each file line by line and join the lines using a space as the separator.

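// Needs: java.io.IOException, java.nio.file.Files, java.nio.file.Paths, java.util.stream.Collectors
public static void main(String[] args) throws IOException {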
    Cosine_Similarity cs = new Cosine_Similarity();

    // read file 1 and convert into a String
    String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
    // read file 2 and convert into a String
    String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));

    double score = cs.Cosine_Similarity_Score(text1, text2);
    System.out.println("Cosine similarity score = " + score);
}

By the way, read about the Java naming conventions and follow them.

An example:

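import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CosineSimilarity {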

    private static class Values {

        private int val1;
        private int val2;

        private Values(int v1, int v2) {
            this.val1 = v1;
            this.val2 = v2;
        }

        public void updateValues(int v1, int v2) {
            this.val1 = v1;
            this.val2 = v2;
        }
    } // end of class Values

    public double score(String text1, String text2) {
        //1. Identify distinct words from both documents
        String[] text1Words = text1.split(" ");
        String[] text2Words = text2.split(" ");
        Map<String, Values> wordFreqVector = new HashMap<>();
        List<String> distinctWords = new ArrayList<>();

        //prepare word frequency vector by using Text1
        for (String text : text1Words) {
            String word = text.trim();
            if (!word.isEmpty()) {
                if (wordFreqVector.containsKey(word)) {
                    Values vals1 = wordFreqVector.get(word);
                    int freq1 = vals1.val1 + 1;
                    int freq2 = vals1.val2;
                    vals1.updateValues(freq1, freq2);
                    wordFreqVector.put(word, vals1);
                } else {
                    Values vals1 = new Values(1, 0);
                    wordFreqVector.put(word, vals1);
                    distinctWords.add(word);
                }
            }
        }

        //prepare word frequency vector by using Text2
        for (String text : text2Words) {
            String word = text.trim();
            if (!word.isEmpty()) {
                if (wordFreqVector.containsKey(word)) {
                    Values vals1 = wordFreqVector.get(word);
                    int freq1 = vals1.val1;
                    int freq2 = vals1.val2 + 1;
                    vals1.updateValues(freq1, freq2);
                    wordFreqVector.put(word, vals1);
                } else {
                    Values vals1 = new Values(0, 1);
                    wordFreqVector.put(word, vals1);
                    distinctWords.add(word);
                }
            }
        }

        //calculate the cosine similarity score.
        double vectAB = 0.0000000;
        double vectA = 0.0000000;
        double vectB = 0.0000000;
        for (int i = 0; i < distinctWords.size(); i++) {
            Values vals12 = wordFreqVector.get(distinctWords.get(i));
            double freq1 = vals12.val1;
            double freq2 = vals12.val2;
            System.out.println(distinctWords.get(i) + "#" + freq1 + "#" + freq2);
            vectAB = vectAB + freq1 * freq2;
            vectA = vectA + freq1 * freq1;
            vectB = vectB + freq2 * freq2;
        }

        System.out.println("VectAB " + vectAB + " VectA_Sq " + vectA + " VectB_Sq " + vectB);
        return ((vectAB) / (Math.sqrt(vectA) * Math.sqrt(vectB)));
    }

    public static void main(String[] args) throws IOException {
        CosineSimilarity cs = new CosineSimilarity();

        String text1 = Files.readAllLines(Paths.get("path/to/file1")).stream().collect(Collectors.joining(" "));
        String text2 = Files.readAllLines(Paths.get("path/to/file2")).stream().collect(Collectors.joining(" "));

        double score = cs.score(text1, text2);
        System.out.println("Cosine similarity score = " + score);
    }

}
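
The two counting loops above can be condensed. What follows is only a minimal sketch, not the code from this answer: the class name `CosineSketch` and its method names are hypothetical, it splits on any run of whitespace rather than on single spaces, and it returns 0.0 when either text contains no words (the version above would return NaN in that case).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original answer.
class CosineSketch {

    // Counts how often each whitespace-separated word occurs in the text.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Sum of squared counts, i.e. the squared length of the count vector.
    static double squaredNorm(Map<String, Integer> counts) {
        double sum = 0.0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return sum;
    }

    // Cosine similarity of the two word-count vectors.
    static double score(String text1, String text2) {
        Map<String, Integer> a = wordCounts(text1);
        Map<String, Integer> b = wordCounts(text2);

        // Only words present in both texts contribute to the dot product.
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }

        double normA = squaredNorm(a);
        double normB = squaredNorm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat an empty text as having no similarity
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        System.out.println(score("this is text file one", "this is text file two"));
    }
}
```

For the two sample sentences from the question, four of the five words in each text match, so the dot product is 4 and each squared length is 5, giving a score of about 0.8.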

This is just code with no explanation on how it answers the question. Please write an explanation of your solution. — Joakim Danielson, Jan 30 '19 at 08:01

vs97 · Answer 2 · 2019-01-29T19:09:50.510

You could specify which files you want by passing their paths on the command line when you run your program, and then read them in the code from args. For example, you would run your program as java Cosine_Similarity path_to_text1 path_to_text2

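// Note: as written, this compares the argument strings themselves, not the contents of the files they name.
double sim_score = cs1.Cosine_Similarity_Score(args[0], args[1]);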

Currently, you are simply comparing two strings. For short strings, you can pass them directly as arguments. If you want to use actual files, you will need to supply the file paths as arguments, read each file's contents into a single string, and then compare those strings. Take a look at this answer:

Passing file path as an argument in Java
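
As a rough sketch of that last step, assuming the two paths arrive as `args[0]` and `args[1]` and that the files are UTF-8 text: the class `FileArgsSketch` and the method `readAsSingleString` are hypothetical names chosen for illustration. The resulting strings would then be passed to `Cosine_Similarity_Score` from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the original answer.
class FileArgsSketch {

    // Reads the whole file and joins its lines with single spaces.
    // Assumption: the file is encoded as UTF-8.
    static String readAsSingleString(Path path) throws IOException {
        return String.join(" ", Files.readAllLines(path, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FileArgsSketch path_to_text1 path_to_text2");
            return;
        }
        String text1 = readAsSingleString(Paths.get(args[0]));
        String text2 = readAsSingleString(Paths.get(args[1]));

        // text1 and text2 now hold the file contents and can be handed to
        // the similarity method from the question in place of args[0] and args[1].
        System.out.println(text1.length() + " and " + text2.length() + " characters read");
    }
}
```

Joining lines with a space keeps the last word of one line from fusing with the first word of the next, which matters because the similarity code splits on spaces.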
