I would like to fetch similar documents from a collection.
Sample text is provided below:
car killed cat
Train killed cat
john plays cricket
tom like mangoes
I expect "car killed cat" and "Train killed cat" to be identified as similar documents.
I have tokenized the text, removed stop words, and computed TF-IDF using the code below:
// TOKENIZE DATA
RegexTokenizer regexTokenizer = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words")
    .setPattern("\\W");
DataFrame tokenized = regexTokenizer.transform(trainingRiskData);

// REMOVE STOP WORDS
StopWordsRemover remover = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);

// COMPUTE TERM FREQUENCY USING HASHING
int numFeatures = 50; // matches the vector size (50) in the output below
HashingTF hashingTF = new HashingTF()
    .setInputCol("filtered")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);

// SCALE TERM FREQUENCIES BY INVERSE DOCUMENT FREQUENCY
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
This is what my final data frame looks like:
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|id |text |words |filtered |rawFeatures |features |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|1 |car killed cat |[car, killed, cat] |[car, killed, cat] |(50,[10,12,13],[1.0,1.0,1.0])|(50,[10,12,13],[0.9162907318741551,0.5108256237659907,0.22314355131420976])|
|2 |Train killed cat |[train, killed, cat] |[train, killed, cat] |(50,[12,13,42],[1.0,1.0,1.0])|(50,[12,13,42],[0.5108256237659907,0.22314355131420976,0.9162907318741551])|
|3 |john plays cricket|[john, plays, cricket]|[john, plays, cricket]|(50,[1,5,13],[1.0,1.0,1.0]) |(50,[1,5,13],[0.5108256237659907,0.9162907318741551,0.22314355131420976]) |
|4 |tom like mangoes |[tom, like, mangoes] |[tom, like, mangoes] |(50,[1,18,26],[1.0,1.0,1.0]) |(50,[1,18,26],[0.5108256237659907,0.9162907318741551,0.9162907318741551]) |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
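As a sanity check on the weights above: if I understand correctly, Spark's IDF uses ln((numDocs + 1) / (docFreq + 1)), which reproduces the three distinct values in the features column (4 documents, with terms occurring in 1, 2, or 3 of them). A minimal plain-Java check:

```java
public class IdfCheck {
    // Spark MLlib's IDF formula: ln((numDocs + 1) / (docFreq + 1))
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) (numDocs + 1) / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 4;
        // Terms appearing in 3, 2, and 1 of the 4 documents
        System.out.println(idf(numDocs, 3)); // 0.22314355131420976
        System.out.println(idf(numDocs, 2)); // 0.5108256237659907
        System.out.println(idf(numDocs, 1)); // 0.9162907318741551
    }
}
```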
What I understand from the link below is that cosine similarity can be used to measure the similarity between two vectors.
https://github.com/goldshtn/spark-workshop/blob/master/python/lab7-plagiarism.md
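To make sure I follow the cosine computation from that lab, here is a minimal plain-Java sketch comparing the doc 1 and doc 2 vectors from the table above, expanded to dense arrays of size 50 (the term labels in the comments are my guesses, based on which indices the two documents share):

```java
public class CosineDemo {
    // Plain cosine similarity: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // TF-IDF vectors for "car killed cat" and "Train killed cat",
        // copied from the features column above
        double[] doc1 = new double[50];
        doc1[10] = 0.9162907318741551;   // "car" (guess)
        doc1[12] = 0.5108256237659907;   // "killed" (guess)
        doc1[13] = 0.22314355131420976;  // "cat" (guess)

        double[] doc2 = new double[50];
        doc2[12] = 0.5108256237659907;
        doc2[13] = 0.22314355131420976;
        doc2[42] = 0.9162907318741551;   // "train" (guess)

        System.out.println(cosine(doc1, doc2)); // roughly 0.27
    }
}
```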
My requirement is different: I want to identify similar documents within a collection of documents.
I would like to know whether the solution below would help.
I converted the data frame into a RowMatrix using the code below and invoked columnSimilarities, which returns a CoordinateMatrix:
JavaRDD<Vector> tempRDD = featurizedData.select("text", "features").toJavaRDD()
        .map(new Function<Row, Vector>() {
            @Override
            public Vector call(Row row) throws Exception {
                // The "features" column (index 1) holds the TF-IDF vector
                return (Vector) row.get(1);
            }
        });
RowMatrix rowMatrix = new RowMatrix(tempRDD.rdd());
CoordinateMatrix matchingData = rowMatrix.columnSimilarities(0.8);
A CoordinateMatrix is a collection of MatrixEntry objects, each of the form MatrixEntry(i, j, value).
Below is the CoordinateMatrix I get:
MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(12,42,0.7071067811865476)
MatrixEntry(1,13,0.408248290463863)
MatrixEntry(1,18,0.7071067811865476)
MatrixEntry(1,5,0.7071067811865476)
MatrixEntry(18,26,1.0)
MatrixEntry(5,13,0.5773502691896257)
MatrixEntry(1,26,0.7071067811865476)
MatrixEntry(12,13,0.816496580927726)
MatrixEntry(10,12,0.7071067811865476)
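One thing I checked by hand: taking the values at feature indices 12 and 13 across all four documents (i.e. two *columns* of the row matrix) and computing their cosine similarity in plain Java reproduces the value in MatrixEntry(12,13,0.816496580927726), which makes me suspect i and j are column (feature) indices rather than document ids:

```java
public class ColumnCheck {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Feature indices 12 and 13 across the four documents (rows of the table)
        double[] col12 = {0.5108256237659907, 0.5108256237659907, 0.0, 0.0};
        double[] col13 = {0.22314355131420976, 0.22314355131420976, 0.22314355131420976, 0.0};
        System.out.println(cosine(col12, col13)); // ~0.8165, same as MatrixEntry(12,13,...)
    }
}
```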
How should I read this matrix? Are i and j document ids, or something else?
In case my approach is totally incorrect, kindly let me know.
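For reference, what I am ultimately after is a row-wise (document-to-document) comparison. A brute-force plain-Java sketch over the four TF-IDF vectors copied from the table shows the result I expect (fine for 4 documents, obviously not for a large collection):

```java
public class DocPairs {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // TF-IDF vectors copied from the features column of the table above
        double[][] docs = new double[4][50];
        docs[0][10] = 0.9162907318741551; docs[0][12] = 0.5108256237659907; docs[0][13] = 0.22314355131420976;
        docs[1][12] = 0.5108256237659907; docs[1][13] = 0.22314355131420976; docs[1][42] = 0.9162907318741551;
        docs[2][1]  = 0.5108256237659907; docs[2][5]  = 0.9162907318741551; docs[2][13] = 0.22314355131420976;
        docs[3][1]  = 0.5108256237659907; docs[3][18] = 0.9162907318741551; docs[3][26] = 0.9162907318741551;

        // Docs 1 and 2 ("car killed cat" / "Train killed cat") score highest
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                System.out.printf("doc %d vs doc %d: %.4f%n", i + 1, j + 1, cosine(docs[i], docs[j]));
            }
        }
    }
}
```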