I would like to fetch similar documents from a collection.
Sample text is provided below:
car killed cat
Train killed cat
john plays cricket
tom like mangoes
I expect "car killed cat" and "Train killed cat" to be identified as similar documents.
I have tokenized the text, removed stop words, and computed TF-IDF using the code below:
// TOKENIZE DATA
RegexTokenizer regexTokenizer = new RegexTokenizer()
    .setInputCol("text")
    .setOutputCol("words")
    .setPattern("\\W");
DataFrame tokenized = regexTokenizer.transform(trainingRiskData);

// REMOVE STOP WORDS
StopWordsRemover remover = new StopWordsRemover()
    .setInputCol("words")
    .setOutputCol("filtered");
DataFrame stopWordsRemoved = remover.transform(tokenized);

// COMPUTE TERM FREQUENCY USING HASHING
int numFeatures = 50; // matches the vector size (50) in the output below
HashingTF hashingTF = new HashingTF()
    .setInputCol("filtered")
    .setOutputCol("rawFeatures")
    .setNumFeatures(numFeatures);
DataFrame rawFeaturizedData = hashingTF.transform(stopWordsRemoved);

// SCALE TERM FREQUENCIES BY INVERSE DOCUMENT FREQUENCY
IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(rawFeaturizedData);
DataFrame featurizedData = idfModel.transform(rawFeaturizedData);
This is what my final data frame looks like:
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|id |text |words |filtered |rawFeatures |features |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
|1 |car killed cat |[car, killed, cat] |[car, killed, cat] |(50,[10,12,13],[1.0,1.0,1.0])|(50,[10,12,13],[0.9162907318741551,0.5108256237659907,0.22314355131420976])|
|2 |Train killed cat |[train, killed, cat] |[train, killed, cat] |(50,[12,13,42],[1.0,1.0,1.0])|(50,[12,13,42],[0.5108256237659907,0.22314355131420976,0.9162907318741551])|
|3 |john plays cricket|[john, plays, cricket]|[john, plays, cricket]|(50,[1,5,13],[1.0,1.0,1.0]) |(50,[1,5,13],[0.5108256237659907,0.9162907318741551,0.22314355131420976]) |
|4 |tom like mangoes |[tom, like, mangoes] |[tom, like, mangoes] |(50,[1,18,26],[1.0,1.0,1.0]) |(50,[1,18,26],[0.5108256237659907,0.9162907318741551,0.9162907318741551]) |
+---+------------------+----------------------+----------------------+-----------------------------+---------------------------------------------------------------------------+
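As a sanity check on the weights above: if I understand correctly, Spark's IDF uses ln((numDocs + 1) / (docFreq + 1)), which reproduces the three distinct values in the features column (4 documents, with terms occurring in 1, 2, or 3 of them). A minimal plain-Java check:

```java
public class IdfCheck {
    // Spark MLlib's IDF formula: ln((numDocs + 1) / (docFreq + 1))
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) (numDocs + 1) / (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 4;
        // Terms appearing in 3, 2, and 1 of the 4 documents
        System.out.println(idf(numDocs, 3)); // 0.22314355131420976
        System.out.println(idf(numDocs, 2)); // 0.5108256237659907
        System.out.println(idf(numDocs, 1)); // 0.9162907318741551
    }
}
```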
What I understand from the link below is that cosine similarity can be used to measure the similarity between two vectors.
https://github.com/goldshtn/spark-workshop/blob/master/python/lab7-plagiarism.md
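To make sure I follow the cosine computation from that lab, here is a minimal plain-Java sketch comparing the doc 1 and doc 2 vectors from the table above, expanded to dense arrays of size 50 (the term labels in the comments are my guesses, based on which indices the two documents share):

```java
public class CosineDemo {
    // Plain cosine similarity: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // TF-IDF vectors for "car killed cat" and "Train killed cat",
        // copied from the features column above
        double[] doc1 = new double[50];
        doc1[10] = 0.9162907318741551;   // "car" (guess)
        doc1[12] = 0.5108256237659907;   // "killed" (guess)
        doc1[13] = 0.22314355131420976;  // "cat" (guess)

        double[] doc2 = new double[50];
        doc2[12] = 0.5108256237659907;
        doc2[13] = 0.22314355131420976;
        doc2[42] = 0.9162907318741551;   // "train" (guess)

        System.out.println(cosine(doc1, doc2)); // roughly 0.27
    }
}
```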
My requirement is different: I want to identify similar documents within a collection of documents.
I would like to know whether the solution below would help.
I converted the data frame into a RowMatrix using the code below and invoked columnSimilarities, which returns a CoordinateMatrix:
JavaRDD<Vector> tempRDD = featurizedData.select("text", "features").toJavaRDD()
        .map(new Function<Row, Vector>() {
            @Override
            public Vector call(Row row) throws Exception {
                // The "features" column (index 1) holds the TF-IDF vector
                return (Vector) row.get(1);
            }
        });
RowMatrix rowMatrix = new RowMatrix(tempRDD.rdd());
CoordinateMatrix matchingData = rowMatrix.columnSimilarities(0.8);
A CoordinateMatrix is a collection of MatrixEntry objects, each of the form MatrixEntry(i, j, value).
Below is the CoordinateMatrix I get:
MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(10,13,0.5773502691896257)
MatrixEntry(12,42,0.7071067811865476)
MatrixEntry(1,13,0.408248290463863)
MatrixEntry(1,18,0.7071067811865476)
MatrixEntry(1,5,0.7071067811865476)
MatrixEntry(18,26,1.0)
MatrixEntry(5,13,0.5773502691896257)
MatrixEntry(1,26,0.7071067811865476)
MatrixEntry(12,13,0.816496580927726)
MatrixEntry(10,12,0.7071067811865476)
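One thing I checked by hand: taking the values at feature indices 12 and 13 across all four documents (i.e. two *columns* of the row matrix) and computing their cosine similarity in plain Java reproduces the value in MatrixEntry(12,13,0.816496580927726), which makes me suspect i and j are column (feature) indices rather than document ids:

```java
public class ColumnCheck {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Feature indices 12 and 13 across the four documents (rows of the table)
        double[] col12 = {0.5108256237659907, 0.5108256237659907, 0.0, 0.0};
        double[] col13 = {0.22314355131420976, 0.22314355131420976, 0.22314355131420976, 0.0};
        System.out.println(cosine(col12, col13)); // ~0.8165, same as MatrixEntry(12,13,...)
    }
}
```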
How should I read this matrix? Are i and j document ids, or something else?
In case my approach is totally incorrect, kindly let me know.
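For reference, what I am ultimately after is a row-wise (document-to-document) comparison. A brute-force plain-Java sketch over the four TF-IDF vectors copied from the table shows the result I expect (fine for 4 documents, obviously not for a large collection):

```java
public class DocPairs {
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // TF-IDF vectors copied from the features column of the table above
        double[][] docs = new double[4][50];
        docs[0][10] = 0.9162907318741551; docs[0][12] = 0.5108256237659907; docs[0][13] = 0.22314355131420976;
        docs[1][12] = 0.5108256237659907; docs[1][13] = 0.22314355131420976; docs[1][42] = 0.9162907318741551;
        docs[2][1]  = 0.5108256237659907; docs[2][5]  = 0.9162907318741551; docs[2][13] = 0.22314355131420976;
        docs[3][1]  = 0.5108256237659907; docs[3][18] = 0.9162907318741551; docs[3][26] = 0.9162907318741551;

        // Docs 1 and 2 ("car killed cat" / "Train killed cat") score highest
        for (int i = 0; i < docs.length; i++) {
            for (int j = i + 1; j < docs.length; j++) {
                System.out.printf("doc %d vs doc %d: %.4f%n", i + 1, j + 1, cosine(docs[i], docs[j]));
            }
        }
    }
}
```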