I want to measure how relevant an incoming tweet is to my test article. I have a set of keyphrases, extracted from the test article, in a string array. I want to compute a similarity score between this string array and each incoming tweet (Spark Streaming) so that I can pick out the tweets relevant to those keyphrases. Please help me find this similarity.

I have a string array str[a,b,c,....]. For each incoming tweet, I need to know how many strings from the array are present in it. The more strings that match, the more relevant the tweet is to my keyphrases. The tweet is an RDD and contains only the text.

  • Possible duplicate of [Efficient string matching in Apache Spark](https://stackoverflow.com/questions/43938672/efficient-string-matching-in-apache-spark) – zero323 Oct 04 '18 at 17:12

1 Answer

You can build your own cosine similarity method, for example:

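object CosineSimilarity {

  // Cosine of the angle between x and y: dot(x, y) / (|x| * |y|).
  // Returns NaN if either vector has zero magnitude.
  def cosineSimilarity(x: Array[Double], y: Array[Double]): Double = {
    require(x.size == y.size)
    dotProduct(x, y) / (magnitude(x) * magnitude(y))
  }

  // Sum of the element-wise products of x and y.
  def dotProduct(x: Array[Double], y: Array[Double]): Double = {
    (for ((a, b) <- x zip y) yield a * b).sum
  }

  // Euclidean (L2) norm of x.
  def magnitude(x: Array[Double]): Double = {
    math.sqrt(x.map(i => i * i).sum)
  }

}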

Suppose you have represented your keywords as Array(1.1,1.2,1.3,1.4,1.5) and the next tweet as Array(1.1,1.2,1.2,1.5,1.6). These two vectors are very similar:

scala> import CosineSimilarity._
import CosineSimilarity._

scala> cosineSimilarity(Array(1.1,1.2,1.3,1.4,1.5),Array(1.1,1.2,1.2,1.5,1.6))
res8: Double = 0.9984816648599194
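The numeric arrays above are placeholders. One possible way to get such vectors from text, sketched here under assumptions that are not in the question, is to use the keyphrase list itself as the vocabulary: the keyphrases become a vector of ones, and each tweet becomes a 0/1 vector marking which keyphrases it contains. The names `KeyphraseVectors`, `presenceVector` and `cosine`, and the sample strings, are hypothetical. The cosine here also returns 0.0 instead of NaN when a tweet matches nothing.

```scala
object KeyphraseVectors {

  // Hypothetical helper: 1.0 where the keyphrase occurs in the text
  // (case-insensitive substring match), 0.0 otherwise.
  def presenceVector(keyphrases: Array[String], text: String): Array[Double] = {
    val lower = text.toLowerCase
    keyphrases.map(p => if (lower.contains(p.toLowerCase)) 1.0 else 0.0)
  }

  // Cosine similarity that returns 0.0 when either vector is all zeros.
  def cosine(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length)
    val dot = x.zip(y).map { case (a, b) => a * b }.sum
    val norm = math.sqrt(x.map(v => v * v).sum) * math.sqrt(y.map(v => v * v).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  def main(args: Array[String]): Unit = {
    // Sample data, made up for illustration.
    val keyphrases = Array("spark streaming", "cosine similarity", "word2vec", "tweets")
    val tweet = "Computing cosine similarity over tweets with Spark Streaming"

    val keyVec = Array.fill(keyphrases.length)(1.0) // every keyphrase is "present" in the article
    val tweetVec = presenceVector(keyphrases, tweet)

    println(tweetVec.mkString(", "))
    println(cosine(keyVec, tweetVec))
  }
}
```

With this representation the score depends only on how many keyphrases match, so it ranks tweets the same way a plain match count would.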

I'm not sure whether you are using org.apache.spark.ml.feature.Word2Vec to transform your tokens into numeric vectors. If you are, you need to convert its output into a suitable form (here, Array[Double]) before using the method above. It would help if you posted some of the code you have.
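If you have no numeric vectors yet, the score described in the question (how many keyphrases appear in the tweet) can be computed directly on the text. The following is only a minimal sketch using plain Scala collections; `KeyphraseMatch`, `matchCount` and `relevance` are hypothetical names, and the sample tweets are made up.

```scala
object KeyphraseMatch {

  // Number of keyphrases that occur in the tweet (case-insensitive substring match).
  def matchCount(keyphrases: Seq[String], tweet: String): Int = {
    val lower = tweet.toLowerCase
    keyphrases.count(p => lower.contains(p.toLowerCase))
  }

  // Fraction of keyphrases found in the tweet, in [0, 1]; 0.0 for an empty keyphrase list.
  def relevance(keyphrases: Seq[String], tweet: String): Double =
    if (keyphrases.isEmpty) 0.0
    else matchCount(keyphrases, tweet).toDouble / keyphrases.size

  def main(args: Array[String]): Unit = {
    // Sample data, made up for illustration.
    val keyphrases = Seq("spark", "streaming", "similarity")
    val tweets = Seq(
      "Spark Streaming is fun",
      "Lunch was great",
      "String similarity in Spark"
    )

    // Score every tweet and print the most relevant ones first.
    tweets
      .map(t => (t, relevance(keyphrases, t)))
      .sortBy { case (_, score) => -score }
      .foreach(println)
  }
}
```

The same `relevance` function could presumably be passed to `map` on the RDD or DStream that holds the tweet text. Note that substring matching also counts partial words (for example, "spark" matches "sparkle"), so tokenizing the tweet first may give cleaner results.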

antonioACR1