
I have an RDD (called "data") where each row is an id/vector pair, like so:

[('1',
  array([ 0.16501912, -0.25183533, -0.07702908,  0.07335572,  0.15868553])),
 ('2',
  array([ 0.01280832, -0.27269777,  0.09148506,  0.03950897,  0.15832097])),
 ...]

I need to calculate pair-wise similarity for this RDD, comparing each row with every other row. I tried this:

# ((id_i, id_j), dot product of vector_i and vector_j) for every pair of rows
pairs = data.cartesian(data)\
        .map(lambda l: ((l[0][0], l[1][0]), l[0][1].dot(l[1][1])))\
        .sortByKey()

But this is taking forever, as the RDD has about 500k rows. Is there a better way? I am using PySpark.

Thanks very much.

  • Convert to dataframe, write a udf. – pissall Apr 29 '18 at 12:42
  • 1
    It is 25e10 operations - taking forever is expected, unless you have __a lot__ of resources, and even then transferring data alone will be very expensive. The better way is not to __calculate pair-wise similarity__ and find approximation which works well in your case. – Alper t. Turker Apr 29 '18 at 13:25
  • 1
    You can go for [this](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH) – mayank agrawal Apr 30 '18 at 07:44
  • Try to first cluster your data into similar groups, then run the cartesian product within those smaller groups – user3689574 Apr 30 '18 at 12:06
  • Take a look at this one and see if that helps? It helped me. https://stackoverflow.com/questions/46663775/spark-cosine-distance-between-rows-using-dataframe – Gopala Apr 30 '18 at 14:11
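Building on the LSH suggestion in the comments, here is a minimal sketch using `pyspark.ml.feature.BucketedRandomProjectionLSH`. It assumes the id/vector RDD from the question is converted to a DataFrame with the vectors wrapped as `pyspark.ml.linalg.Vectors`; the column names and the `bucketLength`, `numHashTables`, and `threshold` values are illustrative and would need tuning for real data.

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed: "data" is the id/vector RDD from the question. Wrap each
    # numpy array as an ml Vector so it can be stored in a DataFrame column.
    df = data.map(lambda r: (r[0], Vectors.dense(r[1]))).toDF(["id", "features"])

    # Hash vectors into buckets; only vectors that land in the same bucket are
    # compared, avoiding the full 500k x 500k cartesian product.
    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                      bucketLength=1.0, numHashTables=3)
    model = brp.fit(df)

    # Approximate self-join: returns pairs whose Euclidean distance is below
    # the threshold, with the distance reported in "distCol".
    pairs = model.approxSimilarityJoin(df, df, threshold=1.0, distCol="distCol")

    # Drop self-pairs and keep each unordered pair only once.
    result = (pairs
              .filter("datasetA.id < datasetB.id")
              .selectExpr("datasetA.id AS id_a", "datasetB.id AS id_b", "distCol"))

Note that this LSH variant buckets by Euclidean distance; if cosine similarity is what you are after, L2-normalizing the vectors first makes the returned distance a monotone function of the cosine similarity, so thresholding on one is equivalent to thresholding on the other.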
