
I have an RDD (called "data") where each row is an id/vector pair, like so:

[('1',
  array([ 0.16501912, -0.25183533, -0.07702908,  0.07335572,  0.15868553])),
 ('2',
  array([ 0.01280832, -0.27269777,  0.09148506,  0.03950897,  0.15832097])),
 ...]

I need to calculate pair-wise similarity for this RDD, comparing each row with every other row. I tried this:

# ((id_i, id_j), dot product of vector_i and vector_j) for every pair of rows
pairs = data.cartesian(data)\
        .map(lambda l: ((l[0][0], l[1][0]), l[0][1].dot(l[1][1])))\
        .sortByKey()

But this is taking forever, as the RDD has about 500k rows. Is there a better way? I am using PySpark.

Thanks very much.

  • Convert to dataframe, write a udf. – pissall Apr 29 '18 at 12:42
  • 1
    It is 25e10 operations - taking forever is expected, unless you have __a lot__ of resources, and even then transferring data alone will be very expensive. The better way is not to __calculate pair-wise similarity__ and find approximation which works well in your case. – Alper t. Turker Apr 29 '18 at 13:25
  • 1
    You can go for [this](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH) – mayank agrawal Apr 30 '18 at 07:44
  • Try to first cluster your data into similar groups, then run the cartesian product within those smaller groups – user3689574 Apr 30 '18 at 12:06
  • Take a look at this one and see if that helps? It helped me. https://stackoverflow.com/questions/46663775/spark-cosine-distance-between-rows-using-dataframe – Gopala Apr 30 '18 at 14:11
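Building on the LSH suggestion in the comments, here is a minimal sketch using `pyspark.ml.feature.BucketedRandomProjectionLSH`. It assumes the id/vector RDD from the question is converted to a DataFrame with the vectors wrapped as `pyspark.ml.linalg.Vectors`; the column names and the `bucketLength`, `numHashTables`, and `threshold` values are illustrative and would need tuning for real data.

    from pyspark.ml.feature import BucketedRandomProjectionLSH
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Assumed: "data" is the id/vector RDD from the question. Wrap each
    # numpy array as an ml Vector so it can be stored in a DataFrame column.
    df = data.map(lambda r: (r[0], Vectors.dense(r[1]))).toDF(["id", "features"])

    # Hash vectors into buckets; only vectors that land in the same bucket are
    # compared, avoiding the full 500k x 500k cartesian product.
    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                      bucketLength=1.0, numHashTables=3)
    model = brp.fit(df)

    # Approximate self-join: returns pairs whose Euclidean distance is below
    # the threshold, with the distance reported in "distCol".
    pairs = model.approxSimilarityJoin(df, df, threshold=1.0, distCol="distCol")

    # Drop self-pairs and keep each unordered pair only once.
    result = (pairs
              .filter("datasetA.id < datasetB.id")
              .selectExpr("datasetA.id AS id_a", "datasetB.id AS id_b", "distCol"))

Note that this LSH variant buckets by Euclidean distance; if cosine similarity is what you are after, L2-normalizing the vectors first makes the returned distance a monotone function of the cosine similarity, so thresholding on one is equivalent to thresholding on the other.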
