I have an RDD (called "data") where each row is an id/vector pair, like so:
[('1',
  array([ 0.16501912, -0.25183533, -0.07702908,  0.07335572,  0.15868553])),
 ('2',
  array([ 0.01280832, -0.27269777,  0.09148506,  0.03950897,  0.15832097])),
 ...]
I need to calculate pairwise similarity across this RDD, comparing each row's vector with every other row's vector. I tried this:
# key: (id_a, id_b), value: dot product of the two rows' vectors
pairs = data.cartesian(data)\
    .map(lambda l: ((l[0][0], l[1][0]), l[0][1].dot(l[1][1])))\
    .sortByKey()
But this is taking forever: the RDD is about 500k rows, so the cartesian product yields roughly 250 billion ordered pairs. I wonder if there's a better way? I am using PySpark.
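One thing I've noticed about my own attempt: the cartesian product emits every ordered pair, so each similarity gets computed twice (and every row is paired with itself), and the sortByKey at the end forces a shuffle I don't strictly need. A sketch of trimming that, assuming it's fine to keep only one ordering of each pair:

# Keep each unordered pair once (id_a < id_b), which also drops self-pairs,
# and skip the sortByKey since I only need the similarities themselves.
pairs = (data.cartesian(data)
             .filter(lambda p: p[0][0] < p[1][0])
             .map(lambda p: ((p[0][0], p[1][0]),
                             float(p[0][1].dot(p[1][1])))))

That only halves the work, though; it's still quadratic in the number of rows.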
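I've also been reading about mllib's DIMSUM-based columnSimilarities(), which computes cosine similarities (not raw dot products) between the columns of a RowMatrix, so I'd have to transpose my rows into columns first. A rough, untested sketch of what I mean, assuming integer row indices (derived here with zipWithIndex) are acceptable in place of my string ids:

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Give each row an integer index; the distributed-matrix APIs want ints.
indexed = data.zipWithIndex()  # ((id, vector), row_index)

# One MatrixEntry per vector component: (row_index, component_index, value).
entries = indexed.flatMap(
    lambda kv: [MatrixEntry(kv[1], j, float(v))
                for j, v in enumerate(kv[0][1])])

# Transpose so each original row becomes a column, then run DIMSUM on it.
mat = CoordinateMatrix(entries).transpose().toRowMatrix()
sims = mat.columnSimilarities()  # CoordinateMatrix of (i, j, cosine similarity)

I don't know whether columnSimilarities() is meant to handle a matrix with 500k columns, and the result would still need joining back to my string ids, hence the question.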
Thanks very much.