
I am currently developing an application with Spark in Python. I have a dataset of hotels with the following fields: Id, Hotel name, Address, ..., longitude, latitude.

I would like to compute, for each hotel, the top 5 hotels located nearby.

Is it possible to do this in Spark? I do not know whether I can parallelize my dataset as an RDD and then compute each line against the entire dataset.


So here is what I tried:

```python
test = booking_data.cartesian(booking_data).map(lambda ((x1, y1), (x2, y2)): distanceBetweenTwoPoints)
```

distanceBetweenTwoPoints is my function which computes the distance between two points, taking four parameters.
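For reference, a minimal sketch of such a function, assuming the haversine formula and (lon1, lat1, lon2, lat2) as the four parameters:

```python
from math import radians, sin, cos, asin, sqrt

def distanceBetweenTwoPoints(lon1, lat1, lon2, lat2):
    # Great-circle (haversine) distance in km; the formula is an assumption,
    # only the four-parameter signature comes from the question.
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```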

The error displayed is: ValueError: too many values to unpack

Mael Razavet

1 Answer


I implemented a grid-based search algorithm for efficiently finding the top-rated hotels around each hotel; the idea is explained, for example, here. The source code can be found in my GitHub gist.

The algorithm is based on grouping hotels into "buckets" (cells of a grid) and also distributing each hotel into its 8 neighbouring buckets. These are then brought together by groupByKey and analyzed independently of the rest of the data, as in the sketch below. I didn't run many tests, but the output looks reasonable. I hope this helps for future reference.
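A condensed sketch of the idea (the full version is in the gist; the (hotel_id, lon, lat) record layout, the cell size, and the helper names here are assumptions for the example):

```python
from math import radians, sin, cos, asin, sqrt

CELL = 0.1  # grid cell size in degrees; assumed value, tune to the search radius

def haversine(lon1, lat1, lon2, lat2):
    # great-circle distance in km
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def to_buckets(hotel):
    # emit each hotel into its own cell and the 8 surrounding cells,
    # flagging which copy sits in the hotel's home cell
    hid, lon, lat = hotel
    home = (int(lon // CELL), int(lat // CELL))
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            cell = (home[0] + dx, home[1] + dy)
            yield (cell, (hid, lon, lat, cell == home))

def top5_in_bucket(hotels):
    # for every hotel whose home cell this is, rank all candidates in the
    # bucket by distance and keep the 5 nearest
    hotels = list(hotels)
    for hid, lon, lat, is_home in hotels:
        if not is_home:
            continue
        dists = sorted((haversine(lon, lat, lon2, lat2), hid2)
                       for hid2, lon2, lat2, _ in hotels if hid2 != hid)
        yield (hid, [h for _, h in dists[:5]])

# booking_data is assumed to be an RDD of (hotel_id, lon, lat) tuples
top5 = (booking_data
        .flatMap(to_buckets)
        .groupByKey()
        .flatMap(lambda kv: top5_in_bucket(kv[1])))
```

Note that the cell size bounds the search: a fifth-nearest neighbour lying more than one cell away will be missed, so CELL has to be chosen to match the density of the data.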

NikoNyrh