
I am currently developing an application with Spark in Python. I have a dataset of hotels with the following fields: Id, Hotel name, Address, ..., longitude, latitude.

I would like to compute, for each hotel, the top 5 hotels located nearby.

Is it possible to do this in Spark? I do not know whether I can parallelize my dataset as an RDD and then compute each line against the entire dataset.


So here is what I tried:

```python
test = booking_data.cartesian(booking_data).map(lambda ((x1, y1), (x2, y2)): distanceBetweenTwoPoints)
```

distanceBetweenTwoPoints is my function which computes the distance between two points, taking four parameters.
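For reference, a minimal sketch of such a function, assuming the haversine formula and (lon1, lat1, lon2, lat2) as the four parameters:

```python
from math import radians, sin, cos, asin, sqrt

def distanceBetweenTwoPoints(lon1, lat1, lon2, lat2):
    # Great-circle (haversine) distance in km; the formula is an assumption,
    # only the four-parameter signature comes from the question.
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```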

The error displayed is: ValueError: too many values to unpack

Mael Razavet

1 Answer


I implemented a grid-based search algorithm for efficiently finding the top-rated hotels around each hotel; the idea is explained, for example, here. The source code can be found in my GitHub gist.

The algorithm is based on grouping hotels into "buckets" (cells of a grid) and also distributing each hotel into its 8 neighbouring buckets. These are then brought together by groupByKey and analyzed independently of the rest of the data, as in the sketch below. I didn't run many tests, but the output looks reasonable. I hope this helps for future reference.
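A condensed sketch of the idea (the full version is in the gist; the (hotel_id, lon, lat) record layout, the cell size, and the helper names here are assumptions for the example):

```python
from math import radians, sin, cos, asin, sqrt

CELL = 0.1  # grid cell size in degrees; assumed value, tune to the search radius

def haversine(lon1, lat1, lon2, lat2):
    # great-circle distance in km
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def to_buckets(hotel):
    # emit each hotel into its own cell and the 8 surrounding cells,
    # flagging which copy sits in the hotel's home cell
    hid, lon, lat = hotel
    home = (int(lon // CELL), int(lat // CELL))
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            cell = (home[0] + dx, home[1] + dy)
            yield (cell, (hid, lon, lat, cell == home))

def top5_in_bucket(hotels):
    # for every hotel whose home cell this is, rank all candidates in the
    # bucket by distance and keep the 5 nearest
    hotels = list(hotels)
    for hid, lon, lat, is_home in hotels:
        if not is_home:
            continue
        dists = sorted((haversine(lon, lat, lon2, lat2), hid2)
                       for hid2, lon2, lat2, _ in hotels if hid2 != hid)
        yield (hid, [h for _, h in dists[:5]])

# booking_data is assumed to be an RDD of (hotel_id, lon, lat) tuples
top5 = (booking_data
        .flatMap(to_buckets)
        .groupByKey()
        .flatMap(lambda kv: top5_in_bucket(kv[1])))
```

Note that the cell size bounds the search: a fifth-nearest neighbour lying more than one cell away will be missed, so CELL has to be chosen to match the density of the data.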

NikoNyrh