0

I have two datasets, let's say checkins and POIs, and I have to join them based on geo-coordinates: let's say, if user was seen in radius of N km near POI, I need to join them (in other words, I want to collect all users near each POI for further manipulations). But I have an issue with this geo matching...

Initially I see two different opportunity: 1) implement LSH (locality sensitive hashing) - looks really complicated and performance might suffer as well 2) split all map in regions (2D-matrix) and then calculated how many regions are within N km distance from checkin or POI - then emit all of them - in result some deduplication must be applied - so, not sure that it is valid algo at all

any best practices?

  • So what you're asking is nearest neighbour search. Have you looked here https://en.wikipedia.org/wiki/Nearest_neighbor_search ? – Francois G Jan 17 '15 at 14:38

1 Answers1

0

Interesting problem.

I assume you have considered the naive brute-force approach and found it too time-consuming for your purposes. In the brute-force approach you calculate distances between each of n POIs and each of m checkins, leading to time complexity of O(n*m).

The most simple heuristic I can think of that is also applicable in Spark is to reduce a full linear scan of one dataset by grouping the dataset elements into buckets. Something like this:

case class Position(x: Double, y: Double)
val checkins: RDD[Position] = ???
val radius = 10
val checkinBuckets = checkins.groupBy(pos => (pos.x/radius).toInt)

Instead of a full linear scan, one can search only the corresponding, the next and the previous bucket. If necessary, one can create the second level by grouping the buckets to further speedup the lookup. Also, one should take care of details like correct rounding of pos.x/radius, gps distance calculation, etc.

Of course, you can always dive into various approaches for the nearest neighbour search problem as proped by @huitseeker. Also, this paper has a nice intro intro NNS.

Community
  • 1
  • 1
Marcel Krcah
  • 602
  • 8
  • 13