Yes you can use something like SpatialSpark. It only works with Spark 1.6.1 but you can use BroadcastSpatialJoin to create an RTree
which is extremely efficient.
Here's an example of me using SpatialSpark with PySpark to check if different polygons are within each other or are intersecting:
from ast import literal_eval as make_tuple
print "Java Spark context version:", sc._jsc.version()
spatialspark = sc._jvm.spatialspark
rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))
def geomABWithId():
return sc.parallelize([
(0L, rectangleA.wkt),
(1L, rectangleB.wkt)
])
def geomCWithId():
return sc.parallelize([
(0L, rectangleC.wkt)
])
def geomABCWithId():
return sc.parallelize([
(0L, rectangleA.wkt),
(1L, rectangleB.wkt),
(2L, rectangleC.wkt)])
def geomDWithId():
return sc.parallelize([
(0L, pointD.wkt)
])
dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])
# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
joinRDD.count()
results = joinRDD.collect()
map(lambda result: make_tuple(result.toString()), results)
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
Note the line
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
the last argument is a buffer value, in your case it would be the tolerance you want to use. It will probably be a very small number if you are using lat/lon since it's a radial system and depending on the meters you want for your tolerance you will need to calculate based on lat/lon for your area of interest.