
I'm prototyping some PySpark code locally that classifies a list of points. I'm quite new to Spark, so my lack of experience probably plays a big role in the outcome, but I have tried many things by now and would really appreciate some advice.

Here is a small snippet of code to show the main idea:

from scipy.spatial import kdtree

# Build the tree from the (x, y) columns and broadcast it to the workers.
kdt = kdtree.KDTree(np_pointcloud[:, 0:2])
kdt_b = sc.broadcast(kdt)

# For each input point, look up its neighbours within the window.
t = points_parsed.map(lambda x: kdt_b.value.query_ball_point(x, r=window_size))

Basically, the idea is to build a tree with SciPy's KDTree implementation and use it to query a set of points from an RDD, mapping each input point to the set of its neighbours within a window.
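For reference, here is what each map call is meant to do, as a minimal SciPy-only sketch outside Spark (the point cloud and the window size below are made up for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical point cloud: rows of (x, y) coordinates.
np_pointcloud = np.array([[0.0, 0.0],
                          [1.0, 1.0],
                          [0.5, 0.5],
                          [5.0, 5.0]])

window_size = 1.0  # assumed query radius

# Build the tree on the first two columns, as in the snippet above.
kdt = KDTree(np_pointcloud[:, 0:2])

# For a single input point, query_ball_point returns the list of indices
# of the points that lie within the window around it.
neighbours = kdt.query_ball_point([0.0, 0.0], r=window_size)
```

Here only points 0 and 2 fall within distance 1.0 of the origin, so `sorted(neighbours)` is `[0, 2]`.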

The error I get in the Python notebook looks like it has to do with serialization. I have tried pickling the KDTree locally on my machine with a workaround, and after loading the pickled file the object works as expected, so I'm not sure whether the error really comes from serialization or whether I'm misunderstanding the concepts I'm dealing with. The idea is to broadcast the already built KDTree object so that every worker can access it to perform a query during the map.
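Since `sc.broadcast` serializes the broadcast value with pickle under the hood, the local pickling test can be reproduced as a plain pickle round-trip. A sketch of that check (the recursion-limit increase is an assumption about the workaround mentioned above: the pure-Python KDTree is a deeply nested node structure, so pickling large trees can hit Python's default recursion limit):

```python
import pickle
import sys

import numpy as np
from scipy.spatial import KDTree

# Assumed workaround: raise the recursion limit so a deep pure-Python
# tree can be pickled without a RecursionError.
sys.setrecursionlimit(10000)

np.random.seed(0)
np_pointcloud = np.random.rand(1000, 2)  # hypothetical point cloud
kdt = KDTree(np_pointcloud)

# Round-trip through pickle, which is essentially what broadcasting does.
restored = pickle.loads(pickle.dumps(kdt))

# The restored tree should answer queries exactly like the original.
same = (sorted(kdt.query_ball_point([0.5, 0.5], r=0.1))
        == sorted(restored.query_ball_point([0.5, 0.5], r=0.1)))
```

If this round-trip succeeds locally but the broadcast still fails on the workers, the problem is more likely in how the value is deserialized on the executor side than in the tree object itself.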

File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/broadcast.py", line 95, in load
    return cPickle.loads(data)
pescurris

0 Answers