
I'm prototyping some PySpark code locally that classifies a list of points. I'm quite new to Spark, so my lack of experience probably plays a big role in the outcome, but I have tried many things by now and would really appreciate some advice.

Here is a small snippet of code to show the main idea:

from scipy.spatial import kdtree

# Build the tree from the (x, y) columns and broadcast it to the workers.
kdt = kdtree.KDTree(np_pointcloud[:, 0:2])
kdt_b = sc.broadcast(kdt)

# For each input point, look up its neighbours within the window.
t = points_parsed.map(lambda x: kdt_b.value.query_ball_point(x, r=window_size))

Basically, the idea is to build a tree with SciPy's KDTree implementation and use it to query a set of points from an RDD, mapping each input point to the set of its neighbours within a window.
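For reference, here is what each map call is meant to do, as a minimal SciPy-only sketch outside Spark (the point cloud and the window size below are made up for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical point cloud: rows of (x, y) coordinates.
np_pointcloud = np.array([[0.0, 0.0],
                          [1.0, 1.0],
                          [0.5, 0.5],
                          [5.0, 5.0]])

window_size = 1.0  # assumed query radius

# Build the tree on the first two columns, as in the snippet above.
kdt = KDTree(np_pointcloud[:, 0:2])

# For a single input point, query_ball_point returns the list of indices
# of the points that lie within the window around it.
neighbours = kdt.query_ball_point([0.0, 0.0], r=window_size)
```

Here only points 0 and 2 fall within distance 1.0 of the origin, so `sorted(neighbours)` is `[0, 2]`.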

The error I get in the Python notebook looks like it has to do with serialization. I have tried pickling the KDTree locally on my machine with a workaround, and after loading the pickled file the object works as expected, so I'm not sure whether the error really comes from serialization or whether I'm misunderstanding the concepts I'm dealing with. The idea is to broadcast the already built KDTree object so that every worker can access it to perform a query during the map.
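Since `sc.broadcast` serializes the broadcast value with pickle under the hood, the local pickling test can be reproduced as a plain pickle round-trip. A sketch of that check (the recursion-limit increase is an assumption about the workaround mentioned above: the pure-Python KDTree is a deeply nested node structure, so pickling large trees can hit Python's default recursion limit):

```python
import pickle
import sys

import numpy as np
from scipy.spatial import KDTree

# Assumed workaround: raise the recursion limit so a deep pure-Python
# tree can be pickled without a RecursionError.
sys.setrecursionlimit(10000)

np.random.seed(0)
np_pointcloud = np.random.rand(1000, 2)  # hypothetical point cloud
kdt = KDTree(np_pointcloud)

# Round-trip through pickle, which is essentially what broadcasting does.
restored = pickle.loads(pickle.dumps(kdt))

# The restored tree should answer queries exactly like the original.
same = (sorted(kdt.query_ball_point([0.5, 0.5], r=0.1))
        == sorted(restored.query_ball_point([0.5, 0.5], r=0.1)))
```

If this round-trip succeeds locally but the broadcast still fails on the workers, the problem is more likely in how the value is deserialized on the executor side than in the tree object itself.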

File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/broadcast.py", line 95, in load
    return cPickle.loads(data)
pescurris

0 Answers