
Is it possible to partition a DataFrame, then convert the partitions to NumPy arrays in parallel and use them, e.g., for training learners from scikit-learn?

For example, I tried this, but got an IndexError: too many indices for array.

import numpy as np

def files_to_numpy(data):
    data_np = np.array(data)
    X = data_np[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
    y = data_np[:, 1]
    return X, y

data.rdd.map(files_to_numpy)

If I wanted to process the data on the driver, I would have to run collect() after the conversion and then use the indexing syntax, but I want to run this in parallel on the workers (and ideally all of the following training steps as well, yielding several trained instances of the learner).
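For reference, a minimal sketch of the driver-side approach I am trying to avoid, assuming the same column layout as above:

import numpy as np

# collect() pulls every row back to the driver, so this does not scale,
# but it shows the indexing I have in mind
rows = data.collect()
data_np = np.array([list(r) for r in rows])
X = data_np[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
y = data_np[:, 1]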


1 Answer


You can use data.rdd.mapPartitions(func), which is executed once per partition on the workers. Your IndexError comes from using data.rdd.map(...): map applies the function to each Row individually, so np.array(data) produces a one-dimensional array and the two-dimensional indexing fails. It also looks like you are working with the RDD API, which is very low level and difficult to use; I would advise you to use the DataFrame API, which is safer and easier.
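For example, a minimal sketch of per-partition training. SGDClassifier is just a placeholder learner, the column split mirrors the one in your question, and numpy and scikit-learn must be installed on every worker:

import numpy as np
from sklearn.linear_model import SGDClassifier

def train_partition(rows):
    # rows is an iterator over the Row objects of one partition
    data_np = np.array([list(r) for r in rows])
    if data_np.size == 0:
        return []  # partitions can be empty
    X = data_np[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
    y = data_np[:, 1]
    clf = SGDClassifier()
    clf.fit(X, y)
    return [clf]  # mapPartitions must return an iterable

# one fitted model per partition, sent back to the driver
models = data.rdd.mapPartitions(train_partition).collect()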

With the DataFrame API you can achieve the same thing with

df.foreachPartition(func)
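Note that foreachPartition is an action that returns nothing, so with it you would persist each fitted model as a side effect instead of collecting results; a sketch, with the output path purely illustrative:

import pickle
import uuid
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_and_save(rows):
    data_np = np.array([list(r) for r in rows])
    if data_np.size == 0:
        return
    X = data_np[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
    y = data_np[:, 1]
    clf = SGDClassifier().fit(X, y)
    # illustrative path on the worker's local disk; use shared storage in practice
    with open("/tmp/model_%s.pkl" % uuid.uuid4().hex, "wb") as f:
        pickle.dump(clf, f)

df.foreachPartition(train_and_save)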

See also: pySpark forEachPartition - Where is code executed
