Is it possible to partition a Spark DataFrame, convert each partition to a NumPy array in parallel, and then use those arrays, e.g., for training learners from scikit-learn?
For example, I tried the following, but I get an index error: IndexError: too many indices for array.
import numpy as np

def files_to_numpy(data):
    data_np = np.array(data)
    X = data_np[:, [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
    y = data_np[:, 1]
    return X, y

data.rdd.map(files_to_numpy)
If I wanted to process the data on the driver, I would have to run collect() after the conversion and then use the indexing syntax. Instead, I want to run this in parallel on the workers (and ideally all subsequent training steps as well, yielding several trained instances of the learner).
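To clarify what I mean by converting per partition: something like the following helper is what I have in mind (a sketch only — the mapPartitions usage and the simulated rows below are my assumptions, not working Spark code):

```python
import numpy as np

def partition_to_numpy(rows):
    # Gather all rows of ONE partition into a single 2-D array,
    # so that 2-D indexing works (mapping over individual rows
    # yields 1-D arrays, which raises "too many indices for array").
    data_np = np.array(list(rows))
    X = data_np[:, [0] + list(range(2, 20))]  # feature columns 0 and 2..19
    y = data_np[:, 1]                         # label column
    yield X, y

# With Spark this would be applied once per partition, e.g.:
# per_partition = data.rdd.mapPartitions(partition_to_numpy)

# Simulated partition: 3 rows with 20 columns each, no Spark needed.
rows = [tuple(range(i, i + 20)) for i in range(3)]
X, y = next(partition_to_numpy(iter(rows)))
```

Each worker could then fit its own estimator on its (X, y) pair inside the same mapPartitions function, which would give one trained learner per partition.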