
I am very new to parallel/distributed programming. I have an HPC cluster with a large number of nodes. Within one project I came across parallel/distributed programming and spent two days exhaustively studying it. However, there is not much information about this combination of mpi4py + scikit-learn. I want my code to run on several nodes. I read a bunch of articles and websites and studied how to implement basic actions with mpi4py. My setup is as follows:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
name = MPI.Get_processor_name()

if rank == 0:
    ...  # prepare data, define variables
else:
    ...  # define the same variable names, set to None
output_chunks = np.empty(chunk_shape)  # receive buffer, needed on every rank (chunk_shape is a placeholder)
comm.Scatterv([numpy_array, params], output_chunks, root=0)
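
Here is the fuller, runnable version of the scatter step that I pieced together from the docs; a minimal sketch, assuming a 1-D float64 array split into near-equal chunks ("counts" and "displs" are the per-rank element counts and offsets that Scatterv expects, and all the variable names are mine):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(100, dtype='float64')        # placeholder dataset
    q, r = divmod(len(data), size)
    counts = np.array([q + (1 if i < r else 0) for i in range(size)],
                      dtype='int64')              # elements per rank
    displs = np.concatenate(([0], np.cumsum(counts)[:-1]))
else:
    data = None
    counts = np.empty(size, dtype='int64')
    displs = None

comm.Bcast(counts, root=0)            # every rank must size its own buffer
chunk = np.empty(counts[rank], dtype='float64')
comm.Scatterv([data, counts, displs, MPI.DOUBLE], chunk, root=0)
print(f"rank {rank} got {chunk.size} elements")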

The first question relates to the concept of parallel/distributed programming: does this mean that any function applied to "output_chunks" will run on each node with its own piece of data? For example:

summ_along_axis0(output_chunks)
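
My current understanding, expressed as a sketch (the "chunk" array here is a stand-in for whatever Scatterv delivered to the rank): the same line executes on every rank, each on its own data, and a collective such as Reduce combines the partial results.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# stand-in for the chunk Scatterv delivered to this rank
chunk = np.full((5, 3), rank, dtype='float64')

local_sum = chunk.sum(axis=0)     # this same line runs on every rank
total = np.zeros_like(local_sum)
comm.Reduce(local_sum, total, op=MPI.SUM, root=0)   # combine partials on rank 0
if rank == 0:
    print("global sum along axis 0:", total)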

The next question is about the first answer to this question: why do we need to broadcast all the variables created under "rank == 0"? (Maybe this is a stupid question that reveals my misunderstanding of the whole concept.)
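
For concreteness, this is the pattern I keep seeing in examples (a sketch; "n_clusters" is just an arbitrary variable I made up): bcast is collective, so every rank calls it, and the name must already exist everywhere before the call, even though only rank 0 has the real value.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    n_clusters = 8       # real value exists only on the root
else:
    n_clusters = None    # placeholder so the name exists on every rank
n_clusters = comm.bcast(n_clusters, root=0)   # now all ranks hold 8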

I need to run scikit-learn code on the cluster. My task is to implement, for example, KMeans with k-fold cross-validation. My understanding of that is:

  1. Distribute (scatter) the dataset, the targets (y), and possibly groups (in the case of group folding) to the nodes.
  2. Apply "fit" to the "output_chunks" as in the code above.
  3. Gather the results from each node (do I really need to do that?); see the sketch below.

Am I understanding the workflow right?
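
Concretely, here is how I imagine the whole thing fitting together; a sketch under my own assumptions (placeholder dataset, 10 folds, KMeans parameters I made up). I am not even sure scattering rows is right for k-fold, since every fold needs access to all rows, so in this sketch I broadcast the full dataset and split the folds across the ranks instead:

from mpi4py import MPI
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    X = np.random.rand(10000, 16)   # placeholder dataset
else:
    X = None
X = comm.bcast(X, root=0)           # every rank needs all rows for its folds

kf = KFold(n_splits=10, shuffle=True, random_state=0)
my_splits = list(kf.split(X))[rank::size]   # round-robin folds over ranks

local_scores = []
for train_idx, test_idx in my_splits:
    km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X[train_idx])
    local_scores.append(km.score(X[test_idx]))  # negative inertia on held-out fold

all_scores = comm.gather(local_scores, root=0)  # step 3: collect on rank 0
if rank == 0:
    flat = [s for part in all_scores for s in part]
    print("mean CV score:", np.mean(flat))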

If possible, can anyone please briefly explain the whole concept of this type of programming? I would be very thankful for answers to my questions here. I don't have much time to study this properly, because my project schedule is very tight and I don't specialize in this type of programming.
