"Processes" or "Threads" strategy to fill independant blocks into global array?

Question

In python2, I would like to fill a global array by filling with parallel processes (or threads) different sub-arrays (there is a total 16 blocks). I must precise that each block doesn't depend of the others, I mean when I perfom the assignement of each cells of the current block.

1) From what I have found, I would have a great benefit from a CPU multi-cores by using different "processes" but it seems a little bit complicated to share the global array by all others processes.

2) From another point of view, I can use "threads" instead of "processes" since the implementation is less hard. I have found out the libray "ThreadPool" from "multiprocessing.dummy" allows to share this global array by all others concurrent threads.

For example, with python2.7, the following code works :

from multiprocessing.dummy import Pool as ThreadPool

## discretization along x-axis and y-axis for each block
arrayCross_k = np.linspace(kMIN, kMAX, dimPoints)
arrayCross_mu = np.linspace(-1, 1, dimPoints)
# Build all big matrix with N total blocks = dimBlock*dimBlock = 16 here
arrayFullCross = np.zeros((dimBlocks, dimBlocks, arrayCross_k.size, arrayCross_mu.size))
dimBlocks = 4
# Size of dimension along k and mu axis
dimPoints = 100
# dimension along one dimension of global arrayFullCross
dimMatCovCross = dimBlocks*dimPoints

# Build cross-correlation matrix 
def buildCrossMatrix_loop(params_array):
  # rows indices
  xb = params_array[0]
  # columns indices
  yb = params_array[1]
  # Current redshift
  z = zrange[params_array[2]]
  # Loop inside block
  for ub in range(dimPoints):
    for vb in range(dimPoints):
      # Diagonal blocs 
      if (xb == yb):
      # Fill the (xb,yb) su-block of global array by 
        arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb] , z, 10**P_m(np.log10(arrayCross_k[ub])),

        ...
        ...

# End of function buildCrossMatrix_loop

# Main loop
while i < len(zrange):

  def generatorCrossMatrix(index):
    for igen in range(dimBlocks):
      for lgen in range(dimBlocks):
        yield igen, lgen, index

if __name__ == '__main__':

  # Use 20 threads
  pool = ThreadPool(20)
  pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))

  # Increment index "i"
  i = i+1

But unfortunately, even by using 20 threads, I realize that the cores of my CPU are not fully running (actually, with 'top' or 'htop' command, I only see a single process at 100%).

3) What is the strategy that I have to chose if I want to full exploit the 16 cores of my CPU (like this is the case with pool.map(function, generator)) but with also the sharing of global array ?

4) some people told me to do I/O for each sub-array (basically, write each block in a file and gather all sub-arrays by reading them and get the full array filled). This solution is handy but I would like to avoid I/O (unless there is really not other solutions).

5) I have practised MPI library with C language and the operation of filling sub-array and finally gather them to build a big array, is not very complicated. However, I wouldn't like to use MPI with Python language (I don't know even if it exists).

6) I tried also to use Process with target equal to my filling function (buildCrossMatrix_loop) like this into while Main loop above :

from multiprocessing import Process

# Main loop on z range
while i < len(zrange):

  params_p = []
  for ip in range(4):
    for jp in range(4):
      params_p.append(ip)
      params_p.append(jp)
      params_p.append(i)
      p = Process(target=buildCrossMatrix_loop, args=(params_p,))
      params_p = []
      p.start()

  # Finished : wait everybody
  p.join()

  ...
  ...

  i = i+1
  # End of main while loop

But the final 2D global array is filled only of zeros. So I must deduce that Process function doesn't share the array to fill ?

7) So which strategy I have to look for ? :

1. The using of "pool processes" and find a way to share the global array knowing all my 16-cores will be running

2. The using of "Threads" and share the global array but performances, at first sight, seems to be less good than with "pool processes". Maybe there is a way to increase the power of each "Threads", I mean like with "pool processes" ?

I tried to follow the different examples on https://docs.python.org/2/library/multiprocessing.html but without success, this is to say, without relevant performances from a speed-up point of view.

I think that in my case, the major issue is the gathering of all sub-arrays OR the fact that the global array arrayFullCross is not shared by other processes or threads.

If someone had a simple example of the sharing of global variable in a multi-threading context (here an array), this would nice to put it here.

UPDATE 1: I made test with the Threading (and not multiprocessing) but performances remain rather bad. GIL is not apparently unlocked, i.e only one process appears in htop command (maybe the version of Threading library is not the right one).

So I am going to try to handle my issue with using the "return" method.

Naively, I tried to return the whole array at the end of the function on which I apply the map function, like this :

# Build cross-correlation matrix 
def buildCrossMatrix_loop(params_array):

  # rows indices
  xb = params_array[0]
  # columns indices
  yb = params_array[1]
  # Current redshift
  z = zrange[params_array[2]]
  # Loop inside block
  for ub in range(dimPoints):
    for vb in range(dimPoints):
      # Diagonal blocs 
      if (xb == yb):         
        arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb])

      ... 
      ... #others assignments on arrayFullCross elements

  # Return global array to main process
  return arrayFullCross

Then, I tried to receive this global array from map like this :

if __name__ == '__main__':

  pool = Pool(16)
  outputArray = pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
  pool.terminate()
  ## Print outputArray
  print 'outputArray = ', outputArray

  ## Reshape 4D outputArray to 2D array
  arrayFullCross2D_swap = np.array(outputArray).swapaxes(1,2).reshape(dimMatCovCross,dimMatCovCross)

Unfortunately, when I print the outputArray, I get :

outputArray =  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

This is not the 4D outputArray expected, just a list of 16 None (I think that number of 16 correspond to the number of processes provided by generatorCrossMatrix(i)).

How could I get back the whole 4D array once map is launched and when it has finished ?

Change `multiprocessing.dummy` to `multiprocessing`... the `.dummy` module versions are fake versions that do not provide the actual multithreading/multiprocessing capabilities! — Bakuriu, Feb 14 '19 at 21:37

score 0 · Accepted Answer · answered Feb 14 '19 at 21:42

First of all I believe multiprocessing.ThreadPool is a private API so you should avoid it. Now multiprocessing.dummy is a useless module. It does not do any multithreading/processing that's why you don't see any benefit. You should use the "plain" multiprocessing module.

The second code does not work because it is using multiple processes. Processes do not share memory, so the changes you do in a subprocess are not reflected in the other subprocesses or the main process. You either want to:

Return the value and combine them together in the main process, for example using multiprocessing.Pool.map
Use threading instead of multiprocessing: just replaceimport multiprocessingwithimport threadingandmultiprocessing.Processwiththreading.Thread` and the code should work.

Note that the threading version will work only because numpy releases the GIL during computations, otherwise it would be stuck at 1 CPU.

You may want to look at this similar question which I answered a couple of minutes ago.

Thanks for your help. The `Threading`version has performances relatively mean, so I try to find a solution with your second suggestion, i.e with the **return** object from the function called by `map`, could you take a look please at my **UPDATE 1** above, regards — , Feb 14 '19 at 23:38
@youpilat13 You should really try to most a complete version of your code. Python returns `None` when it ends up finishing a function without encountering an explicit `return` statement. I believe you might have a path in your code that leads to such situation. — Bakuriu, Feb 15 '19 at 18:22

"Processes" or "Threads" strategy to fill independant blocks into global array?

1 Answers1