In python2, I would like to fill a global array by filling with parallel processes (or threads) different sub-arrays (there is a total 16 blocks). I must precise that each block doesn't depend of the others, I mean when I perfom the assignement of each cells of the current block.
1) From what I have found, I would have a great benefit from a CPU multi-cores by using different "processes
" but it seems a little bit complicated to share the global array by all others processes.
2) From another point of view, I can use "threads
" instead of "processes
" since the implementation is less hard. I have found out the libray "ThreadPool
" from "multiprocessing.dummy
" allows to share this global array by all others concurrent threads.
For example, with python2.7, the following code works :
from multiprocessing.dummy import Pool as ThreadPool
## discretization along x-axis and y-axis for each block
arrayCross_k = np.linspace(kMIN, kMAX, dimPoints)
arrayCross_mu = np.linspace(-1, 1, dimPoints)
# Build all big matrix with N total blocks = dimBlock*dimBlock = 16 here
arrayFullCross = np.zeros((dimBlocks, dimBlocks, arrayCross_k.size, arrayCross_mu.size))
dimBlocks = 4
# Size of dimension along k and mu axis
dimPoints = 100
# dimension along one dimension of global arrayFullCross
dimMatCovCross = dimBlocks*dimPoints
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
# Fill the (xb,yb) su-block of global array by
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb] , z, 10**P_m(np.log10(arrayCross_k[ub])),
...
...
# End of function buildCrossMatrix_loop
# Main loop
while i < len(zrange):
def generatorCrossMatrix(index):
for igen in range(dimBlocks):
for lgen in range(dimBlocks):
yield igen, lgen, index
if __name__ == '__main__':
# Use 20 threads
pool = ThreadPool(20)
pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
# Increment index "i"
i = i+1
But unfortunately, even by using 20 threads, I realize that the cores of my CPU are not fully running (actually, with 'top' or 'htop' command, I only see a single process at 100%).
3) What is the strategy that I have to chose if I want to full exploit the 16 cores of my CPU (like this is the case with pool.map(function, generator)) but with also the sharing of global array
?
4) some people told me to do I/O for each sub-array (basically, write each block in a file and gather all sub-arrays by reading them and get the full array filled). This solution is handy but I would like to avoid I/O (unless there is really not other solutions).
5) I have practised MPI library
with C language
and the operation of filling sub-array and finally gather them to build a big array, is not very complicated. However, I wouldn't like to use MPI
with Python language (I don't know even if it exists).
6) I tried also to use Process
with target equal to my filling function (buildCrossMatrix_loop
) like this into while
Main loop above :
from multiprocessing import Process
# Main loop on z range
while i < len(zrange):
params_p = []
for ip in range(4):
for jp in range(4):
params_p.append(ip)
params_p.append(jp)
params_p.append(i)
p = Process(target=buildCrossMatrix_loop, args=(params_p,))
params_p = []
p.start()
# Finished : wait everybody
p.join()
...
...
i = i+1
# End of main while loop
But the final 2D global array is filled only of zeros. So I must deduce that Process
function doesn't share the array to fill ?
7) So which strategy I have to look for ? :
1. The using of "pool processes" and find a way to share the global array knowing all my 16-cores will be running
2. The using of "Threads" and share the global array but performances, at first sight, seems to be less good than with "pool processes". Maybe there is a way to increase the power of each "Threads", I mean like with "pool processes" ?
I tried to follow the different examples on https://docs.python.org/2/library/multiprocessing.html but without success, this is to say, without relevant performances from a speed-up point of view.
I think that in my case, the major issue is the gathering of all sub-arrays OR the fact that the global array arrayFullCross
is not shared by other processes or threads.
If someone had a simple example of the sharing of global variable in a multi-threading context (here an array), this would nice to put it here.
UPDATE 1: I made test with the Threading
(and not multiprocessing
) but performances remain rather bad. GIL is not apparently unlocked, i.e only one process appears in htop
command (maybe the version of Threading library is not the right one).
So I am going to try to handle my issue with using the "return" method.
Naively, I tried to return the whole array at the end of the function on which I apply the map
function, like this :
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb])
...
... #others assignments on arrayFullCross elements
# Return global array to main process
return arrayFullCross
Then, I tried to receive this global array from map
like this :
if __name__ == '__main__':
pool = Pool(16)
outputArray = pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
pool.terminate()
## Print outputArray
print 'outputArray = ', outputArray
## Reshape 4D outputArray to 2D array
arrayFullCross2D_swap = np.array(outputArray).swapaxes(1,2).reshape(dimMatCovCross,dimMatCovCross)
Unfortunately, when I print the outputArray
, I get :
outputArray = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
This is not the 4D outputArray expected, just a list of 16 None (I think that number of 16 correspond to the number of processes provided by generatorCrossMatrix(i)
).
How could I get back the whole 4D array once map
is launched and when it has finished ?