
I wish to fill the data1 array in the following script using multiprocessing. Right now the script runs fine, but the array doesn't get filled. I tried implementing this, but because of the two iterables I couldn't get it to work. Help appreciated, thanks! By the way, I use Jupyter Notebook on the latest macOS.

import numpy as np
import multiprocessing as mp
from itertools import product

#Generate random data:
data = np.random.randn(12,20,20)

#Create empty array to store the result
data1 = np.zeros(data.shape, dtype=float)

#Define the function
def fn(parameters):
    i   = parameters[0]
    j   = parameters[1]
    data1[:,i,j] =  data[:,i,j]

#Generate processes equal to the number of cores
pool = mp.Pool(processes=4)

# Generate values for each parameter: i.e. i and j
i = range(data.shape[1])
j = range(data.shape[2])

#generate a list of all combinations of the parameters
paramlist = list(product(i,j))

#call the function and multiprocessing
np.array(pool.map(fn, paramlist))
pool.close() 
hrishi
  • What does the `product` function do in your code? What is the exact problem that you encounter? – mommermi Apr 16 '17 at 04:55
  • @mommermi The `product` comes from `itertools`. The problem is that the data1 array remains unfilled after calling the function. I made edits to the original post to clarify this. Thanks! – hrishi Apr 16 '17 at 19:34

1 Answer


What Pool.map does is apply the function to the given data using worker processes; it then gathers the return values from the function and transmits them back to the parent process.

Since your function doesn't return anything, you get no results.

What happens is that each worker modifies its own local copy of data1; the parent's copy is never touched. :-)
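Here is a minimal sketch of the fix: have the worker return its indices together with the computed slice, and let the parent fill data1. This assumes the fork start method, so workers inherit data (the default on Linux; under spawn, the default on recent macOS, the module-level data would be re-created in each worker, so you would pass it explicitly instead).

import numpy as np
import multiprocessing as mp
from itertools import product

data = np.random.randn(12, 20, 20)
data1 = np.zeros(data.shape, dtype=float)

def fn(parameters):
    i, j = parameters
    # Return the indices along with the slice so the parent
    # knows where each result belongs.
    return i, j, data[:, i, j]

if __name__ == '__main__':
    paramlist = list(product(range(data.shape[1]), range(data.shape[2])))
    with mp.Pool(processes=4) as pool:
        results = pool.map(fn, paramlist)
    # Assemble the results in the parent process.
    for i, j, column in results:
        data1[:, i, j] = column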

When you have large amounts of data to be modified, multiprocessing is often not a good solution because of the overhead of moving data between the worker processes and the parent.

Try it using a single process first.
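For comparison, the single-process version is just a plain loop; a hypothetical timing harness (reusing data, data1 and paramlist from the sketch above) might look like:

import time

t0 = time.perf_counter()
for i, j in paramlist:  # uses data, data1, paramlist from above
    data1[:, i, j] = data[:, i, j]
print(f"serial fill took {time.perf_counter() - t0:.4f} s")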

Roland Smith
  • Ah I see, thanks. The script works well after assigning a return value, and it is roughly twice as fast as the serial implementation. My laptop is dual-core, but multiprocessing.cpu_count() gives 4, so maybe there is more room to improve performance? Would you recommend any solution other than multiprocessing? – hrishi Apr 17 '17 at 21:26
  • By default, `multiprocessing` creates as many processes as `cpu_count` reports. If you have two physical cores but `cpu_count` reports 4, your CPU probably has [hyper-threading](https://en.wikipedia.org/wiki/Hyper-threading). I'm unsure how much hyper-threading actually increases performance: there are still only two physical cores. That could be why you only see it run twice as fast. – Roland Smith Aug 09 '17 at 18:35
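A quick way to check the two counts yourself (psutil is a third-party package and only an assumption here, shown in case it is installed):

import multiprocessing as mp

print(mp.cpu_count())  # logical CPUs: 4 on a dual-core with hyper-threading

# The third-party psutil package can report physical cores:
# import psutil
# print(psutil.cpu_count(logical=False))  # physical cores: 2 in this case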