I have the following script:

import multiprocessing

import numpy as np

max_number = 100000
minimums = np.full(max_number, np.inf, dtype=np.float32)
data = np.zeros((max_number, 128, 128, 128), dtype=np.uint8)

def worker(start, end):
    for in_idx in range(start, end):
        value = data[in_idx].min()  # placeholder: compute something from this sub-array
        minimums[in_idx] = value

def main():

    jobs = []
    num_jobs = 5
    for i in range(num_jobs):
        start = i * (max_number // num_jobs)
        end = (i + 1) * (max_number // num_jobs)

        p = multiprocessing.Process(name=('worker_' + str(i)), target=worker, args=(start, end))
        jobs.append(p)
        p.start()

    for proc in jobs:
        proc.join()
    print(jobs)

if __name__ == '__main__':
    main()

How can I ensure that the numpy array is global and can be accessed by each worker? Each worker uses a different part of the numpy array.

  • Do you really want to share the array between processes? Or would it be enough to share parts of the array with each worker? – Dschoni Aug 31 '17 at 12:02
  • All I want to do is split up my array in order to calculate something in parallel, and in the end I want the whole array to be fully calculated. I do not really need to share anything between the processes, since every worker gets a different range of the original numpy array. @Dschoni –  Aug 31 '17 at 12:04
  • In this case: use a callback function that copies the result of your worker to a global array. Pass only the indices to the worker to save on overhead. Do the calculations overlap, or are they independent of each other? – Dschoni Aug 31 '17 at 12:07
  • They are independent of each other, so they do not overlap. However, I do not know how to store a global numpy array and use it in the worker. Could you show me? @Dschoni –  Aug 31 '17 at 12:08
  • I have updated the answer for a better understanding. @Dschoni –  Aug 31 '17 at 12:12
  • Do I even need a callback function? I am very new to Python and I am not sure how this works. I thought I could just pass the whole array every time, only calculate parts of it with each worker, and then the whole array is calculated and I can just use it without any callback functions or anything else. @Dschoni –  Aug 31 '17 at 12:15
  • You CAN pass the array each time. However, if you do this, you will fill your RAM very fast if arrays get bigger. Subprocesses are a little overkill. I guess it would be enough to use a pool of workers. I'll post a minimal example in a sec. – Dschoni Aug 31 '17 at 12:19
  • Okay, thank you! I just need speed! My array is like `(10,000, 128, 128, 128)`, and what I basically want to do is split it into little chunks like `(2,500, 128, 128, 128)` --> 4 chunks, and then calculate the chunks separately. @Dschoni –  Aug 31 '17 at 12:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153364/discussion-between-dschoni-and-thigi). – Dschoni Aug 31 '17 at 12:26

1 Answer

import numpy as np
import multiprocessing as mp

ar = np.zeros((5,5))

def callback_function(result):
    # Runs in the main process and writes the worker's result into the global array.
    x, y, data = result
    ar[x, y] = data

def worker(num):
    # Runs in a child process; only reads from (its copy of) ar.
    data = ar[num, num] + 3
    return num, num, data

def apply_async_with_callback():
    pool = mp.Pool(processes=5)
    for i in range(5):
        pool.apply_async(worker, args=(i,), callback=callback_function)
    pool.close()
    pool.join()
    print("Multiprocessing done!")

if __name__ == '__main__':
    ar = np.ones((5,5)) # rebinds the global ar before the pool starts; the workers and the callback will see this array
    apply_async_with_callback()

Explanation: you set up your data array, your worker function and your callback function. The `processes` argument to the pool sets up that number of independent worker processes, and each worker can handle more than one task. The workers only read from the array; the callback runs in the main process and writes each result back into the array.

The `if __name__ == '__main__':` guard protects the lines below it from being run at each import, which matters because multiprocessing may re-import the module in child processes.
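For the chunked use case from the question, the same pattern might look like the sketch below. This is an illustration with assumptions, not part of the original answer: the shapes are scaled down so it runs in little memory, and the per-chunk minimum stands in for whatever "compute something" refers to.

import numpy as np
import multiprocessing as mp

max_number = 1000
minimums = np.full(max_number, np.inf, dtype=np.float32)
data = np.zeros((max_number, 16, 16, 16), dtype=np.uint8)  # scaled down from 128x128x128

def worker(start, end):
    # Child process: read-only access to (its copy of) the global data.
    values = data[start:end].min(axis=(1, 2, 3))  # placeholder computation
    return start, end, values

def callback(result):
    # Main process: write the chunk's results into the global array.
    start, end, values = result
    minimums[start:end] = values

def main():
    num_jobs = 5
    chunk = max_number // num_jobs
    pool = mp.Pool(processes=num_jobs)
    for i in range(num_jobs):
        pool.apply_async(worker, args=(i * chunk, (i + 1) * chunk), callback=callback)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
    print(minimums[:5])

Because each worker returns one result per chunk rather than per element, the callback runs only `num_jobs` times, which keeps the writing overhead in the main process small.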

Dschoni
  • And what if I have the `ar` in another function, like a `main()` function? –  Aug 31 '17 at 12:29
  • I am calculating the `ar` array and its size in your `apply_async_with_callback()` function. This is my main problem. So how do I make this array global? –  Aug 31 '17 at 12:32
  • Then return it from the function before you run multiprocessing on it. The above code is just a minimal example; you can always extend it to suit your needs. – Dschoni Aug 31 '17 at 12:33
  • Which function? Could you quickly adapt the code so that the `ar` array is initialised in the `apply_async_with_callback()` function? Sorry, but I want to get this 100% right :) –  Aug 31 '17 at 12:34
  • Better not to mix the multiprocessing parts and the initialisation with each other. Write a separate routine that does all your inits and call it before the multiprocessing; that should do the trick. – Dschoni Aug 31 '17 at 12:38
  • Updated my answer, to reflect how the setup can be done inside `main()`. – Dschoni Aug 31 '17 at 12:41
  • Why do I even need the callback function? Instead of this: `data = ar[num,num]+3` could I not do: `ar = ar[num,num]+3` ? –  Aug 31 '17 at 12:41
  • I am not sure you understood my question properly. Could you have a look at my code again please? I think your answer is not what I wanted to do. –  Aug 31 '17 at 12:47
  • This is due to the inner workings of multiprocessing. You COULD make that work, but then, every worker would hold a copy of the array. In my example, the workers are read-only, the callback is living in the main process and handles the writes to memory. So there is no need for locking, synchronizing etc. It depends a little on your exact problem if this approach suits you. – Dschoni Aug 31 '17 at 13:06
  • I'm pretty sure I got your question. I'll update my code to show, how you can work on chunks of data. – Dschoni Aug 31 '17 at 13:09
  • Can you please use a process? I just read that a pool only uses one process, and I want multiple CPUs to be used for my code since my data is huge! I am working on a solution using `Process` rather than `Pool`! –  Aug 31 '17 at 13:10
  • Please follow my code if possible! I definitely want to use `multiprocessing.Process()`! I need very good performance and I have a lot of CPUs available (see the `Process` + shared-memory sketch after this thread). –  Aug 31 '17 at 13:11
  • I have also updated my code! So now it should be quite easy to fix my minor bugs, or am I completely wrong? –  Aug 31 '17 at 13:14
  • No, it doesn't. A pool uses the number of processes you pass to it via the `processes=x` argument. I'm using that code on 64 CPUs with 64 processes; works like a charm. If you want further help, please move to chat. – Dschoni Aug 31 '17 at 13:16
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153370/discussion-between-thigi-and-dschoni). –  Aug 31 '17 at 13:16
  • This does not work! The global variables do not get assigned in the callback function... –  Sep 01 '17 at 14:17
  • here: [link](https://stackoverflow.com/questions/46001137/global-array-not-assigned-when-using-multiprocessing-callback-functions-with-pyt) –  Sep 01 '17 at 14:18
  • For me it works, the problem in the linked question is a misinterpretation of the code shown here. – Dschoni Sep 06 '17 at 12:32
  • I really cannot see where the difference is. You are using `ar = np.zeros((5,5))` and then in the callback you are using `ar[x,y] = data`? @Dschoni –  Sep 06 '17 at 14:46
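
For reference, here is a minimal sketch of the `multiprocessing.Process` route the asker insisted on. With plain `Process` workers, assignments to an ordinary global numpy array are lost, because each child only mutates its own copy; the standard-library way around that is a shared-memory buffer such as `multiprocessing.Array`, viewed through numpy. This is an editorial illustration, not code from the thread:

import numpy as np
import multiprocessing as mp

def worker(shared, start, end):
    # Re-create a numpy view onto the shared buffer inside the child process.
    minimums = np.frombuffer(shared.get_obj(), dtype=np.float32)
    for i in range(start, end):
        minimums[i] = i  # placeholder computation

def main():
    n = 1000
    num_jobs = 5
    chunk = n // num_jobs
    # 'f' is a C float (float32). The wrapper's lock is not strictly needed
    # here, since the workers write disjoint ranges.
    shared = mp.Array('f', n)
    jobs = []
    for i in range(num_jobs):
        p = mp.Process(target=worker, args=(shared, i * chunk, (i + 1) * chunk))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    minimums = np.frombuffer(shared.get_obj(), dtype=np.float32)
    print(minimums[:10])

if __name__ == '__main__':
    main()

The writes land in shared memory, so they are visible to the parent after `join()`; unlike the callback approach, no results need to be pickled back to the main process.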