1

I've searched Python 3 Docs, example paralized code, and threads regarding setting the chunksize argument, but I am still at a loss.

I have multiple large arrays that are passed as arguments to a function that I need to parallelize. After I zip the function arguments to make it an iteratable, the starmap function seems to pass the arguments with the chunksize = 1, even when I pass a different chunksize argument to starmap. Ideally, I need to chop up my arrays into X chunks, and send them to X CPUs. What am I doing wrong?

I used the following example code

import multiprocessing
import numpy as np

def square(x, y):
    if type(x) == int or type(y) == int:
        print(x, y, 'Passed one pair of values')
    return np.multiply(x,y)

a = range(100)
b = range(100)
use_num_cpu = multiprocessing.cpu_count()-1
pl = multiprocessing.Pool(processes = use_num_cpu)
args = zip(a, b)
M = pl.starmap(func = square, iterable = args, chunksize = 10)

The output looks like this:

0 0 Passed one pair of values
10 10 Passed one pair of values
1 1 Passed one pair of values
11 11 Passed one pair of values
20 20 Passed one pair of values
2 2 Passed one pair of values
12 12 Passed one pair of values
13 13 Passed one pair of values
3 3 Passed one pair of values
14 14 Passed one pair of values
15 15 Passed one pair of values
4 4 Passed one pair of values
16 16 Passed one pair of values
5 5 Passed one pair of values
...
90 90 Passed one pair of values
77 77 Passed one pair of values
78 78 Passed one pair of values
91 91 Passed one pair of values
79 79 Passed one pair of values
92 92 Passed one pair of values
93 93 Passed one pair of values
94 94 Passed one pair of values
95 95 Passed one pair of values
96 96 Passed one pair of values
97 97 Passed one pair of values
98 98 Passed one pair of values
99 99 Passed one pair of values
Community
  • 1
  • 1
  • I think it's doing what you expect? The function `square` does not accept an iterable but a single tuple pair. The printout order is garbled, suggesting different processes running at different speeds. How have you verified that it hasn't chunked the `args` and passed each chunk to a different process, with each chunk being processed iteratively? – roganjosh Oct 17 '16 at 18:26
  • I guess I am a naive Python user, but I was under the impression that square(x, y) function accepts arrays/lists as well. Is there a keyword that I am missing? – Mike Shumko Oct 17 '16 at 18:37
  • `chunksize` is part of the handshake between the parent and child agents as they ship payload between processes but doesn't affect how the child worker is called. With a chunksize greater than 1, the child is more likely to have data at the ready and won't have to fetch more from the parent. It is useful when you have relatively large amounts of data and relatively fast processing of the data. But its just an implementation detail - the child worker is called just as many times, no matter the size. – tdelaney Oct 17 '16 at 18:49
  • `if type(x) or type(y) == int:` this is flawed I think. I think you need `if type(x) == int and type(y) == int:`. Your current code will accept a list with strings in it and print the full lists. I removed `np.multiply` to check as that shouldn't be possible as per my understanding of your intentions? – roganjosh Oct 17 '16 at 18:57
  • Sorry I did not specify it, but `if type(x) or type(y) == int`: is a test to debug what is passed into the function. I will need to iterate over `x` and `y`, which thew errors before, so I need `x` and `y` to be arrays. Since the program goes into the if statement, at least one of them is not. It does not throw an error since `np.multiply()` handles integer multiplication as well. – Mike Shumko Oct 17 '16 at 18:58
  • Ok, I have no experience with `starmap` but " I need to chop up my arrays into X chunks, and send them to X CPUs" sounds similar to my [answer here](http://stackoverflow.com/questions/39750873/python-multi-threading-in-a-recordset/39753853#39753853). I really wish `multiprocessing` had examples relevant to the tasks that require multiprocessing because it seems like a black box in many cases. I don't know how to address your requirement with `Pool` – roganjosh Oct 17 '16 at 19:15
  • [Python multiprocessing: understanding logic behind chunksize](https://stackoverflow.com/q/53751050/9059420) – Darkonaut Feb 26 '19 at 14:54

0 Answers0