
I am having trouble understanding the "chunksize" parameter in pool.map.

For the following code, I get the same result whether I pass '2' or nothing for the chunksize parameter.

import multiprocessing
from multiprocessing import Pool
lst_of_lst = [[1,2],[3,4],[5,6],[7,8]]
def count(lst):
    return len(lst)
if __name__ == '__main__':
    P = Pool(2)
    for results in P.map(count,lst_of_lst,2):
        print (results)
    P.close()
    P.join()

The result is always: "2 2 2 2"

With a chunksize of '2', I was expecting [[1,2],[3,4]] to be sent to one worker and [[5,6],[7,8]] to the second worker giving me "2 2" as an answer.

What am I missing? What does the chunksize do?

SecretAgentMan
  • chunksize basically means you are passing in chunks of the information, bit by bit – Axois Jul 26 '19 at 13:45
  • @Axois , thanks. I have a list 130 million values that I split up using a grouper function. This creates a list of lists (1 million x 130) that I send to the worker. How do I select the best parameter for the chunk size? – Nicolas Cadieux Jul 26 '19 at 13:54
  • So if I understand correctly, this parameter has nothing to do with the length of the list (number of elements in the list) but with how many bytes of the list are sent to the workers? So if I choose 1024, the list will be sent in chunks of 1024 bytes? – Nicolas Cadieux Jul 26 '19 at 14:37
  • the `chunksize` parameter operates on your iterable, which in this case is your `lst_of_lst`, so a `chunksize` of 2 means the elements of your iterable are split into chunks of 2. As for why you are getting [2 2 2 2], you can print out `current_process()` to see how many workers there are, but with a list that small it is almost impossible to see. I suggest printing the current process for a list of at least 10000 elements. `chunksize` affects the speed at which the iteration occurs. – Axois Jul 26 '19 at 15:41
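Regarding the comment asking how to pick a chunksize for 130 million items: when `chunksize` is omitted, CPython's `Pool.map` computes one automatically, roughly dividing the iterable into four chunks per worker. A minimal sketch of that heuristic (based on CPython's `multiprocessing/pool.py`; the function name here is my own):

```python
def approx_default_chunksize(n_items, n_workers):
    # Roughly what CPython's Pool.map does when chunksize=None:
    # split the iterable into about 4 chunks per worker, rounding up.
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize

# e.g. 130 million items across 8 workers:
print(approx_default_chunksize(130_000_000, 8))  # 4062500
```

A larger chunksize reduces inter-process communication overhead; a smaller one balances load better when individual tasks take uneven amounts of time.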

1 Answer


Your code is only giving you the length of each of the lists inside your bigger list. `map` always calls the function once per element of the iterable; `chunksize` only controls how many elements are shipped to a worker at a time. So the argument passed to `who_am_I` below is an individual element of the list, not a chunk that was split off. You can use `current_process` to see who's currently doing the work.

import multiprocessing


def who_am_I(x):
    # Each call receives one element of the iterable, not a whole chunk.
    print(multiprocessing.current_process())

if __name__ == '__main__':
    list_of_lists = [[1, 2], [3, 4], [5, 6], [7, 8]]
    with multiprocessing.Pool(8) as pool:
        pool.map(who_am_I, list_of_lists, chunksize=2)

outputs

<ForkProcess(ForkPoolWorker-17, started daemon)>
<ForkProcess(ForkPoolWorker-18, started daemon)>
<ForkProcess(ForkPoolWorker-17, started daemon)>
<ForkProcess(ForkPoolWorker-18, started daemon)>

As you can see, only two workers were used: a chunksize of 2 splits the four elements into just two tasks.
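To confirm that `chunksize` changes only how the work is batched, not what `map` returns, here is a small sketch (using the question's own `count` function) comparing results across several chunk sizes:

```python
from multiprocessing import Pool

def count(lst):
    return len(lst)

if __name__ == '__main__':
    lst_of_lst = [[1, 2], [3, 4], [5, 6], [7, 8]]
    with Pool(2) as pool:
        # One result per element of the iterable, regardless of chunksize.
        for cs in (None, 1, 2, 4):
            print(cs, pool.map(count, lst_of_lst, chunksize=cs))
            # every line prints the same list: [2, 2, 2, 2]
```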

GreenLlama
  • That’s why I was expecting just 2 answers back and not 4. I tried using current_process(), but I sometimes got different answers when I repeated the same script. – Nicolas Cadieux Jul 26 '19 at 13:59
  • So basically, I was expecting the code to count the list items the first time (with no chunksize) and to count the number of lists with the chunksize of 2. – Nicolas Cadieux Jul 26 '19 at 14:48
  • This is kinda wrong btw, `multiprocessing.Pool(8)` means that there are actually 8 workers; you can see that by using a much larger list. – Axois Jul 26 '19 at 15:56
  • Well, there are 8 workers, but only two of them are used, right? – GreenLlama Jul 27 '19 at 10:40
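To see more of the pool's workers in action, as the last two comments suggest, use a larger iterable with a small chunksize so tasks spread across more processes. A rough sketch (exactly how many of the 8 workers pick up tasks varies with machine and timing):

```python
from multiprocessing import Pool, current_process

def whoami(x):
    # Return the name of the worker process that handled this element.
    return current_process().name

if __name__ == '__main__':
    data = list(range(10_000))
    with Pool(8) as pool:
        workers = set(pool.map(whoami, data, chunksize=1))
        # With 10,000 one-element tasks, most or all 8 workers
        # typically show up here.
        print(len(workers))
```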