
If I have a pool object with 2 worker processes, for example:

p = multiprocessing.Pool(2)

and I want to iterate over a list of files in a directory using the map function,

could someone explain what the chunksize parameter of this function does:

p.map(func, iterable[, chunksize])

If I set chunksize to 10, for example, does that mean that every 10 files will be processed by one process?

martineau
sergio

1 Answer


Looking at the documentation for Pool.map it seems you're almost correct: the chunksize parameter will cause the iterable to be split into pieces of approximately that size, and each piece is submitted as a separate task.

So in your example, yes, map will take the first 10 files (approximately) and submit them as one task to a single worker process... then the next 10 will be submitted as another task, and so on. Note that this doesn't mean the workers will alternate every 10 files; it's quite possible that worker #1 ends up getting files 1-10 AND 11-20, while worker #2 gets 21-30 and 31-40.

detly
    @newkid - nothing special, whatever you're iterating over will be split into pieces of *approximately* one "thing" per processor. – detly Dec 16 '20 at 14:20
  • is there a way to know what a 'good' chunk size would be to use? e.g. I need to run a simulation 1k times and I have 10 processors to work with... would it just make sense to use a chunk size of 100? that would mean that each processor gets roughly the same amount of simulations to run, right? – David Feb 18 '21 at 21:55
  • 1
    @DavidIreland It's very much "rule of thumb" territory. I wouldn't go straight to number of tasks divided by number of processors, because if one chunk happened to finish early due to parameters or random variation, you've got one processor sitting there idle. – detly Feb 19 '21 at 00:54
  • 2
    @DavidIreland I generally go the other way - if there's not a lot of overhead sending each individual task (eg. you're only passing a couple of parameters by value), use a chunksize of one as your theoretical starting point and think about how long each task takes AND how big the variance is. You want the chunk size to be such that the variation in time matches (to an order of magnitude) the time it takes to process a single chunk. – detly Feb 19 '21 at 00:58
  • 1
    thanks for the reply! before I learnt about chunk sizing I was using the default (1) and it was taking longer than not using MP, presumably because of the overheads. I didn't think about if a process finished due to variance in the simulation that one processor could be left idle so I guess I'll have to choose a chunk size big enough to overcome the overhead of assigning the task to the processor but not so big that a processor could be idle. Thanks a lot! – David Feb 19 '21 at 10:25
  • 1
    Note; the pool chunksize runs a len() on the iterable - meaning it will evaluate a QuerySet if you're using django models. - Might be best to chunk it yourself. – Julian Camilleri Sep 22 '21 at 09:07