
Here is my code:

from multiprocessing import cpu_count
from multiprocessing.dummy import Pool

def process_board(elems):
    # do something
    pass

for _ in range(1000):
    with Pool(cpu_count()) as p:
        _ = p.map(process_board, enumerate(some_array))

and this is Activity Monitor on my Mac while the code is running: [Activity Monitor screenshot]

I can confirm that len(some_array) > 1000, so there is definitely more work that could be distributed, but that does not seem to be happening... what am I missing?

Update:
I tried chunking the elements to see if it makes any difference:

# elements per chunk -> time taken
# 100 -> 31.9 sec
# 50 -> 31.8 sec
# 20 -> 31.6 sec
# 10 -> 32 sec
# 5  -> 32 sec

Consider that I have around 1000 elements, so 100 elements per chunk means 10 chunks. This is my CPU load during the tests: [CPU load screenshot]

As you can see, changing the number of chunks does not help make use of the last 4 CPUs...
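For reference, the chunking looked roughly like this (a minimal sketch; `make_chunks`, `process_chunk`, and the chunk size of 100 are illustrative, since the exact chunking code is not shown above):

from multiprocessing import cpu_count
from multiprocessing.dummy import Pool

def process_chunk(chunk):
    # run process_board over every (index, element) pair in this chunk
    for item in chunk:
        process_board(item)

def make_chunks(items, chunk_size):
    # split the enumerated array into lists of at most chunk_size elements
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

chunks = make_chunks(list(enumerate(some_array)), 100)  # 100 per chunk -> ~10 chunks
with Pool(cpu_count()) as p:
    p.map(process_chunk, chunks)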

Alberto Sinigaglia
  • How long does `process_board` take to finish? Sometimes you can see this behavior if the actual work to be done is quick - by the time you have finished submitting task 6, task 1 is already completed, so task 7 would get scheduled again on the now free core 1. – wim Dec 21 '22 at 23:13
  • Just in case something fishy is going on: what's the value of `cpu_count`? – Michael Ruth Dec 21 '22 at 23:14
  • How large is `some_array`? `map` will run all of the tasks in `some_array` to completion before returning and letting the next round of the loop run. `map` will chunk data - so, if you have say, 160 tasks on 16 cores, you'll get 10 tasks per core. Now if these tasks vary in execution time, you see some subprocesses finish well before the others. You could experiment with, say, `chunksize=1` in the map call. – tdelaney Dec 21 '22 at 23:17
  • here is a running program that should fully commit your cores. Run it and see what you get. If you do get 100% on all, then the question is how yours is different. https://pastebin.com/DhVHCLJJ – tdelaney Dec 21 '22 at 23:37
  • @wim yes it's pretty simple, I would say almost O(1), it's just that it has to be done hundreds of times – Alberto Sinigaglia Dec 21 '22 at 23:43
  • @MichaelRuth `cpu_count()` returns 10, I have a MacBook Pro M1 Max with 32 GB RAM and a 24-core GPU – Alberto Sinigaglia Dec 21 '22 at 23:44
  • @AlbertoSinigaglia If you add a `time.sleep(2)` into the start of the worker function, do you then see all 10 cores get scheduled? If so, then you have the behavior I mentioned in my earlier comment. You may want to batch together tasks - a process pool executor is not so good for very short tasks. – wim Dec 22 '22 at 03:31
  • @wim thank you so much, I'll test it in an hour and will let you know... is there an "out of the box" way to batch it (like using processes instead of threads), or should I do something like this? https://stackoverflow.com/questions/312443/how-do-i-split-a-list-into-equally-sized-chunks – Alberto Sinigaglia Dec 22 '22 at 14:13
  • @wim updated the question with chunking... consider that the times are the sum of the times that 10 simulations took, so each simulation takes more or less 3.2 seconds to finish – Alberto Sinigaglia Dec 22 '22 at 17:15
  • @tdelaney no difference, however I noticed that I'm using `mp.dummy.Pool`, but you are using `mp.Pool`... which does not work on my notebook because it has some problems with function references – Alberto Sinigaglia Dec 22 '22 at 17:23
  • @tdelaney yes, that was what was causing all of this: after moving the function into a separate file to solve the error I reported before (see https://stackoverflow.com/questions/41385708/multiprocessing-example-giving-attributeerror), it now takes full advantage of the hardware. Please post an answer reporting that the problem was using `mp.dummy.Pool` – Alberto Sinigaglia Dec 22 '22 at 17:54
  • @tdelaney spoiler: using mp.pool is slower because the overhead induced by threads is much more than the speedup – Alberto Sinigaglia Dec 23 '22 at 01:07
  • @AlbertoSinigaglia - I didn't even notice it was mp.dummy, glad you saw that. Performance can be a problem with multiprocessing. If the cost of transferring the data to and from the subprocess is more than the processing, you lose ground. Sometimes you can work around that. If you have lots of data input but small data output (even better if it's written to disk) and you are on a Unix-like system, you can avoid the copy of data to the subprocess and get a speedup. – tdelaney Dec 23 '22 at 23:59
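A minimal sketch of the data-copy point in the comment just above, assuming a Unix-like OS where the "fork" start method is available (the names `big_data` and `worker` are illustrative): forked workers inherit the large input from the parent, so only a small index and a small result are pickled per task.

import multiprocessing as mp

big_data = None  # large, read-only input; populated before the pool is created

def worker(i):
    # read the inherited data in place and return only a small result
    return sum(big_data[i])

if __name__ == "__main__":
    big_data = [list(range(1000)) for _ in range(1000)]
    ctx = mp.get_context("fork")          # forked children inherit big_data; not available on Windows
    with ctx.Pool(mp.cpu_count()) as p:
        results = p.map(worker, range(len(big_data)))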

1 Answer


You were using `multiprocessing.dummy.Pool`, which is a thread pool that exposes the same interface as a multiprocessing pool. That is good for I/O-bound tasks that release the GIL, but it has no advantage for CPU-bound tasks: the Python Global Interpreter Lock (GIL) ensures that only a single thread can execute bytecode at a time.

Whether multiprocessing speeds things up depends on the cost of sending data to and from the worker subprocesses versus the amount of work done on that data.
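As a minimal sketch of the fix discussed in the comments (the file name `workers.py` and the placeholder data are my own, for illustration): switch to `multiprocessing.Pool` and keep the worker function at module level in an importable file, so the child processes can unpickle references to it.

# workers.py -- illustrative module name; the worker must live at module level
# in an importable file so child processes can unpickle references to it
def process_board(elems):
    # some CPU-bound work on one (index, element) pair
    index, element = elems
    return index

# main.py
from multiprocessing import Pool, cpu_count
from workers import process_board

if __name__ == "__main__":            # needed with the spawn start method (the macOS default)
    some_array = list(range(1000))    # placeholder data, stands in for the real some_array
    with Pool(cpu_count()) as p:
        results = p.map(process_board, enumerate(some_array))

This only pays off when each task does enough work to outweigh the cost of pickling arguments and results between processes, which is why batching very short tasks (or moving the inner loop out of Python entirely, as with NumPy) can matter more than the pool itself.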

tdelaney
  • thank you... fun fact: using `mp.Pool` in the end used all the cores of my Mac, but I have some problems because every now and then the workers get stuck with no error... so I rewrote the full thing using numpy (and some for loops where needed)... it now uses 60% of the cores and it's 4x faster than using threads... I'm astonished by what C++ code can achieve – Alberto Sinigaglia Dec 24 '22 at 00:11
  • @AlbertoSinigaglia: Conversely, people who don't normally use Python may be astonished by how high the interpreter overhead is in the CPython interpreter (some other Python implementations are much faster), compared to languages where a `+` operation on integers in the source may compile or JIT to something like an integer `add` instruction that runs directly on the CPU when the program executes. And languages with a thread model that allows parallel computation. Python has some nice features, like arbitrary-precision integers, but the run-time cost is *very* high. – Peter Cordes Dec 24 '22 at 04:05
  • @AlbertoSinigaglia: e.g. [Why are bitwise operators slower than multiplication/division/modulo?](https://stackoverflow.com/q/54047100) shows that Python's interpreter overhead is way higher than the difference in cost between CPU `div` vs. shift or `and` instructions, so special-case handling of small numbers makes it actually faster to do division. Which is just a total joke if you know anything about performance. So yeah, NumPy is a good compromise, getting the heavy lifting done outside of pure Python, and can work well when the available functions do what you want. – Peter Cordes Dec 24 '22 at 04:07
  • @PeterCordes love to learn new stuff; as you seem really well studied in this area, I would like to ask for a suggestion: I'm coding a reinforcement learning agent, where I need an environment to simulate what happens (thus the question, as you need multiple scenarios at the same time). From what I can see from your answer, it would be better to create the env in C++ and then create an adapter to import it from Python... however I've never tried this, do you have any suggestion/resource that you think would be a good starting point to implement this? – Alberto Sinigaglia Dec 24 '22 at 11:31
  • @AlbertoSinigaglia: I've never done much with Python, certainly not writing a library designed to be called from it. I mostly just worry about making the C/C++/assembly parts run fast :) – Peter Cordes Dec 24 '22 at 11:39
  • @PeterCordes then thank you for your service ahaha – Alberto Sinigaglia Dec 24 '22 at 12:30