
With Python's multiprocessing, would it make sense to have a Pool with a bunch of ThreadPools within it? Say I have something like:

def task(path):
    # i/o bound
    image = load(path)
    # cpu bound, but only takes about 1/10 of the time of the i/o bound parts
    image = preprocess(image)
    # i/o bound
    save(image, path)

Then I'd want to process a list of paths path_list. If I use ThreadPool I still end up hitting a ceiling because of the cpu bound bit. If I use a Pool I spend too much dead time waiting for i/o. So wouldn't it be best to split path_list over multiple processes that each in turn use multiple threads?

Another shorter way of restating my example is what if I have a method that should be multithreaded because it's i/o bound but I also want to make use of many cpu cores? If I use a Pool I'm using each core up for a single task which is i/o bound. If I use a ThreadPool I only get to use one core.
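
In code, what I'm imagining is roughly this (just a sketch; the worker counts are made up and the chunking is naive):

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def process_chunk(chunk):
    # each worker process runs its own thread pool over its share of the paths
    with ThreadPool(8) as tp:
        tp.map(task, chunk)

if __name__ == '__main__':
    n_procs = 4
    chunks = [path_list[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        pool.map(process_chunk, chunks)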

Alexander Soare
    I don't have time to write out a full answer just now, but structure your code so that the CPU bound work is scheduled in a `ProcessPoolExecutor`, and the IO bound work is scheduled in a `ThreadPoolExecutor`. Yes it does make sense to use both in some cases. – wim Mar 18 '21 at 19:57
  • Thanks, that already helps with the "you don't know what you don't know" bit – Alexander Soare Mar 18 '21 at 20:06
  • Another thought to add to the mix is that if you do that, you will end up passing data back and forth between pools. Queues are not miracle performers and you might hit another bottleneck there. Pools implement data transfers as queues and if your files are large (image indicates they might be), you will end up passing the image data through several queues and you would be better off by just choosing one - threads or processes but not passing data between both. – Hannu Mar 18 '21 at 20:17
  • @Hannu thanks for that. To clarify about my particular example snippet. There's no communication between processes there right? `path` is a string. The function loads from `path`, does stuff and saves back to it. There's no shared object that needs to be accessed or edited. – Alexander Soare Mar 18 '21 at 20:25
  • I don't think I'd choose putting thread-pools into process-pool-workers in any case. IIRC any kind of stdlib-pool already employs three threads just for managing the pool itself. Your idea would hence mean adding much more overhead and overall complexity (think about debugging). For your case, I'd probably just increase the number of pool-workers beyond the number of cores. See this example [here](https://stackoverflow.com/q/53615394/9059420). – Darkonaut Mar 19 '21 at 01:54
  • @Darkonaut I think you've nailed it for me. I thought that `Pool(n)` meant "Do n processes with 1 thread each (and if you have less than n cores, clip n to that number)". But what you're saying changes my thoughts. Now `ThreadPool(n)` means: "do n threads but stick to one process", and `Pool(n)` means: "do n threads but do it on as many processes as you can". Is that more or less it? – Alexander Soare Mar 19 '21 at 09:53
  • Less it. "`Pool(n)` meant: Do n processes with 1 thread each" - That's exactly what it means, but there's no clipping to number of cores. `Pool(processes)` will just start the number of processes you specify. This doesn't mean all started processes will end up being scheduled again by your OS for running a task, though (for long enough tasks it will happen). – Darkonaut Mar 19 '21 at 12:35
  • I'm not sure where exactly your problem in understanding comes from. Maybe this helps: A thread currently executing on a core, halts executing and gets descheduled, either because it doesn't have anything more to do (e.g. waiting for i/o) or because its time slice expires and the scheduler preempts it. – Darkonaut Mar 19 '21 at 12:36
  • @Darkonaut Ok, so where am I wrong: 1) `Pool(n)` starts n processes (as you say). 2) Python uses up 1 core per process. Therefore it can only run m processes at once where m is the number of cores. 3) Therefore, if I write `Pool(n)`, and n>m... now I'm stuck. Is it (a) Python ignores n and treats it as m? (you said no) (b) Python is really starting n threads and distributing them over m processes (what I thought, but you said no) (c) It's possible to start n process even with m cores, and I just need to understand what a process really is (d) other – Alexander Soare Mar 19 '21 at 14:34
  • Python doesn't manage thread scheduling on cores itself. All it does, is starting OS-processes and every process has at least one thread, which is the actual entity of execution being scheduled to run on a core. The latter is completely an OS responsibility. Every runnable thread created anywhere in your system competes with any other thread for run-time and the OS-scheduler employs some policy to ensure fairness. While a thread is waiting for i/o it isn't runnable and hence doesn't "use up a core". Probably that's where your understanding problem lies. – Darkonaut Mar 19 '21 at 15:27
  • (c) is right. (b) A process is the execution-context for one or multiple threads, not a unit of work from the job you send into `Pool`, neither is a thread. So a thread doesn't get "distributed over m processes", your work (tasks of the job you send into the pool) is. – Darkonaut Mar 19 '21 at 15:28
  • @Darkonaut right I think I'm there! Pls let me know what's wrong here if anything. 1) So if I have 4 cores and do `Pool(8)`, we start 8 processes. 2) Each of those processes would have 1 thread dedicated to executing my code. 3) Roughly speaking, if those processes are <12.5% CPU bound, it's almost as if I did `ThreadPool(8)` (apart from the overhead). 4) If those processes are more than 12.5% CPU bound and we neglect overhead, then probably `Pool(8)` is better. 5) `ThreadPool(8)` lets us split a task into 8 threads on one process, which can only use up max 1 core – Alexander Soare Mar 19 '21 at 18:13
  • Can't edit my last comment anymore, but I thought about it and realised for 3) to be true I need to start with the premise that I have 1 core only – Alexander Soare Mar 19 '21 at 18:24
  • 1-4 sounds right to me together with the overhead remark. To 5): Not quite, since the chunking algorithm works differently (read up [here](https://stackoverflow.com/q/53751050/9059420)), but yes, the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock) in CPython limits execution of Python-bytecode _within_ a process to one thread at a time; that's why we need processes for parallelism. The only thing you can do in parallel with multithreading alone in CPython is waiting, since this doesn't need a core to run. – Darkonaut Mar 19 '21 at 18:42
  • @Darkonaut good enough for now. I wish I could buy you a beer or coffee! If you care about the bounty/points happy to accept an answer which says I could achieve my end goal just using `Pool`. – Alexander Soare Mar 19 '21 at 18:46
  • Thanks, but for my taste this would be a bit too thin for an answer and I just wanted to nudge you in the right direction anyway. Still, the much broader general question you asked in the title here is certainly worth elaborating on, so I'd suggest you leave the bounty open for somebody interested in answering it. – Darkonaut Mar 19 '21 at 19:08

3 Answers


Does it make sense

Yes. Let's say you start with one process and one thread. Because some parts of the code block on IO, the process will utilize less than 100% of a CPU core - so we start adding threads. As long as adding threads keeps increasing task throughput, we haven't saturated anything yet, so we keep adding them. At some point we might hit 100% CPU utilization in our process: because of the GIL, a pure python process can utilize at most 100% of a single core. But, as far as we know, the CPU might still be our bottleneck, and the only way to gain more CPU time is to create another process (or use subinterpreters, but let's ignore that for now).

In summary, this is a valid approach for increasing the throughput of pure-python tasks that both utilize CPU and block on IO. But it does not mean that it is a good approach in your case. First, your bottleneck might be the disk and not the CPU, in which case you don't need more CPU time, which means you don't need more processes. Second, even if the CPU is the bottleneck, multithreading within multiprocessing is not necessarily the simplest solution, the most performant solution, or the best one by other resource metrics such as memory usage.

For example, if simplicity is your top priority, you could get all the CPU time you need just by using processes. This solution is easier to implement, but is heavy in terms of memory usage. Or, for example, if your goal is to achieve maximal performance and minimal memory utilization, then you probably want to replace the threads with an IO loop and use a process pool executor for your CPU-bound tasks. Squeezing maximal performance from your hardware is not an easy task. Below is a methodology that I feel has served me well.
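
For illustration, a minimal sketch of that second shape, where an event loop coordinates the work and a ProcessPoolExecutor runs the CPU-bound step (load, preprocess, save and path_list are the names from the question; since plain file IO has no native asyncio interface, the reads and writes below still go through the loop's default thread executor, and the semaphore limit is arbitrary):

import asyncio
from concurrent.futures import ProcessPoolExecutor

async def handle(path, cpu_pool, sem):
    loop = asyncio.get_running_loop()
    async with sem:
        # blocking file IO runs in the loop's default thread executor
        image = await loop.run_in_executor(None, load, path)
        # the CPU-bound step is offloaded to a separate process
        image = await loop.run_in_executor(cpu_pool, preprocess, image)
        await loop.run_in_executor(None, save, image, path)

async def main(path_list):
    sem = asyncio.Semaphore(32)  # cap the number of in-flight tasks
    with ProcessPoolExecutor() as cpu_pool:
        await asyncio.gather(*(handle(p, cpu_pool, sem) for p in path_list))

if __name__ == '__main__':
    asyncio.run(main(path_list))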

Aiming towards maximal performance

From now on, I'm assuming your goal is to make maximal use of your hardware in order to achieve a maximal throughput of "tasks". In that case, the final solution depends on your hardware, so you'll need to get to know it a little better. To try to reach your performance goals, I recommend that you:

  1. Understand your hardware utilization
  2. Identify the bottleneck and estimate the maximal throughput
  3. Design a solution to achieve that throughput
  4. Implement the design, and optimize until you meet your requirements

In detail:

1. Understand your hardware utilization

In this case, there are a few pieces of hardware involved:

  • The RAM
  • The disk
  • The CPU

Let's look at one "task" and note how it uses the hardware:

  1. Disk (read)
  2. RAM (write)
  3. CPU time
  4. RAM (read)
  5. Disk (write)

2. Identify the bottleneck and estimate the maximal throughput

To identify the bottleneck, let us calculate the maximum throughput of tasks that each hardware component can provide, assuming usage of them can be completely parallelized. I like to do that using python (note that I'm using made-up constants; you'll have to fill in the real numbers for your setup in order to use it):

# ----------- General consts
input_image_size = 20 * 2 ** 20  # 20MB
output_image_size = 15 * 2 ** 20  # 15MB

# ----------- Disk
# If you have multiple disks and disk access is the bottleneck, you could split the images between them
amount_of_disks = 2
disk_read_rate = 3.5 * 2 ** 30  # 3.5GBps, maximum read rate for a good SSD
disk_write_rate = 2.5 * 2 ** 30  # 2.5GBps, maximum write rate for a good SSD
disk_read_throughput = amount_of_disks * disk_read_rate / input_image_size 
disk_write_throughput = amount_of_disks * disk_write_rate / output_image_size

# ----------- RAM
ram_bandwidth = 30 * 2 ** 30  # Assuming here similar write and read rates of 30GBps
# assuming you are working in userspace and not using a userspace filesystem,
# data is first read into kernel space, then copied to userspace. So in total,
# two writes and one read.
userspace_ram_bandwidth = ram_bandwidth / 3
ram_read_throughput = userspace_ram_bandwidth / input_image_size 
ram_write_throughput = userspace_ram_bandwidth / output_image_size

# ----------- CPU
# We subtract one core, as at least some scheduling code and kernel code is going to run
core_amount = 8 - 1
# The measured amount of times a single core can run the preprocess function in a second.
# Assuming that you are not planning to optimize the preprocess function as well.
preprocess_function_rate = 1000
cpu_throughput = core_amount * preprocess_function_rate

# ----------- Conclusions
min_throughput, bottleneck_name = min([(disk_read_throughput, 'Disk read'),
                                       (disk_write_throughput, 'Disk write'),
                                       (ram_read_throughput, 'RAM read'),
                                       (ram_write_throughput, 'RAM write'),
                                       (cpu_throughput, 'CPU')])
cpu_cores_needed = min_throughput / preprocess_function_rate
print(f'Throughput: {min_throughput:.1f} tasks per second\n'
      f'Bottleneck: {bottleneck_name}\n'
      f'Worker amount: {cpu_cores_needed:.1f}')

This code outputs:

Throughput: 341.3 tasks per second
Bottleneck: Disk write
Worker amount: 0.3

That means:

  • The maximum rate we can achieve is around 341.3 tasks per second
  • The disk is the bottleneck. You might be able to increase your performance by, for example:
    • Buying more disks
    • Using ramfs or a similar solution that avoids using the disk altogether
  • In a system where all the steps of a task are executed in parallel, you won't need to dedicate more than one core to running preprocess. (In python that means you'll probably need only one process, and threads or asyncio would suffice to achieve concurrency with the other steps)

Note: the numbers are lying

This kind of estimation is very hard to get right. It's hard not to forget things in the calculation itself, and hard to achieve good measurements for the constants. For example, there is a big issue with the current calculation - reads and writes are not orthogonal. We assume in our calculation that everything is happening in parallel, so constants like disk_read_rate have to account for writes occurring simultaneously to the reads. The RAM rates should probably be decreased by at least 50%.
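
For example, a crude way to fold that into the script above (the 0.5 factor is a guess, not a measurement):

# assume concurrent reads and writes roughly halve the effective RAM bandwidth
userspace_ram_bandwidth = (ram_bandwidth * 0.5) / 3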

3. Design a solution to achieve that throughput

Similar to what you proposed in your question, my initial design would be something like this:

  • Have a pool of workers load the images and send them on a queue to the next step (we'll need to be reading using multiple cores to use all available memory bandwidth)
  • Have a pool of workers process the images and send the results on a queue (the number of workers should be chosen according to the output of the script above; for the current result, that number is 1)
  • Have a pool of workers save the processed images to the disk.

The actual implementation details will vary according to different technical constraints and overheads you will run into while implementing the solution. Without further details and measurements it is hard to guess what they will be exactly.
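
As a rough illustration, here is a sketch of that pipeline with one process per stage and multiprocessing queues between them (worker counts, queue sizes and the shutdown scheme are placeholders, not measured choices; load, preprocess and save are the functions from the question):

from multiprocessing import Process, Queue

def loader(paths, out_q):
    for path in paths:
        out_q.put((path, load(path)))
    out_q.put(None)  # signal the next stage that we're done

def preprocessor(in_q, out_q):
    while (item := in_q.get()) is not None:
        path, image = item
        out_q.put((path, preprocess(image)))
    out_q.put(None)

def saver(in_q):
    while (item := in_q.get()) is not None:
        path, image = item
        save(image, path)

if __name__ == '__main__':
    load_q, save_q = Queue(maxsize=16), Queue(maxsize=16)
    stages = [Process(target=loader, args=(path_list, load_q)),
              Process(target=preprocessor, args=(load_q, save_q)),
              Process(target=saver, args=(save_q,))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()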

4. Implement the design, and optimize until you meet your requirements

Good luck, and be warned that even if you did a good job at estimating the maximal throughput, it might be very hard to get there. Comparing the maximum rate to your speed requirements might give you a good idea of the amount of effort needed. For example, if the rate you need is 10x slower than the maximum rate, you might be done pretty quickly. But if it is only 2x slower, you might want to consider doubling your hardware and start preparing for some hard work :)

kmaork
  • Thanks for your response. The reason I haven't accepted it is because it seems to answer the question "How should I find bottlenecks and approach optimisation with multiprocessing?" I'm really just asking about one concept. Darkonaut's comment has been the most useful so far, even though it technically doesn't answer my question as it's worded in the title – Alexander Soare Mar 24 '21 at 08:19
  • I have updated my post to contain a more direct answer to your question. – kmaork Mar 26 '21 at 13:38
  • @AlexanderSoare, a bit late, but I decided to revisit my post to both improve the quality of my answer and address your question more directly. I'd appreciate it if you could read the updated version and give me your feedback :) – kmaork Dec 18 '21 at 00:10
  • Thanks for the update. It's hard to compare as I originally read this a while ago now. But I still think this is a really helpful reference. Overall, I never really need to follow a very rigorous process as I mostly make scripts to get from A to B, but your answer gives me the option to revisit and pick the parts that matter to me for a given task – Alexander Soare Dec 18 '21 at 11:08

kmaork's answer is a good technical one. But I would also keep in mind the saying "Premature optimization is the root of all evil".

In short: yes, it makes sense. But do you really need to?

Optimization is a trade-off: you give up code simplicity for better performance. Code simplicity is important; you'll need to further develop, debug, and test your software in the future, and that costs time. Simplicity buys you time. Be aware of this trade-off when you optimize.

I would first write a multithreaded version and measure it using your hardware. Then I would try the multiprocessing version, and measure it too.
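
For example, a quick way to compare the two on your own machine (a sketch; task and path_list come from the question, and the pool sizes are arbitrary):

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def timed(label, pool_cls, n_workers):
    # run the whole job with the given pool type and print the wall time
    start = time.perf_counter()
    with pool_cls(n_workers) as pool:
        pool.map(task, path_list)
    print(f'{label}: {time.perf_counter() - start:.2f}s')

if __name__ == '__main__':
    timed('threads', ThreadPool, 16)
    timed('processes', Pool, 8)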

Is either of those versions good enough? It might be. If so, you've just made your software simpler, more readable, and more maintainable.

Chen A.
  • Well, really Darkonaut's comment was the most useful for me as it gave me precisely what I was looking for, even though it's not an exact answer to the exact question in my title. kmaork's answer was more general than I needed and didn't really answer my question. Yours is similar. Both of these answers are trying to advise me on a general approach towards optimisation with multiprocessing. I just wanted to know about a basic concept. – Alexander Soare Mar 24 '21 at 08:16

Chen's and kmaork's answers cover most of what you need to know, but there are 2 missing ideas:

  • Your code will be A process and not THE process; this means that you need to account for how many resources you have left, not how many the machine has (it can even happen within your own process, since threads are not unlimited). This deadly problem happened to me, leaving me with less than half of a Celeron for a gui, not good.
  • The biggest optimization you can do with threads is "prediction" (more specifically, knowing when stuff happens): you can chain the threads in a better way when you know how long things take to compute and the waits are consistent. Reading about the TCP window may give you a better idea of how a delay can be optimized by expecting it rather than forcing it.
SrPanda