
I have a big text file that needs to be processed. I first read all the text into a list and then use a ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel() and get_relations().

I am on a Mac, and my observations show that multithreading doesn't really speed up the processing (the CPU has 8 cores, but only about 15% of the CPU is used). If there is a performance bottleneck in either is_channel() or get_relations(), then multithreading won't help much. Is that the reason there is no performance gain? Should I try multiprocessing instead of multithreading to speed things up?

import itertools
from concurrent.futures import ThreadPoolExecutor

# Shared counters updated by the worker threads; `channel` is assumed to be defined elsewhere.
channel_text = 0
non_channel_text = 0

def process_file(file_name):
    # Read the whole file into memory, one stripped line per list entry.
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)

    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)

    for index, entities_relations_list in enumerate(all_results):
        pass  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text

    is_right_channel = is_channel(text, channel)

    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
marlon
  • Assuming your Python program is CPU-bound, you are likely seeing the result of the CPython GIL: https://stackoverflow.com/questions/1294382/what-is-the-global-interpreter-lock-gil-in-cpython ... in which case switching to multiprocessing might help. If your program isn't CPU-bound, then it's presumably I/O-bound, in which case performance is limited by the speed of your hard drive, so multiple processes probably won't help. – Jeremy Friesner Apr 30 '21 at 22:53
  • @Prune The answer there doesn't explain why there's a performance difference, it just describes the difference (threads share the same memory, processes don't). – Barmar Apr 30 '21 at 22:54
  • How do I judge whether it's CPU- or IO-bound in my case? – marlon Apr 30 '21 at 22:55
  • @JeremyFriesner The only I/O happens before it starts the threads. – Barmar Apr 30 '21 at 22:55
  • Unrelated to the question: Why are you using `enumerate()` when you don't use `index`? – Barmar Apr 30 '21 at 22:56
  • My interpretation is that OP needed to understand the difference first, to evaluate the design. @marlon, let me know whether I'm wrong here. The post *should* include a fully functional example, including time metrics, from all of your attempts: linear, multi-threading, and multi-processing. *Then* you have a great focus to ask about whatever speed differences you don't understand. – Prune Apr 30 '21 at 23:02
  • You need to use a mutex to mediate access to `channel_text` and `non_channel_text`. – Barmar Apr 30 '21 at 23:02
  • @Prune He's using the shared variables `channel_text` and `non_channel_text`. Multi-processing will make that more complicated, requiring inter-process communication with the main process. – Barmar Apr 30 '21 at 23:03
  • @Barmar These two variables are just 'int'. Do they need protection? – marlon Apr 30 '21 at 23:04
  • You're processing just one line at a time in each thread, so there's probably lots of overhead in starting/stopping the threads. You should try to chunk them. – Barmar Apr 30 '21 at 23:04
  • @Prune I thought this is a typical use case, where you process a big file with a function and you want it to be faster. I will try to come up with an example. – marlon Apr 30 '21 at 23:05
  • @marlon Yes. Incrementing variables is not atomic. See https://julien.danjou.info/atomic-lock-free-counters-in-python/ – Barmar Apr 30 '21 at 23:05
  • @Barmar what do you mean by chunk? Could you write a solution based on my code? It doesn't have to be fully working. – marlon Apr 30 '21 at 23:07
  • Thanks; the example can be simply those three implementations: linear, multi-thread, and multi-process. If you can include some execution metrics, so much the better -- we'd like to focus on *your* platform, rather than ours. – Prune Apr 30 '21 at 23:07
  • See https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks for how to split a list into chunks. You can then process each chunk in a thread. – Barmar Apr 30 '21 at 23:09
  • Since there is not much IO except for the quick reading-in of the lines, this should be CPU bound. So I should choose multiprocessing instead of multithreading. Is that right? – marlon Apr 30 '21 at 23:15
  • Reopened as the linked questions didn't adequately explain the problem as it pertains to this question, and specifically CPython. I think https://stackoverflow.com/questions/10789042/python-multi-threading-slower-than-serial is a better duplicate candidate. However the answer is old and predates `concurrent.futures` (I think). – Dunes Apr 30 '21 at 23:52

1 Answer


The first thing that should be done is finding out how much time it takes to:

  1. Read the file in memory (T1)
  2. Do all processing (T2)
  3. Printing result (T3)

The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to the terminal but instead piping the output to a file or something else.

Based on the timings, we'll get to know:

  • T1 >> T2 => IO bound
  • T2 >> T1 => CPU bound
  • T1 and T2 are close => neither

(By x >> y I mean x is significantly greater than y.)

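To get these timings, here is one minimal sketch of a serial run, assuming process_text() and channel from the question; time.perf_counter() gives wall-clock times:

import time

def timed_process_file(file_name):
    t0 = time.perf_counter()

    # T1: read the file into memory
    with open(file_name, 'r', encoding='utf8') as f:
        all_lines = [line.strip() for line in f]
    t1 = time.perf_counter()

    # T2: do all the processing (serially, just for measurement)
    all_results = [process_text(text, channel) for text in all_lines]
    t2 = time.perf_counter()

    # T3: print/write out the results
    for result in all_results:
        print(result)
    t3 = time.perf_counter()

    print(f"T1={t1 - t0:.2f}s  T2={t2 - t1:.2f}s  T3={t3 - t2:.2f}s")
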
Based on the above and the file size, you can try a few approaches:

Threading based

Even this can be done in two ways; which one works faster can be found out by benchmarking/looking at the timings again.

Approach-1 (T1 >> T2 or even when T1 and T2 are similar)

  • Run the file-reading code itself in a thread and let it push the lines to a queue instead of a list.
    This thread inserts a None at the end when it is done reading the file. This is important for telling the workers that they can stop.
  • Now run the processing workers and pass them the queue.
  • The workers keep reading from the queue in a loop and processing the lines. Similar to the reader thread, these workers put their results in a second queue. Once a worker encounters the None, it stops its loop and re-inserts the None into the queue (so that the other workers can stop themselves too).
  • The printing part can again be done in a thread.

The above is an example of a single producer and multiple consumer threads; a sketch follows below.
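
A possible sketch of this layout, assuming process_text() and channel from the question (the printing thread here simply counts worker sentinels):

import queue
import threading

def reader(file_name, line_q):
    # Producer: push each stripped line to the queue; a single None marks the end.
    with open(file_name, 'r', encoding='utf8') as f:
        for line in f:
            line_q.put(line.strip())
    line_q.put(None)

def worker(line_q, result_q):
    # Consumer: process lines until the None sentinel appears, then put it back
    # so the remaining workers can stop themselves too.
    while True:
        text = line_q.get()
        if text is None:
            line_q.put(None)
            break
        result_q.put(process_text(text, channel))  # process_text/channel as in the question
    result_q.put(None)  # tell the printer this worker is done

def printer(result_q, n_workers):
    # Print results until every worker has signalled completion.
    done = 0
    while done < n_workers:
        result = result_q.get()
        if result is None:
            done += 1
        else:
            print(result)

def run(file_name, n_workers=8):
    line_q, result_q = queue.Queue(maxsize=1000), queue.Queue()
    threads = [threading.Thread(target=reader, args=(file_name, line_q)),
               threading.Thread(target=printer, args=(result_q, n_workers))]
    threads += [threading.Thread(target=worker, args=(line_q, result_q)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()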

Approach-2 (this is just another way of doing what the code snippet in the question already does)

  • Read the entire file into a list.
  • Divide the list into index ranges based on the number of threads.
    Example: if the file has 100 lines in total and we use 10 threads,
    then 0-9, 10-19, ..., 90-99 are the index ranges.
  • Pass the complete list and these index ranges to the threads so each one processes its own slice. Since you are not modifying the original list, this works.
    This approach can give better results than running the worker for each individual line. A sketch follows below.
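
One way to sketch this, again assuming process_text() and channel from the question:

from concurrent.futures import ThreadPoolExecutor

def process_range(all_lines, start, end):
    # Each thread processes only its own slice of the shared (read-only) list.
    return [process_text(text, channel) for text in all_lines[start:end]]

def process_in_chunks(all_lines, n_threads=10):
    chunk = max(1, (len(all_lines) + n_threads - 1) // n_threads)  # ceiling division
    ranges = [(start, min(start + chunk, len(all_lines)))
              for start in range(0, len(all_lines), chunk)]

    all_results = []
    with ThreadPoolExecutor(max_workers=n_threads) as executor:
        futures = [executor.submit(process_range, all_lines, start, end)
                   for start, end in ranges]
        for future in futures:                    # iterate in submission order,
            all_results.extend(future.result())   # so the original line order is kept
    return all_results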

Multiprocessing based

(CPU bound)

  • Split the file into multiple files before processing.
  • Run a new process for each file.
  • Each process gets the path of the file it should read and process.
  • This requires an additional step of combining all results/files at the end.
    The process creation part can be done from within Python using the multiprocessing module,
    or from a driver script (e.g. a shell script) that spawns a Python process for each file. A sketch of the multiprocessing variant follows below.
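
A minimal sketch of the multiprocessing variant, assuming the big file has already been split into smaller files (the part_*.txt names are placeholders) and that process_text() and channel are defined at module level so the worker processes can import them:

import multiprocessing

def process_one_file(path):
    # Each worker process reads and processes one of the split files on its own.
    results = []
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            results.append(process_text(line.strip(), channel))
    return results

def process_all(paths):
    # One process per file, capped by the number of CPU cores.
    with multiprocessing.Pool(processes=min(len(paths), multiprocessing.cpu_count())) as pool:
        per_file_results = pool.map(process_one_file, paths)
    # Combine all per-file results at the end.
    return [r for file_results in per_file_results for r in file_results]

if __name__ == '__main__':
    # The guard matters on macOS, where worker processes are spawned rather than forked.
    combined = process_all(['part_00.txt', 'part_01.txt', 'part_02.txt'])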

Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing here. I have used both approaches in practice:

  • Multiprocessing: when processing huge text files (GBs) stored on disk, like what you are doing.
  • Threading (Approach-1): when reading from multiple databases, since that is more IO bound than CPU bound (I used multiple producer and multiple consumer threads).
leangaurav