5

I have a machine with 24 cores and 2 threads per core. I'm trying to optimize the following code for parallel execution. However, I noticed that the code's performance starts to degrade after a certain number of threads.

import argparse
import glob
import h5py
import numpy as np
import pandas as pd
import xarray as xr
from tqdm import tqdm
import time
import datetime
from multiprocessing import Pool, cpu_count, Lock
import multiprocessing
import cProfile, pstats, io


def process_parcel_file(f, bands, mask):
    start_time = time.time()
    test = xr.open_dataset(f)
    print(f"Elapsed in process_parcel_file for reading dataset: {time.time() - start_time}")

    start_time = time.time()
    subset = test[bands + ['SCL']].copy()
    subset = subset.where(subset != 0, np.nan)
    if mask:
        subset = subset.where((subset.SCL >= 3) & (subset.SCL < 7))
    subset = subset[bands]

    # Adding a new dimension week_year and performing grouping
    subset['week_year'] = subset.time.dt.strftime('%Y-%U')
    subset = subset.groupby('week_year').mean().sortby('week_year')
    subset['id'] = test['id'].copy()

    # Store the dates and counting pixels for each parcel
    dates = subset.week_year.values
    n_pixels = test[['id', 'SCL']].groupby('id').count()['SCL'][:, 0].values.reshape(-1, 1)

    # Converting to dataframe
    grouped_sum = subset.groupby('id').sum()
    ids = grouped_sum.id.values
    grouped_sum = grouped_sum.to_array().values
    grouped_sum = np.swapaxes(grouped_sum, 0, 1)
    grouped_sum = grouped_sum.reshape((grouped_sum.shape[0], -1))
    colnames = ["{}_{}".format(b, str(x).split('T')[0]) for b in bands for x in dates] + ['count']
    values = np.hstack((grouped_sum, n_pixels))
    df = pd.DataFrame(values, columns=colnames)
    df.insert(0, 'id', ids)
    print(f"Elapsed in process_parcel_file til end: {time.time() - start_time}")
    return df


def fs_creation(input_dir, out_file, labels_to_keep=None, th=0.1, n=64, days=5, total_days=180, mask=False,
                mode='s2', method='patch', bands=['B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12']):
    files = glob.glob(input_dir)
    times_pool = []  # For storing execution times
    times_seq = []
    cpu_counts = list(range(2, multiprocessing.cpu_count() + 1, 4))  # The different CPU counts to use

    for count in cpu_counts:
        print(f"Executing with {count} threads")
        if method == 'parcel':
            start_pool = time.time()
            with Pool(count) as pool:
                arguments = [(f, bands, mask) for f in files]
                dfs = list(tqdm(pool.starmap(process_parcel_file, arguments), total=len(arguments)))


            end_pool = time.time()
            start_seq = time.time()
            dfs = pd.concat(dfs)
            dfs = dfs.groupby('id').sum()
            counts = dfs['count'].copy()
            dfs = dfs.div(dfs['count'], axis=0)
            dfs['count'] = counts
            dfs.drop(index=-1).to_csv(out_file)
            end_seq = time.time()
            times_pool.append(end_pool - start_pool)  
            times_seq.append(end_seq - start_seq)

    pd.DataFrame({'CPU_count': cpu_counts, 'Time pool': times_pool, 
                  'Time seq' : times_seq}).to_csv('cpu_times.csv', index=False)

    return 0

When executing the code, it scales well up to around 7-8 threads, but after that, the performance starts to deteriorate. I have profiled the code, and it seems that each thread takes more time to execute the same code.

For example, with 2 threads:

Elapsed in process_parcel_file for reading dataset: 0.012271404266357422
Elapsed in process_parcel_file til end: 1.6681673526763916
Elapsed in process_parcel_file for reading dataset: 0.014229536056518555
Elapsed in process_parcel_file til end: 1.5836331844329834

However, with 22 threads:

Elapsed in process_parcel_file for reading dataset: 0.17968058586120605
Elapsed in process_parcel_file til end: 12.049026727676392
Elapsed in process_parcel_file for reading dataset: 0.052398681640625
Elapsed in process_parcel_file til end: 6.014119625091553

I'm struggling to understand why the performance degrades with more threads. I've already verified that the system has the required number of cores and threads.

I would appreciate any guidance or suggestions to help me identify the cause of this issue and optimize the code for better performance.

It's really hard for me to provide a minimal working example, so please take that into account.

Thank you in advance.

Edit: The files are around 80MB each. I have 451 files. I added the following code to profile the function:

...
    start_time = time.time()
    mem_usage_start = memory_usage(-1, interval=0.1, timeout=None)[0]
    cpu_usage_start = psutil.cpu_percent(interval=None)
    test = xr.open_dataset(f)
    times['read_dataset'] = time.time() - start_time
    memory['read_dataset'] = memory_usage(-1, interval=0.1, timeout=None)[0] - mem_usage_start
    cpu_usage['read_dataset'] = psutil.cpu_percent(interval=None) - cpu_usage_start
...

And more code for each line in a similar fashion. I used the memory_profiler and psutil libraries, and I have the information for each thread. CSVs with the results are available here: https://wetransfer.com/downloads/44df14ea831da7693300a29d8e0d4e7a20230703173536/da04a0 The results identify each line in the function together with the number of CPUs selected, so each one corresponds to a thread.
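
For reference, a minimal sketch of that per-line profiling pattern, put together as a standalone function (the profile_read wrapper and the times/memory/cpu_usage dictionaries are just for illustration; memory_usage comes from memory_profiler):

import time

import psutil
import xarray as xr
from memory_profiler import memory_usage


def profile_read(f):
    # One entry per profiled block; only the dataset read is shown here.
    times, memory, cpu_usage = {}, {}, {}

    start_time = time.time()
    mem_usage_start = memory_usage(-1, interval=0.1, timeout=None)[0]
    cpu_usage_start = psutil.cpu_percent(interval=None)

    test = xr.open_dataset(f)

    times['read_dataset'] = time.time() - start_time
    memory['read_dataset'] = memory_usage(-1, interval=0.1, timeout=None)[0] - mem_usage_start
    cpu_usage['read_dataset'] = psutil.cpu_percent(interval=None) - cpu_usage_start

    return test, times, memory, cpu_usage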

Edit2:

Here I have a report of a subset of the data, where you can clearly see what each thread is doing, and how some threads are getting less work than others:

https://wetransfer.com/downloads/259b4e42aae6dd9cda5a22d576aba29520230717135248/ae3f88

  • Can you add more prints or other ways of profiling, to really see what takes so long suddenly. It is hard to say. Numpy uses threads internally, so you can easily overload your system with threads if not careful, but your code seems fine in this regard. Really add a print after like every two or so lines and just see. – Nopileos Jun 29 '23 at 09:22
  • @Nopileos Every line in the function seems to scale somewhat equally. I analyzed the median for each cpu_count configuration on the time taken by each thread, and everything seems to go up – Norhther Jun 30 '23 at 11:06
  • Disk I/O is always going to be a bottleneck. The disk is not infinitely fast. That's only a minor part of your processing, however. Are the processes competing for memory? Are you reading in huge datasets? How large is each process? – Tim Roberts Jun 30 '23 at 22:23
  • @TimRoberts check the edit – Norhther Jul 03 '23 at 17:41
  • So, you are I/O bound. In that case, you cannot run any faster than the disk. Adding more threads or processes will not help. Once the disk gets 100% busy, it can't go any faster. If it takes N seconds to read all of your data, then you can NEVER make your total processing take less than N seconds. It's impossible. – Tim Roberts Jul 03 '23 at 18:24
  • @TimRoberts where do you see the bound? I would understand that if the open_dataset function was taking more time, but it's not the case. Every other line is taking more time when adding more threads – Norhther Jul 03 '23 at 19:28
  • @TimRoberts Also, I tried loading all the datasets into memory before calling the Pool, and I had the same results, so I don't think I'm I/O bound – Norhther Jul 05 '23 at 07:39
  • Which operating system? Remember that each pool entry has to start and run a new process with a new memory space, and on Windows that means it has to run your script again from the start in every process, so it rereads the file. Python multiprocessing is not the panacea many think it is. – Tim Roberts Jul 05 '23 at 17:41
  • @TimRoberts Centos 8 – Norhther Jul 05 '23 at 18:40
  • I'm willing to bet it's something to do with the xarray package, do you need it? – grantr Jul 07 '23 at 12:32
  • @Norhther there's a lot going on in this example. Any way you can boil it down to something more minimal and still reproduce the issue? I think a lot of people would be off-put to answer this with the sheer amount of moving parts here. – flakes Jul 17 '23 at 23:39
  • The report from Edit2 does not represent the situation of your question. It seems to be just due to the difference in data given to the threads. The `process_file` calls `process_image_patch` 16 times, but the execution time per `process_image_patch` varies greatly depending on the data. According to your code comment, it depends on whether or not *the patch has enough non-zero labels*. If you count the number of data that meet this condition, you will find that it is proportional to the execution time of each thread. – ken Jul 22 '23 at 04:29

3 Answers

4

Reading from disk with many parallel processes is bad.

Not only is the disk ill-suited to multiple concurrent reads, but using multiple processes also incurs communication costs such as the need for serialization.

Perhaps it is better to have fewer threads reading the data, but benefit from parallel computing on already-read files (or even their chunks, if that's applicable).

See also this article on benchmarking various ways of parallelizing file loading.


Some practice: first, try just reading the files with multithreading. For 100 files of 500,000 lines each, I see the following on my machine:

import time
from glob import glob
from multiprocessing.pool import ThreadPool

import pandas as pd


def job(filename):
    # We only care about the cost of reading the file here.
    _ = pd.read_csv(filename)
    return 'OK'


tasks = glob('tasks/*csv')

for i in range(1, 20):
    start_time = time.time()
    with ThreadPool(i) as pool:
        outs = pool.map(job, tasks)
    print(f'Threads={i} finished in time {time.time() - start_time}')

Threads=1 finished in time 17.437994480133057
Threads=2 finished in time 9.725012063980103
Threads=3 finished in time 6.816091060638428
Threads=4 finished in time 5.413954973220825
Threads=5 finished in time 4.8136584758758545
Threads=6 finished in time 4.713972806930542
Threads=7 finished in time 4.915326356887817
Threads=8 finished in time 4.91536545753479
Threads=9 finished in time 4.6191277503967285
Threads=10 finished in time 4.731768369674683
Threads=11 finished in time 4.846496105194092
Threads=12 finished in time 5.274311542510986
Threads=13 finished in time 4.9633073806762695
Threads=14 finished in time 4.85762357711792
Threads=15 finished in time 4.9740986824035645
Threads=16 finished in time 5.085602760314941
Threads=17 finished in time 4.990613222122192
Threads=18 finished in time 4.979219436645508
Threads=19 finished in time 5.094106435775757
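
Not part of the original answer, but to illustrate the idea of keeping the computation fully parallel while limiting how many workers read from disk at the same time, here is a rough sketch using a semaphore shared through the pool initializer (the 'data/*.nc' pattern, the limit of 4 concurrent readers, and the process_file placeholder are assumptions to adapt to your own pipeline):

import multiprocessing as mp
from glob import glob

import xarray as xr

_read_gate = None  # set per worker by init_worker


def init_worker(gate):
    # Runs once in every worker process; stores the shared semaphore.
    global _read_gate
    _read_gate = gate


def process_file(path):
    with _read_gate:                # at most N workers hit the disk at once
        ds = xr.load_dataset(path)  # eager read; the file is closed afterwards
    # ... CPU-heavy work on the in-memory dataset runs fully in parallel ...
    return path


if __name__ == '__main__':
    files = glob('data/*.nc')
    gate = mp.Semaphore(4)  # how many concurrent readers the disk tolerates
    with mp.Pool(processes=mp.cpu_count(),
                 initializer=init_worker, initargs=(gate,)) as pool:
        results = pool.map(process_file, files)

Whether this helps depends on how much of the per-file time is actually spent on I/O versus computation.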
Maciej Skorski
  • Tried this first loading all the data into memory in the main thread and then launching the other threads with this memory. Did not improve whatsoever. – Norhther Jul 02 '23 at 01:09
  • @Norhther Would you mind executing this code - not mixing it with other code of yours - and tell what the numbers look like? – Maciej Skorski Jul 02 '23 at 06:13
  • Here you have the output: https://pastebin.com/Bemfat3S – Norhther Jul 03 '23 at 17:46
  • @Norhther so you see that threading is more stable, like I advised (you don't get 10x overheads any more). Now we need to find a way to parallelize computations. – Maciej Skorski Jul 07 '23 at 16:06
  • No, I thought you meant to run the code in your answer, not the original code. I ran my code loading all the data beforehand in the main thread and parallelizing the computations only and the results were similar, 10x overhead – Norhther Jul 08 '23 at 13:03
0

My first thought was also an I/O bottleneck. However, since you report similar behavior when you load everything into memory ahead of time, I'm going to answer under the assumption that there is no I/O bottleneck.

Many of the libraries you are using for the actual data processing will use multiple CPU cores behind the scenes. You observe that performance starts to degrade at around 7-8 threads. I suspect that the CPU is already saturated at that point, and beyond it thread contention starts to harm performance.

You can test this. Run with just 1 thread and monitor CPU utilization per core (e.g. on Linux, use top or an equivalent); see if more than one core is being utilized. Then run a series of additional tests with different numbers of threads. If my hypothesis is correct, once you reach about 7-8 threads all cores will be at 100% utilization.
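
Not part of the original answer, but as a quick way to run that check from Python instead of top, psutil can report per-core utilization (sampled here once per second; stop it with Ctrl+C):

import psutil

# Print per-core CPU utilization (in %) once per second while the pipeline runs.
while True:
    print(psutil.cpu_percent(interval=1, percpu=True))

If the hypothesis holds, a common mitigation is to cap the internal thread pools of the numerical libraries (e.g. by setting OMP_NUM_THREADS, OPENBLAS_NUM_THREADS and MKL_NUM_THREADS to 1 in each worker before numpy is imported) so that the process pool remains the only source of parallelism.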

0

TL;DR

Replace xarray.open_dataset with xarray.load_dataset to immediately load the data into memory; the former is quite lazy, while the latter is a bit more eager (a.k.a. greedy, strict).

Explanation:

At the risk of repeating previous comments/answers: this appears to be an I/O throughput limitation, where too many threads/processes attempt to read from the same device simultaneously, ironically slowing things down for all sibling threads/processes.

As for what's causing this, I believe it is the call to xarray's open_dataset (test = xr.open_dataset(f)), which, according to the docs, loads the data lazily:

Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation.

If this is true, it can explain the symptoms you're seeing: the 451 files are initially represented by "empty" objects (which takes an almost insignificant amount of time), and as these objects' data is then read (or manipulated) almost simultaneously, the storage device gets stormed by tens of thousands (or millions, depending on block size) of read requests.

There is a tip four paragraphs further down in the docs:

Xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a Dataset (or DataArray) entirely into memory by invoking the Dataset.load() method.

Alternatively, try using load_dataset.
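
As a minimal sketch of the suggested change inside process_parcel_file (everything else stays as in the question):

# Eagerly read the whole file into memory instead of opening it lazily:
test = xr.load_dataset(f)

# ...or keep open_dataset and force the load explicitly afterwards:
# test = xr.open_dataset(f).load()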

micromoses