
The following Python code measures the speedup obtained when increasing the number of processes. The task run in each process is multiplying a random matrix; the matrix size is also varied, and the corresponding elapsed time is measured.

Note that the processes do not share any objects and are completely independent, so I expected the performance curve over the number of processes to be almost the same for all matrix sizes. However, when plotting the results (see below), I found that this expectation is false. Specifically, when the matrix size becomes large (80, 160), performance hardly improves even though the number of processes increases. Note: the figure legend indicates the matrix sizes.

Could you explain why performance does not improve when the matrix size is large?

For your information, here is the spec of my CPU: https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x

Product Family: AMD Ryzen™ Processors
Product Line: AMD Ryzen™ 9 Desktop Processors
# of CPU Cores: 12
# of Threads: 24
Max. Boost Clock: Up to 4.6GHz
Base Clock: 3.8GHz
L1 Cache: 768KB
L2 Cache: 6MB
L3 Cache: 64MB
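
A quick back-of-envelope check of my own, relating the benchmark sizes to these cache sizes (a float64 matrix of size n takes n * n * 8 bytes):

# Per-matrix memory footprint for the benchmarked sizes (my own
# sanity check, not part of the original benchmark scripts).
for n in [5, 10, 20, 40, 80, 160]:
    print("{0}x{0}: {1:.1f} KB".format(n, n * n * 8 / 1024))
# 5x5: 0.2 KB ... 160x160: 200.0 KB. The largest matrices no longer fit
# in a core's L1 and occupy a sizable fraction of its share of L2/L3.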

[Figure: speedup vs. number of processes, one curve per matrix size (5, 10, 20, 40, 80, 160), as indicated in the legend]

main script

import multiprocessing
import os
import pickle
import time

import numpy as np


def split_number(n_total, n_split):
    # Split n_total work items as evenly as possible over n_split workers,
    # e.g. split_number(100, 3) -> [34, 33, 33].
    return [n_total // n_split + (1 if x < n_total % n_split else 0) for x in range(n_split)]


def task(args):
    n_iter, idx, matrix_size = args
    # Optionally pin this worker to a fixed pair of logical cores
    # (left disabled; see the affinity sketch after the script):
    #cores = "{},{}".format(2 * idx, 2 * idx + 1)
    #os.system("taskset -p -c {} {}".format(cores, os.getpid()))
    # Repeatedly square a fresh random matrix; this is the pure-CPU
    # workload whose scaling over processes is being measured.
    for _ in range(n_iter):
        A = np.random.randn(matrix_size, matrix_size)
        for _ in range(100):
            A = A.dot(A)


def measure_time(n_process: int, matrix_size: int) -> float:
    n_total = 100
    assign_list = split_number(n_total, n_process)
    # Use the pool as a context manager so its worker processes
    # are shut down after each measurement.
    with multiprocessing.Pool(n_process) as pool:
        ts = time.time()
        pool.map(task, zip(assign_list, range(n_process), [matrix_size] * n_process))
        elapsed = time.time() - ts
    return elapsed


if __name__ == "__main__":
    n_experiment_sample = 5
    n_logical = os.cpu_count()
    n_physical = int(0.5 * n_logical)  # assumes 2 logical threads per physical core (SMT)
    result = {}
    for mat_size in [5, 10, 20, 40, 80, 160]:
        subresult = {}
        result[mat_size] = subresult
        for n_process in range(1, n_physical + 1):
            elapsed = np.mean([measure_time(n_process, mat_size) for _ in range(n_experiment_sample)])
            subresult[n_process] = elapsed
            print("{}, {}, {}".format(mat_size, n_process, elapsed))
    with open("result.pkl", "wb") as f:
        pickle.dump(result, f)
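
The taskset lines commented out in task were an attempt to pin each worker to a pair of logical cores. A sketch of the same idea without shelling out (Linux-only; os.sched_setaffinity is not available on Windows or macOS):

import os

def pin_worker(idx):
    # Restrict the calling process to logical cores 2*idx and 2*idx + 1,
    # i.e. one physical core plus its SMT sibling on this 12-core/24-thread CPU.
    os.sched_setaffinity(0, {2 * idx, 2 * idx + 1})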

plot script

import numpy as np
import matplotlib.pyplot as plt
import pickle
with open("result.pkl", "rb") as f:
    result = pickle.load(f)

fig, ax = plt.subplots()

for matrix_size in result.keys():
    subresult = result[matrix_size]
    n_process_list = list(subresult.keys())
    elapsed_time_list = np.array(list(subresult.values()))
    speedups = elapsed_time_list[0] / elapsed_time_list
    ax.plot(n_process_list, speedups, label=matrix_size)

ax.set_xlabel("number of processes")
ax.set_ylabel("speedup compared to single process")

ax.legend(loc="upper left", borderaxespad=0, fontsize=10, framealpha=1.0)
plt.show()
– orematasaburo
  • NumPy can use accelerated BLAS libraries for its matrix multiplications; specifically, it is usually compiled against OpenBLAS or MKL, both of which are internally parallelized. So once the matrices are sufficiently large, you cannot expect a speedup, since you are just putting another layer of parallelization on top of the existing one. If anything, you will make it worse (see the first sketch below the comments for checking which BLAS backend is in use). – Homer512 Nov 17 '22 at 09:33
  • Note that although this is indeed an issue for large matrices, an 80x80 matrix is not that large, so using multiple threads for such a matrix should not be very efficient anyway. For OpenBLAS, which is the default implementation on most platforms, the threshold is a bit higher (see [here](https://stackoverflow.com/questions/72663092)). You can configure the number of threads used by the underlying BLAS (the variable depends on the target platform): for example `OPENBLAS_NUM_THREADS` for OpenBLAS, `OMP_NUM_THREADS` for OpenMP-based BLAS. If setting it to 1 solves the problem, then it is the root cause (see the second sketch below the comments). – Jérôme Richard Nov 17 '22 at 13:34
  • Might be related: [python - Multiprocessing.Pool makes Numpy matrix multiplication slower - Stack Overflow](https://stackoverflow.com/questions/15414027/multiprocessing-pool-makes-numpy-matrix-multiplication-slower) (note that when I run the example code on my machine, only 1 CPU core is active when N=1, so the BLAS explanation above might not be true...? On the other hand, I do see a 2x speedup even with large matrix sizes) – user202729 Nov 18 '22 at 20:26
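
Following up on the first comment, a minimal sketch for checking which BLAS backend a NumPy build uses (threadpoolctl is a separate package; everything else is plain NumPy):

import numpy as np

# Print the build configuration, including which BLAS/LAPACK
# libraries (OpenBLAS, MKL, ...) this NumPy was compiled against.
np.show_config()

# threadpoolctl additionally reports the thread pools active at runtime,
# including how many threads the BLAS library will use per call.
from threadpoolctl import threadpool_info
for info in threadpool_info():
    print(info["internal_api"], info["num_threads"])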
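
And a sketch of the experiment suggested in the second comment: limit the BLAS to one thread per process, either via environment variables set before NumPy is imported (they are inherited by the worker processes), or at runtime with threadpoolctl. If the 80/160 curves then scale like the small ones, nested BLAS parallelism is the root cause:

import os
# Must be set before numpy is imported; cover the common backends.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

# Runtime alternative that works after numpy is already imported.
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=1):
    A = np.random.randn(160, 160)
    for _ in range(100):
        A = A.dot(A)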

0 Answers