
I have a number of very large arrays, AFeatures, that I am comparing against other very large arrays, BFeatures, using the Euclidean distance; both have shape (878, 2, 4, 15, 17, 512). I am trying to parallelise this process to speed up the comparison. I am using Python 3 in a Conda environment, and my original code uses an average of two CPU cores at 100%:

    import numpy as np

    # AFeatures and BFeatures both have shape (878, 2, 4, 15, 17, 512).
    per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))

    for i in range(878):
        for j in range(878):
            for k in range(2):
                for l in range(4):
                    per_slice_comparisons[i, j, k, l] = np.linalg.norm(
                        AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

I have tried two approaches for speeding up the code.

  1. Using multi-processing

    from multiprocessing import Pool

    def fill_array(i):
        # Compare AFeatures[i] against every BFeatures[j].
        comparisons = np.zeros(shape=(878, 2, 4))

        for j in range(878):
            for k in range(2):
                for l in range(4):
                    comparisons[j, k, l] = np.linalg.norm(
                        AFeatures[i, k, l, :] - BFeatures[j, k, l, :])

        return comparisons

    pool = Pool(processes=6)
    list_start_vals = range(878)
    per_slice_comparisons = np.array(pool.map(fill_array, list_start_vals))
    pool.close()
    

This approach increases run time by around 5%, even though all 8 CPU cores are now being used at 100%. I have tried a number of different process counts: the more processes there are, the slower it gets.

  2. A slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

     import os
     os.environ['NUMEXPR_MAX_THREADS'] = '8'
     os.environ['NUMEXPR_NUM_THREADS'] = '4'
     import numexpr as ne

     def linalg_norm(a):
         # Evaluate the squared norm and its square root with numexpr.
         sq_norm = ne.evaluate('sum(a**2)')
         return ne.evaluate('sqrt(sq_norm)')

     per_slice_comparisons = np.zeros(shape=(878, 878, 2, 4))
     for i in range(878):
         for j in range(878):
             for k in range(2):
                 for l in range(4):
                     per_slice_comparisons[i, j, k, l] = linalg_norm(
                         AFeatures[i, k, l, :] - BFeatures[j, k, l, :])
    

However, inside the nested for loop this approach increases total execution time by a factor of 3. I don't understand why simply putting this operation in a nested for loop degrades performance so dramatically. If anyone has any ideas on how to fix this, I would really appreciate it!

William Smith

2 Answers


Why does multi-processing slow down a nested for loop in python?

Creating a process is a very expensive system operation. The operating system has to remap a lot of pages (program, shared library, data, etc.) so that the newly created processes can access those of the initial process. The multiprocessing package also uses inter-process communication to share the work between processes, which is slow too, not to mention the required final join operation. To be efficient (i.e. to reduce the overheads as much as possible), a Python program using the multiprocessing package should share a small amount of data and perform expensive computations. In your case, you do not need the multiprocessing package since you only use Numpy arrays (see later).
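
To get a feel for that fixed cost, here is a minimal sketch (my own illustration, not code from the question or the answer) that times a pool whose tasks do essentially no work, so the measurement is dominated by process start-up, task dispatch and the pickling of results between processes:

    import time
    import numpy as np
    from multiprocessing import Pool

    def tiny_task(i):
        # Almost no computation: the total time is dominated by process
        # creation and inter-process communication, not by this function.
        return np.zeros((2, 4))

    if __name__ == '__main__':
        start = time.time()
        with Pool(processes=6) as pool:
            results = pool.map(tiny_task, range(878))
        print('overhead-dominated run: %.3f s' % (time.time() - start))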

This is a slightly different approach where I use the numexpr library to do a faster linalg.norm operation. For a single operation this approach reduces runtime by a factor of 10.

Numexpr uses threads rather than processes, and threads are lightweight compared to processes (i.e. less expensive). Numexpr also uses aggressive optimizations to speed up the evaluated expression as much as possible (something CPython does not do).

I don't understand why simply putting this operation in a nested for loop would decrease performance so dramatically?

The default implementation of Python is CPython, which is an interpreter. Interpreters are generally very slow (especially CPython), and CPython performs almost no optimization of your code. If you want fast loops, then you need alternatives that compile them to native code or JIT them. You can use Cython or Numba for that. Both can provide simple ways to parallelize your program. Using Numba is probably the simplest solution in your case. You can start by looking at the example programs.
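
As a rough sketch (my own adaptation of the question's loop, not code taken from the Numba examples), the nested loop could be compiled and parallelized roughly like this, assuming AFeatures and BFeatures are the floating-point arrays described in the question:

    import numpy as np
    from numba import njit, prange

    @njit(parallel=True)
    def all_pairs_norm(A, B):
        # Same nested loop as the original code, but compiled to native
        # code by Numba; prange spreads the outer loop across threads.
        n = A.shape[0]
        out = np.zeros((n, n, 2, 4))
        for i in prange(n):
            for j in range(n):
                for k in range(2):
                    for l in range(4):
                        diff = A[i, k, l] - B[j, k, l]
                        out[i, j, k, l] = np.sqrt(np.sum(diff * diff))
        return out

    # per_slice_comparisons = all_pairs_norm(AFeatures, BFeatures)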


Update: if the implementation of Numpy is multithreaded, then the multiprocessing code can be much slower. Indeed, each process will create N threads on a machine with N cores, so N*N threads will be run in total. This situation is called over-subscription and is known to be inefficient (due to preemptive multitasking and especially context switches). One way to check this hypothesis is simply to look at how many threads are created (e.g. using the hwloc tool on POSIX systems) or to monitor the processor usage.
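
One concrete way to do that check (a sketch using the threadpoolctl package, which is my suggestion here rather than something prescribed by the answer) is to list the native thread pools that Numpy's BLAS backend creates and, if needed, cap them while the code runs:

    import numpy as np
    from threadpoolctl import threadpool_info, threadpool_limits

    # Print the native thread pools (OpenBLAS, MKL, OpenMP, ...) and their sizes.
    for pool in threadpool_info():
        print(pool.get('internal_api'), pool.get('num_threads'))

    # Re-run the heavy Numpy work with each pool capped at one thread; if the
    # multiprocessing version stops slowing down, over-subscription was the cause.
    with threadpool_limits(limits=1):
        a = np.random.rand(1000, 1000)
        _ = a @ a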

Jérôme Richard
  • Thanks for your answer! I took a look at the examples on the page you linked and ran them, but these were the results of the default multithreaded example in my conda environment: numpy (1 thread): 19 ms, numba (1 thread): 86 ms, numba (4 threads): 33 ms. And outside it: numpy (1 thread): 92 ms, numba (1 thread): 78 ms, numba (4 threads): 25 ms. – William Smith May 10 '21 at 22:55
  • This is a bit surprising; I got a speed-up on my machine: numpy (1 thread): 61 ms, numba (1 thread): 34 ms, numba (4 threads): 16 ms. I regularly get big speed-ups with Numba (not only from parallelism). There might be an issue with the version of Numba you use on your machine. If you have some free time, you can open a ticket on the [bug tracker](https://github.com/numba/numba/issues). Otherwise, you can just try Cython (I have never seen Cython code run slower than CPython so far), but I think Numba is more flexible and easier to use. – Jérôme Richard May 10 '21 at 23:06
  • Ha, you use Anaconda! I think Anaconda can use parallelism inside Numpy calls, making them faster (not to mention that the built-ins are generally quite well optimized thanks to some Intel packages like the MKL). In that case, the "1 thread" version does not actually use 1 thread. If you use the Intel Anaconda package, I think another reason Numpy can be faster in this example is that the Intel MKL has its own very fast math functions (thanks to the Intel SVML) that Numba probably does not use here. As a result, this example may not be representative of all cases. – Jérôme Richard May 10 '21 at 23:17
  • Yeah, I am surprised too; your answer makes sense otherwise. I know numpy is faster on Conda, but I wasn't expecting it to be that much faster. Thank you for your suggestions! – William Smith May 10 '21 at 23:17
  • Yeah I think you are right! Is there any way to speed up numpy even more or is Anaconda giving me everything I can get? – William Smith May 10 '21 at 23:18
  • Actually, the fact that your Numpy can be multithreaded can have an impact on the multiprocessing code: if each process creates N threads on a machine with N cores, then you end up with N*N threads. This case is called over-subscription and is known to be inefficient (one reason is that the OS needs to schedule a lot of threads and makes use of slow pre-emption with context switches). Note that Anaconda can speed up Numpy a lot, but AFAIK not loops, nor any pure-Python code. So Numba can still help you a lot when you know that loops make your code slow. Note that Numba can call Numpy functions. – Jérôme Richard May 10 '21 at 23:23
  • So I could deactivate the Anaconda environment and try Numba instead? – William Smith May 10 '21 at 23:34
  • You can use both Anaconda and Numba in most cases (Numba is just a Python package like any other). But if your code is parallelized using Cython, Numba or the multiprocessing package, then you should be very careful. The best option would be to control the threading policy of your Anaconda (assuming this is the problem). It seems possible using [environment variables](https://stackoverflow.com/questions/30791550/limit-number-of-threads-in-numpy); see the sketch after these comments. – Jérôme Richard May 10 '21 at 23:44
  • Thank you so much! – William Smith May 10 '21 at 23:49
  • @JérômeRichard "anaconda" is just a distribution of regular CPython. The important thing here, with respect to numpy/scipy, is what BLAS/LAPACK backend you are getting. Anaconda will typically be distributed with a good one for whatever system you have, but that has nothing to do with anaconda per se. – juanpa.arrivillaga May 11 '21 at 00:25
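
As a follow-up to the environment-variable suggestion in the comments above, here is a minimal sketch of what that could look like; which variable is honoured depends on the BLAS backend the Numpy build uses (MKL, OpenBLAS, ...), so setting several of them is an assumption on my part rather than a documented requirement:

    # The variables must be set before Numpy is imported for the first time.
    import os
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['OPENBLAS_NUM_THREADS'] = '1'

    import numpy as np  # imported only after the limits are in place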

Just a quick update from me on this issue. I found that when calculating the Euclidean distance between different high-dimensional vectors, I got the best results using numpy within Anaconda. Using multiprocessing on top of that did not achieve any significant improvement.

However, I later found the recent Faiss library via a code example (https://github.com/QVPR/Patch-NetVLAD). Faiss (https://anaconda.org/pytorch/faiss-gpu) is a library for clustering and for calculating distances between vectors, and it can compute both the cosine and the Euclidean distance. The speed-up that can be achieved with this library is, to put it simply, gigantic: well in excess of a factor of 100 for comparing large numbers of high-dimensional matrices. It has been a total game changer for my research and I would highly recommend it, particularly for comparing large neural network descriptors.
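
As a rough illustration of the kind of usage involved (my own sketch, not code from the Patch-NetVLAD repository), an exact brute-force Euclidean search with Faiss looks roughly like this; the array names and sizes are placeholders standing in for flattened descriptors:

    import numpy as np
    import faiss

    d = 512                                          # illustrative vector dimensionality
    xb = np.random.rand(7024, d).astype('float32')   # database vectors (stand-in for BFeatures slices)
    xq = np.random.rand(7024, d).astype('float32')   # query vectors (stand-in for AFeatures slices)

    index = faiss.IndexFlatL2(d)   # exact search; returns squared L2 distances
    index.add(xb)                  # add the database vectors to the index
    D, I = index.search(xq, 5)     # distances and ids of the 5 nearest neighbours per query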

William Smith