1

I know that questions along these lines have been asked before, but all of them seem to be outdated. If I run the code:

import numpy as np

a = np.random.randn(10000, 10000)
b = np.random.randn(10000, 10000)
c = a*b

htop shows that only one core is being used, at more or less 100% capacity. So why doesn't numpy speed up processing by using more than one thread? Is this caused by my installation of numpy, or by some dependent library?

I am running this on Fedora 36.

I tried to implement multithreading manually by splitting the arrays and performing the operations asynchronously, but the results I got were simply incorrect. Reading through the aforementioned old questions, maybe this has something to do with the BLAS library or its environment variables, but that looks outdated to me: I couldn't find the library on my machine, nor the environment variables.
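
For what it's worth, here is a quick way to check which BLAS/LAPACK backend numpy was built against; as far as I understand, variables like OPENBLAS_NUM_THREADS or MKL_NUM_THREADS only affect BLAS-backed calls such as np.dot, not elementwise multiplication:

import numpy as np

# Lists the BLAS/LAPACK libraries numpy was linked against
# (e.g. OpenBLAS or MKL); these back np.dot, not a*b.
np.show_config()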

Any help would be sincerely appreciated.

  • 2
    The linked question is not outdated, it is still the case, and this question is a duplicate of it. – Ahmed AEK Jan 05 '23 at 19:53
  • 3
    For the first point, there are also more official sources like [this issue](https://github.com/numpy/numpy/issues/12269) (and it is not outdated either). Besides, generating random numbers in parallel is not as simple as it seems when you care about having deterministic results, which is what Numpy does. Parallelism is generally simpler (and actually also often more efficient) when applied at a coarse granularity. Parallelising Numpy entirely is a huge amount of work, and it is not as interesting as it seems (for many reasons that would take too long to explain here). – Jérôme Richard Jan 05 '23 at 19:58

1 Answer

2

It's not really beneficial. Multithreading the random generator is covered in the docs under Multithreaded Random Generation (a sketch of that approach is at the end of this answer), and the correct way to multithread numpy elementwise multiplication is as follows:

import numpy as np
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
import time

a = np.random.randn(10000, 10000)
b = np.random.randn(10000, 10000)

threads = 8  # or cpu_count()

def multiply_wrapper(args):
    # write each partial result directly into the matching slice of c
    np.multiply(args[0], args[1], out=args[2])

with ThreadPool(threads) as pool:
    # array_split returns views, so writing into splits_c fills c
    splits_a = np.array_split(a, threads)
    splits_b = np.array_split(b, threads)
    c = np.empty_like(a)  # to hold the results
    splits_c = np.array_split(c, threads)

    t1 = time.perf_counter()
    pool.map(multiply_wrapper, zip(splits_a, splits_b, splits_c))
    t2 = time.perf_counter()
    c2 = a * b
    t3 = time.perf_counter()

    assert np.all(c == c2)  # both methods are equivalent

    print("threaded version", t2 - t1)
    print("serial version", t3 - t2)

which outputs:

threaded version 0.12851539999246597
serial version 0.23459540004841983

Note that even though your arrays are really big (10_000, 10_000), multithreading this operation resulted in only a small performance boost, and that is not even taking into account the overhead of spawning the threads.

On smaller arrays, the overhead of spawning threads will be higher than the time taken to do the calculation itself, so just putting threads everywhere is generally not correct and can hurt performance in a lot of cases.
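
For completeness, here is a minimal sketch of the multithreaded random generation approach from the docs page mentioned above: independent generators are spawned from a single SeedSequence, so the result stays deterministic (the helper name fill_chunk is my own):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

threads = 8
out = np.empty(10000 * 10000)
chunks = np.array_split(out, threads)  # views into out

# one independent, reproducible stream per thread
rngs = [np.random.default_rng(s)
        for s in np.random.SeedSequence(12345).spawn(threads)]

def fill_chunk(rng, chunk):
    # filling a preallocated buffer releases the GIL
    rng.standard_normal(out=chunk)

with ThreadPoolExecutor(threads) as ex:
    list(ex.map(fill_chunk, rngs, chunks))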

Ahmed AEK
  • Thank you! I was trying to use multithreading in a fluid simulation that would benefit from it; the example I gave was just an illustration. – Gabriel Franceschi Libardi Jan 05 '23 at 20:15
  • @GabrielFranceschiLibardi consider using CuPy instead (a minimal sketch follows these comments); GPUs will do this 10-50 times faster than CPUs (because of their faster memory). Another alternative is Julia, which will in fact be faster than numpy for multithreaded work. – Ahmed AEK Jan 05 '23 at 20:28
  • @Ahmed_AEK thank you for the advice. I will be implementing my simulation for GPUs soon; I am just prototyping for now... – Gabriel Franceschi Libardi Jan 05 '23 at 20:39
  • 1
    At the *same price*, GPUs provide a 10x speed-up, but not much more for memory-bound codes. Recent good middle-end PCs with 2x DDR5 RAM can typically reach a 60 GiB/s throughput, while the top middle-end GPUs like the RTX 3070 reach ~450 GiB/s and the RTX 3060/3070 Ti reach 600 GiB/s, and these are actually pretty expensive GPUs already (>650€-750€). The RAM of the Apple M2 CPU can reach 100 GiB/s. For servers, 2 Xeon Platinum 8362 can reach 2x190=380 GiB/s (at 2x6000+1000=13000€ including the 256 GiB of RAM) while an H100 GPU reaches 2000 GiB/s (at probably 30000€). – Jérôme Richard Jan 06 '23 at 18:37
  • 1
    For AMD CPUs, the AMD EPYC 9654P apparently costs 11000€ (+ 2500€ for 256 GiB of DDR5 RAM), and it can reach ~420 GiB/s alone. Some processors are better for memory-bound tasks, like the AMD EPYC 9354, which can reach the same throughput for "only" 3500€ (+2500€ of RAM). Two of them are significantly cheaper than one H100 GPU (12000€ vs 30000€), and the 2 CPUs can reach ~800 GiB/s combined (vs 2000 GiB/s for the GPU). – Jérôme Richard Jan 06 '23 at 18:56
  • 1
    Last but not least, PC GPUs are slow at double precision (so they are often useless for most physics simulations, for example, since high precision is required). CuPy also needs an Nvidia GPU (AFAIK it does not work on Intel GPUs/iGPUs or AMD ones). Server GPUs can be efficient at double precision (most are Nvidia ones, so that is OK for CuPy). On the CPU side, using Numba is certainly a good start, and it is more flexible than GPUs, which are also harder to program. – Jérôme Richard Jan 06 '23 at 19:02
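
For anyone trying the CuPy route suggested in the comments, a minimal sketch of the same elementwise multiply (this assumes an Nvidia GPU with a working CUDA setup; CuPy mirrors the numpy API):

import cupy as cp

a = cp.random.randn(10000, 10000)
b = cp.random.randn(10000, 10000)

c = a * b  # the kernel runs asynchronously on the GPU
cp.cuda.Stream.null.synchronize()  # wait for it to finish before timing

c_host = cp.asnumpy(c)  # copy the result back to a numpy array if needed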