
I couldn't find the source code, or any documentation, describing how the algorithm behind np.multiply() is implemented.

I couldn't find it in the manuals:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.multiply.html

https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.multiply.html

Does anyone know whether the backend source code for np.multiply() is set up for multiprocessing/multithreading? I am asking because I have written my own code to calculate the Kronecker product using parallel programming (joblib.Parallel), but when I timed it, np.kron() (which uses np.multiply()) still runs faster than my parallel code.

Edit:

This is the code I have written for my Kronecker product:

from itertools import product
from joblib import Parallel, delayed
from functools import reduce
from operator import mul
import numpy as np

lst = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
arr = np.array(lst)
n = 2

def test1(arr, n):
    # Enumerate every length-n combination of the flattened entries of arr.
    flat = np.ravel(arr).tolist()
    gen = (list(a) for a in product(flat, repeat=n))

    # Multiply each combination out in parallel, one task per combination.
    results = Parallel(n_jobs=-1)(delayed(reduce)(mul, x) for x in gen)

    nrows = arr.shape[0]
    ncols = arr.shape[1]

    # Reshape the flat list of products into the blocked Kronecker layout,
    # then stitch the blocks back together into a single 2-D array.
    arr_multi_dim = np.array(results).reshape((nrows, ncols) * n)
    arr_final = np.concatenate(np.concatenate(arr_multi_dim, axis=1), axis=1)

    return arr_final
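
To sanity-check the parallel version against the built-in (a quick verification and timing sketch; it assumes the arr, n and test1 defined above):

# For n = 2 the result should match a single np.kron of arr with itself.
print(np.array_equal(test1(arr, n), np.kron(arr, arr)))   # True

# Rough timing comparison; np.kron wins despite being single-process.
import timeit
print(timeit.timeit(lambda: np.kron(arr, arr), number=100))
print(timeit.timeit(lambda: test1(arr, n), number=100))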
  • Show what you have done so far. `np.multiply` is a simple element-wise multiplication of two arrays. You can easily get the same speed as `np.multiply` in Cython, C (both with march=native), or Numba. But there is usually not much speedup possible using multithreading, and almost certainly none with multiprocessing, because of the simplicity of the problem (memory bandwidth). – max9111 Mar 09 '20 at 13:39
  • `np.multiply`, the default `*` operation, is compiled code. Apart from some NOGIL directives, it does not use external 'multi-*' packages. The key to its speed is efficient iteration over the elements in C code using strides. For specialized cases you might improve on it using compiling tools like `numba` and `cython`. – hpaulj Mar 09 '20 at 16:59
  • Hi @max9111, many thanks for your reply. I have edited my question with the code that I have written so far. Please let me know if you can see anywhere I could make it run quicker. Thanks heaps. – Leockl Mar 10 '20 at 10:27
  • Hi @hpaulj, many thanks for your tip and for pointing me in the right direction. – Leockl Mar 10 '20 at 10:28
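
Following up on hpaulj's comment above, here is a minimal sketch of the compiled element-wise loop idea (illustrative only; written with Numba, not NumPy's actual implementation):

import numpy as np
from numba import njit

@njit(cache=True)
def mul_1d(a, b):
    # Plain element-wise multiply over 1-D arrays: the same strided
    # C-level loop idea that makes np.multiply fast, compiled by Numba.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] * b[i]
    return out

a = np.arange(9.0)
print(np.array_equal(mul_1d(a, a), np.multiply(a, a)))   # True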

1 Answer


Your efforts here and here spend more on add-on overheads (process-instantiation costs, plus the data-distribution costs of passing parameters there and back, i.e. shipping inputs out for the remote calculation step, then returning and consolidating the results) than the work itself is worth, which is the very opposite direction to the numpy one.

Efficiency is on the numpy side, due to a carefully engineered, GIL-free core design that can also use vectorised processing (i.e. computing more things in one CPU-instruction step, thanks to the deliberate use of aligned data, ILP and AVX-like processor instrumentation).

Given these strong advantages, plus numpy's smart in-place / zero-copy processing (re-using L1/L2/L3-cached data is many orders of magnitude faster than any attempt to set up and operate a set of distributed processes, which has to pay extra costs on every call: RAM-copy + SER/DES + IPC-marshalling on the way out, the compute itself, then RAM-copy + SER/DES + IPC-marshalling on the way back), smart numpy-based code will beat almost any other attempt to do the same.
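
To illustrate, here is a minimal broadcasting sketch of that single-process, vectorised route (not the literal NumPy source of np.kron, just the same idea expressed with one plain numpy multiply):

import numpy as np

def kron2d(a, b):
    # Kronecker product of two 2-D arrays via broadcasting: one vectorised
    # multiply over cached data, with no spawn, no SER/DES, no IPC.
    m, n = a.shape
    p, q = b.shape
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(m * p, n * q)

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(np.array_equal(kron2d(a, a), np.kron(a, a)))   # True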


Never forget the add-on costs, and learn to understand the adverse effects of scaling:

             0.1 ns - NOP
             0.3 ns - XOR, ADD, SUB
             0.5 ns - CPU L1 dCACHE reference           (1st introduced in the late 1980s)
             0.9 ns - JMP SHORT
             1   ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance -- will hold throughout any foreseeable future :o)
?~~~~~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
           3~4   ns - CPU L2  CACHE reference           (2020/Q1)
             5   ns - CPU L1 iCACHE Branch mispredict
             7   ns - CPU L2  CACHE reference
            10   ns - DIV
            19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
            71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
           100   ns - MUTEX lock/unlock
           100   ns - own DDR MEMORY reference
           135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
           202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
           325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
        10,000   ns - Compress 1K bytes with a Zippy PROCESS
        20,000   ns - Send     2K bytes over 1 Gbps  NETWORK
       250,000   ns - Read   1 MB sequentially from  MEMORY
       500,000   ns - Round trip within a same DataCenter
?~~~ 2,500,000   ns - Read  10 MB sequentially from  MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
    10,000,000   ns - DISK seek
    10,000,000   ns - Read   1 MB sequentially from  NETWORK
?~~ 25,000,000   ns - Read 100 MB sequentially from  MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
    30,000,000   ns - Read 1 MB sequentially from a  DISK
?~~ 36,000,000   ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
   150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|
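
A quick way to see those add-on costs directly (a sketch; absolute numbers will vary per machine) is to time even a near-empty parallel dispatch against the numpy call it was supposed to beat:

import timeit
import numpy as np
from joblib import Parallel, delayed

a = np.arange(1, 10).reshape(3, 3)

# The numpy path: one vectorised call, data stays in cache.
t_np = timeit.timeit(lambda: np.kron(a, a), number=1000) / 1000

# A near-trivial parallel dispatch still pays spawn + SER/DES + IPC costs.
t_par = timeit.timeit(
    lambda: Parallel(n_jobs=-1)(delayed(float)(x) for x in range(9)),
    number=10) / 10

print(f"np.kron        ~ {t_np:.2e} s per call")
print(f"empty Parallel ~ {t_par:.2e} s per call")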
  • Thanks a million @user3666197 for this explanation. This really helps me understand how this works. – Leockl Mar 10 '20 at 09:54