≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣ ( How to parallelize this matrix operation ) ≣≣≣≣≣≣≣
Q :
"Is there a way to speed up such an operation?"
A :
Well, here we have two, very different, requests:
(a) to parallelize something, without seeing any wider context, except knowing about a Python code-execution ecosystem and a rough sizing, yet w/o the dtype, and
(b) to speed up this context-agnostic SLOC computation.
ACTUAL HARDWARE ECOSYSTEM - x86-CISC | ARM-RISC | ... - actual NUMA-properties ?

Details matter.
Smart optimisation of large matrix operations can get final processing from N-days into 11~12 minutes (long story here), so indeed, details do matter.
SUMMARY OF KEY PERFORMANCE-LIMITING FACTS :
Given the sizes, there is about 16E3 * 16E3 / 1E9 * 8 * ( 1 + 1 + 1 ) ~ 6.144 [GB] of RAM footprint the operations will have to cover, if 8B-cells are the dtype, more if complex128 ( or even object ... ), less if any kind of less rich { 4B | ... | 1B } data representation is used ( pity for not having mentioned that in the O/P context ), so let's work with the 8B one. Using a smaller dtype may, perhaps surprisingly, not mean the resulting process will grow in speed. There have been many prior large-scale numpy cases, where 8B-dtype data-cells actually got processed way faster in numerical work, as no CPU-word adaptations were needed, even at the cost of twice larger memory-footprints and memory-I/O traffic, whereas the smaller data-footprints were way slower.
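As a quick sanity check of the sizing above ( the dtype and the exact shape are assumptions, since the O/P did not state them ), a few lines of numpy introspection reproduce the figure:

```python
import numpy as np

N     = 16_000              # rough sizing, as hinted in the O/P
dtype = np.float64          # the assumed 8B-cells; the O/P did not state the dtype

one_matrix_GB = N * N * np.dtype( dtype ).itemsize / 1E9
print( one_matrix_GB )      # ~ 2.048 [GB] per matrix
print( 3 * one_matrix_GB )  # ~ 6.144 [GB] for the ( 1 + 1 + 1 ) matrices involved

# the same figure can be read from any live array via its .nbytes attribute:
# np.zeros( ( N, N ), dtype = dtype ).nbytes / 1E9   # ~ 2.048 [GB]
```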
WHERE ARE OUR MAIN ENEMIES TO FIGHT AGAINST ?
This 6.1 GB RAM-footprint, given the trivial mathematical "heaviness" of the calculations ( doing just a sum of three, kind of "wildly"-indexed cell-elements for each target matrix cell-location ), will be our main "ENEMY" in the fight-for-speed - the request under (b).
The "wild" ( yet still not as wild as it could be in other cases ) indexing will be our second "ENEMY" in doing the performance tweaking.
The wish to "parallelize" something inside the python ecosystem introduces a problem, as the python interpreter, still in 2022-Q1, relies on a pure-[SERIAL], any & all concurrency preventing GIL-lock step-dancing. Sure, numpy, numba and other tools that step outside the native Python interpretation help in circumventing the GIL-interleaved, one-block-after-another-block, about 5 [ms] long ( by default ), chopped code-block interpretations.
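The GIL hand-over period can be read ( and tuned ) directly from the interpreter; 0.005 [s] is the CPython 3 default:

```python
import sys

print( sys.getswitchinterval() )   # 0.005 [s] by default - the GIL hand-over period
# sys.setswitchinterval( 0.010 )   # may be tuned, yet it never turns [SERIAL] into [PARALLEL]
```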
This said, the request (a) has no trivial way to be met. Numpy can "escape" from the GIL, yet that is no kind of [PARALLEL] code-execution. Numpy is smart enough to "vectorise" ( and the machine CPU-architecture may get better harnessed to use (or not) CPU-CISC/RISC ILP & smart AVX-512, FMA3, FMA4 and similar CPU-uOps-level available operations, if numpy was previously setup and compiled to indeed become able to use any of these silicon-based, platform-specific powers ), nevertheless none of our main enemies gets solved by whatever numpy-hard-coded smartness, if we do not help it to work as smart as possible.
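Whether your local numpy build was indeed compiled to harness such silicon powers can be inspected directly ( the output obviously differs per installation ):

```python
import numpy as np

np.show_config()       # reports the BLAS / LAPACK backends this numpy build links against
# numpy >= 1.24 can additionally report the SIMD extensions ( AVX-512, FMA3, ... ) it detected:
# np.show_runtime()
```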
RAM-allocations for ad-hoc or interim objects ( here speaking of N x 2.048 GB RAM-footprints ) are awfully expensive - so pre-allocate & re-use where possible.
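A minimal illustration of the pre-allocate & re-use idea ( array names and the formula are placeholders, not the O/P SLOC; shrink N for a quick local test ):

```python
import numpy as np

N = 16_000                       # each float64 matrix of this size weighs ~ 2.048 [GB]
a = np.random.rand( N, N )
b = np.random.rand( N, N )

# naive style: every line below silently allocates a fresh ~ 2 [GB] interim
# c = a + b
# c = c * -2

# pre-allocate once, then re-use the buffer via the ufunc out= parameter
c = np.empty_like( a )           # one-off allocation, destructible & re-usable
np.add(      a, b, out = c )     # filled in place, no new interim allocated
np.multiply( c, -2, out = c )    # still in place, still no new interim
```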
2020: Still some improvements, prediction for 2025
-------------------------------------------------------------------------
         0.1 ns - NOP
         0.3 ns - XOR, ADD, SUB
         0.5 ns - CPU L1 dCACHE reference           ( 1st introduced in late 80-ies )
         0.9 ns - JMP SHORT
         1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
       3~4   ns - CPU L2  CACHE reference           ( 2020/Q1 )
         5   ns - CPU L1 iCACHE Branch mispredict
         7   ns - CPU L2  CACHE reference
        10   ns - DIV
        19   ns - CPU L3  CACHE reference           ( 2020/Q1 considered slow on 28c Skylake )
        71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
       100   ns - MUTEX lock/unlock
       100   ns - own DDR MEMORY reference
       135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
       202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
       325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
         |   |   |   |...
         |   |   |   |... ns
         |   |   |... us
         |   |... ms
         |... s
RAM-memory-I/O ( speaking of N x 2.048 GB RAM-footprints ) is orders of magnitude beyond any L3 / L2 / L1 Cache sizes, so the in-Cache re-use falls to zero and access times jump about three orders of magnitude - from tenths of [ns] to hundreds of [ns] ( for details look here ) - so best avoid any kind of looping and best leave the numpy-vectorised tricks to do their best, in as efficient vectorising & tiling as their code-designers were able to invent & implement ( the numpy.flags FORTRAN-v/s-C, i.e. column-major v/s row-major, ordering may help, yet having no outer context of the SLOC, there is no chance to guess whether this may help in outer-scope operations ).
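The ordering of an array is visible ( and convertible ) as sketched below; whether a conversion pays off depends entirely on the outer-scope access pattern, which the O/P did not show:

```python
import numpy as np

a = np.random.rand( 4_000, 4_000 )      # smaller stand-in, same principle

print( a.flags['C_CONTIGUOUS'] )        # True  - numpy's default row-major ( C ) ordering
print( a.flags['F_CONTIGUOUS'] )        # False - unless created or converted as column-major

aF = np.asfortranarray( a )             # a column-major copy - may, or may not, pay off,
                                        # depending on how the outer scope traverses the data
```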
POTENTIAL FOR BOTH FASTEST & PARALLEL SOLUTION
My ( localhost non-benchmarkable ) candidate would be to ( a minimal sketch follows right after this list ) :
- use preallocated & re-used ( i.e. destructible ) matrix1, matrix2 + possibly separate & maintained vectors for the matrix, matrix2 diagonals
then and only then ( and only if the wider context of the SLOC permits )
- split the [SERIAL] operations of MUL and ADD into a pair of possibly [PARALLEL] code-execution streams
-- one to do a vectorised pre-store ( using vector broadcast ) and in-place add ( matrix1 += ) of the separately maintained diagonal vectors vecDmatrix, vecDmatrix2 ( broadcast and transposed accordingly ) into matrix1
-- the other to in-place MUL matrix2 by -2 ~ do a matrix2 *= -2 step separately from the other additions
-- finally, upon full completion of both of the mutually independent ( thus indeed parallelisable ) flows of operations, perform an in-place ADD, as matrix1 += matrix2 ( where the in-place matrix2 *= -2 has been pre-computed separately, in the previous true-[PARALLEL]-section )
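A minimal, non-benchmarked sketch of this orchestration follows. The array names, the N ~ 16E3 sizing and the reconstructed formula ( broadcast diagonals of matrix and matrix2, minus 2 * matrix2 ) are assumptions, as the O/P SLOC itself was not shown; where the numpy build releases the GIL for these large in-place operations the two streams may truly overlap, otherwise the threads simply serialise with negligible harm:

```python
import numpy as np
from threading import Thread

N = 16_000                                     # rough sizing from the O/P ( shrink for a local test )

matrix  = np.random.rand( N, N )               # ASSUMED input, only its diagonal gets re-used
matrix2 = np.random.rand( N, N )               # ASSUMED input, destructible ( gets *= -2 in place )
matrix1 = np.empty( ( N, N ) )                 # pre-allocated, re-used result buffer

vecDmatrix  = matrix.diagonal().copy()         # separately maintained diagonal vectors
vecDmatrix2 = matrix2.diagonal().copy()

def stream_add_diagonals():                    # stream 1 : broadcast both diagonals into matrix1
    np.add( vecDmatrix[:, None],               # column-vector broadcast
            vecDmatrix2[None, :],              # row-vector ( "transposed" ) broadcast
            out = matrix1 )                    # in-place store, no interim allocated

def stream_scale_matrix2():                    # stream 2 : independent in-place MUL by -2
    np.multiply( matrix2, -2, out = matrix2 )

t1 = Thread( target = stream_add_diagonals )
t2 = Thread( target = stream_scale_matrix2 )
t1.start(); t2.start()
t1.join();  t2.join()                          # the true-[PARALLEL]-section ends here

matrix1 += matrix2                             # final in-place ADD, only after both streams completed
```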
In cases, and only in such cases, where the setup & other overhead add-on costs of doing this true-[PARALLEL]-section orchestration do not grow over the net effect of doing either part of the work separately, you may enjoy achieving the request (a) without damaging the goal of the request (b), never otherwise.