≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣≣ ( How to parallelize this matrix operation ) ≣≣≣≣≣≣≣
Q :
"Is there a way to speed up such an operation?"
A :
Well, here we have two, very different, requests:
(a) to parallelize something, without seeing any wider context, except knowing about a Python code-execution ecosystem and a rough sizing, yet w/o the dtype, and
(b) to speed up this context-agnostic SLOC computation.
ACTUAL HARDWARE ECOSYSTEM - x86-CISC | ARM-RISC | ... - actual NUMA-properties ?

Details matter.
Smart optimisation of large matrix operations can get final processing from N-days into 11~12 minutes (long story here), so indeed, details do matter.
SUMMARY OF KEY PERFORMANCE-LIMITING FACTS :
Given the sizes, there is about 16E3 * 16E3 / 1E9 * 8 * ( 1 + 1 + 1 ) ~ 6.144 [GB] of RAM footprint the operations will have to cover, if 8B-cells are the dtype, more if complex128 ( or even object ... ), less if any kind of less rich { 4B | ... | 1B } data representation is used ( pity for not having mentioned that in the O/P context ), so let's work with the 8B one. Using a smaller dtype may, perhaps surprisingly, not mean the resulting process will grow in speed. There have been many prior large-scale numpy cases, where 8B-dtype data-cells actually got processed way faster in numerical work, as no CPU-word adaptations were needed, even at the cost of twice larger memory-footprints and memory-I/O traffic, whereas the smaller data-footprints were way slower.
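As a quick sanity check of the sizing above ( the dtype and the exact shape are assumptions, since the O/P did not state them ), a few lines of numpy introspection reproduce the figure:

```python
import numpy as np

N     = 16_000              # rough sizing, as hinted in the O/P
dtype = np.float64          # the assumed 8B-cells; the O/P did not state the dtype

one_matrix_GB = N * N * np.dtype( dtype ).itemsize / 1E9
print( one_matrix_GB )      # ~ 2.048 [GB] per matrix
print( 3 * one_matrix_GB )  # ~ 6.144 [GB] for the ( 1 + 1 + 1 ) matrices involved

# the same figure can be read from any live array via its .nbytes attribute:
# np.zeros( ( N, N ), dtype = dtype ).nbytes / 1E9   # ~ 2.048 [GB]
```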
WHERE ARE OUR MAIN ENEMIES TO FIGHT AGAINST ?
This 6.1 GB RAM-footprint, given the trivial mathematical "heaviness" of the calculations ( doing just a sum of three, kind of "wildly"-indexed cell-elements for each target matrix cell-location ), will be our main "ENEMY" in the fight-for-speed - the request under (b).
The "wild" ( yet still not as wild as it could be in other cases ) indexing will be our second "ENEMY" in doing the performance tweaking.
The wish to "parallelize" something inside the python ecosystem introduces a problem, as the python interpreter, still in 2022-Q1, relies on a pure-[SERIAL], any & all concurrency preventing GIL-lock step-dancing. Sure, numpy, numba and other tools that step outside the native Python interpretation help in circumventing the GIL-interleaved, one-block-after-another-block, about 5 [ms] long ( by default ), chopped code-block interpretations.
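The GIL hand-over period can be read ( and tuned ) directly from the interpreter; 0.005 [s] is the CPython 3 default:

```python
import sys

print( sys.getswitchinterval() )   # 0.005 [s] by default - the GIL hand-over period
# sys.setswitchinterval( 0.010 )   # may be tuned, yet it never turns [SERIAL] into [PARALLEL]
```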
This said, the request (a) has no trivial way to be met. Numpy can "escape" from the GIL, yet that is no kind of [PARALLEL] code-execution. Numpy is smart enough to "vectorise" ( and the machine CPU-architecture may get better harnessed to use (or not) CPU-CISC/RISC ILP & smart AVX-512, FMA3, FMA4 and similar CPU-uOps-level available operations, if numpy was previously setup and compiled to indeed become able to use any of these silicon-based, platform-specific powers ), nevertheless none of our main enemies gets solved by whatever numpy-hard-coded smartness, if we do not help it to work as smart as possible.
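Whether your local numpy build was indeed compiled to harness such silicon powers can be inspected directly ( the output obviously differs per installation ):

```python
import numpy as np

np.show_config()       # reports the BLAS / LAPACK backends this numpy build links against
# numpy >= 1.24 can additionally report the SIMD extensions ( AVX-512, FMA3, ... ) it detected:
# np.show_runtime()
```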
RAM-allocations for ad-hoc or interim objects ( here speaking of N x 2.048 GB RAM-footprints ) are awfully expensive - so pre-allocate & re-use where possible.
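A minimal illustration of the pre-allocate & re-use idea ( array names and the formula are placeholders, not the O/P SLOC; shrink N for a quick local test ):

```python
import numpy as np

N = 16_000                       # each float64 matrix of this size weighs ~ 2.048 [GB]
a = np.random.rand( N, N )
b = np.random.rand( N, N )

# naive style: every line below silently allocates a fresh ~ 2 [GB] interim
# c = a + b
# c = c * -2

# pre-allocate once, then re-use the buffer via the ufunc out= parameter
c = np.empty_like( a )           # one-off allocation, destructible & re-usable
np.add(      a, b, out = c )     # filled in place, no new interim allocated
np.multiply( c, -2, out = c )    # still in place, still no new interim
```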
2020: Still some improvements, prediction for 2025
-------------------------------------------------------------------------
         0.1 ns - NOP
         0.3 ns - XOR, ADD, SUB
         0.5 ns - CPU L1 dCACHE reference           ( 1st introduced in late 80-ies )
         0.9 ns - JMP SHORT
         1   ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~ 1   ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
       3~4   ns - CPU L2  CACHE reference           ( 2020/Q1 )
         5   ns - CPU L1 iCACHE Branch mispredict
         7   ns - CPU L2  CACHE reference
        10   ns - DIV
        19   ns - CPU L3  CACHE reference           ( 2020/Q1 considered slow on 28c Skylake )
        71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
       100   ns - MUTEX lock/unlock
       100   ns - own DDR MEMORY reference
       135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
       202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
       325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
         |   |   |   |...
         |   |   |   |... ns
         |   |   |... us
         |   |... ms
         |... s
RAM-memory-I/O ( speaking of N x 2.048 GB RAM-footprints ) is orders of magnitude beyond any L3 / L2 / L1 Cache sizes, so the in-Cache re-use falls to zero and access times jump about three orders of magnitude - from tenths of [ns] to hundreds of [ns] ( for details look here ) - so best avoid any kind of looping and best leave the numpy-vectorised tricks to do their best, in as efficient vectorising & tiling as their code-designers were able to invent & implement ( the numpy.flags FORTRAN-v/s-C, i.e. column-major v/s row-major, ordering may help, yet having no outer context of the SLOC, there is no chance to guess whether this may help in outer-scope operations ).
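The ordering of an array is visible ( and convertible ) as sketched below; whether a conversion pays off depends entirely on the outer-scope access pattern, which the O/P did not show:

```python
import numpy as np

a = np.random.rand( 4_000, 4_000 )      # smaller stand-in, same principle

print( a.flags['C_CONTIGUOUS'] )        # True  - numpy's default row-major ( C ) ordering
print( a.flags['F_CONTIGUOUS'] )        # False - unless created or converted as column-major

aF = np.asfortranarray( a )             # a column-major copy - may, or may not, pay off,
                                        # depending on how the outer scope traverses the data
```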
POTENTIAL FOR BOTH FASTEST & PARALLEL SOLUTION
My ( localhost non-benchmarkable ) candidate would be to ( a minimal sketch follows right after this list ) :
- use preallocated & re-used ( i.e. destructible ) matrix1, matrix2 + possibly separate & maintained vectors for the matrix, matrix2 diagonals
then and only then ( and only if the wider context of the SLOC permits )
- split the [SERIAL] operations of MUL and ADD into a pair of possibly [PARALLEL] code-execution streams
-- one to do a vectorised pre-store ( using vector broadcast ) and in-place add ( matrix1 += ) of the separately maintained diagonal vectors vecDmatrix, vecDmatrix2 ( broadcast and transposed accordingly ) into matrix1
-- the other to in-place MUL matrix2 by -2 ~ do a matrix2 *= -2 step separately from the other additions
-- finally, upon full completion of both of the mutually independent ( thus indeed parallelisable ) flows of operations, perform an in-place ADD, as matrix1 += matrix2 ( where the in-place matrix2 *= -2 has been pre-computed separately, in the previous true-[PARALLEL]-section )
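A minimal, non-benchmarked sketch of this orchestration follows. The array names, the N ~ 16E3 sizing and the reconstructed formula ( broadcast diagonals of matrix and matrix2, minus 2 * matrix2 ) are assumptions, as the O/P SLOC itself was not shown; where the numpy build releases the GIL for these large in-place operations the two streams may truly overlap, otherwise the threads simply serialise with negligible harm:

```python
import numpy as np
from threading import Thread

N = 16_000                                     # rough sizing from the O/P ( shrink for a local test )

matrix  = np.random.rand( N, N )               # ASSUMED input, only its diagonal gets re-used
matrix2 = np.random.rand( N, N )               # ASSUMED input, destructible ( gets *= -2 in place )
matrix1 = np.empty( ( N, N ) )                 # pre-allocated, re-used result buffer

vecDmatrix  = matrix.diagonal().copy()         # separately maintained diagonal vectors
vecDmatrix2 = matrix2.diagonal().copy()

def stream_add_diagonals():                    # stream 1 : broadcast both diagonals into matrix1
    np.add( vecDmatrix[:, None],               # column-vector broadcast
            vecDmatrix2[None, :],              # row-vector ( "transposed" ) broadcast
            out = matrix1 )                    # in-place store, no interim allocated

def stream_scale_matrix2():                    # stream 2 : independent in-place MUL by -2
    np.multiply( matrix2, -2, out = matrix2 )

t1 = Thread( target = stream_add_diagonals )
t2 = Thread( target = stream_scale_matrix2 )
t1.start(); t2.start()
t1.join();  t2.join()                          # the true-[PARALLEL]-section ends here

matrix1 += matrix2                             # final in-place ADD, only after both streams completed
```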
In cases, and only in such cases, where the setup & other overhead add-on costs of doing this true-[PARALLEL]-section orchestration do not grow over the net effect of doing either part of the work separately, you may enjoy achieving the request (a) without damaging the goal of the request (b), never otherwise.