
I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix A (stored in CSR format) by a dense vector x:

import numpy
from numba import jit

@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):
    # computes Ax = A @ x, where A is given by its CSR arrays (data, indices, indptr) and its shape

    numRowsA = Ashape[0]
    Ax       = numpy.zeros( numRowsA )

    for i in range( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax

Here A is a large scipy sparse matrix,

>>> A.shape
( 56469, 39279 )
#                  having ~ 142,258,302 nonzero entries (so about 6.4% of the entries are nonzero)
>>> A.dtype
dtype( 'float32' )

and x is a numpy array. Here is a snippet of code that calls the above function:

x       = numpy.random.randn( A.shape[1] )
Ax      = A.dot( x )   
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
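
( By the way, anyone who wants to reproduce this setup without my particular matrix can build a random CSR matrix of roughly the same shape and density with scipy; the density value below is only approximate: )

import numpy
import scipy.sparse

# roughly the same shape and ~6.4% density as my matrix A (illustrative only);
# note it has ~140 million nonzeros, so it needs a couple of GB of RAM
A = scipy.sparse.random( 56469, 39279, density=0.064,
                         format='csr', dtype=numpy.float32 )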

Notice the @jit decorator, which tells Numba to just-in-time compile the csrMult() function.

In my experiments, my function csrMult() is about twice as fast as the scipy .dot() method. That is a pretty impressive result for Numba.
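
( A simple way to check this kind of comparison: the first csrMult() call triggers Numba's one-off compilation, so it should be excluded from the timing. A rough sketch, not the exact script behind the numbers above: )

import time

csrMult( x, A.data, A.indices, A.indptr, A.shape )   # warm-up call: triggers Numba's JIT compilation

t0 = time.perf_counter()
Ax = A.dot( x )
t1 = time.perf_counter()
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
t2 = time.perf_counter()

print( 'scipy A.dot(x) :', t1 - t0, '[s]' )
print( 'numba csrMult():', t2 - t1, '[s]' )
print( 'max abs diff   :', numpy.abs( Ax - AxCheck ).max() )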

However, MATLAB still performs this matrix-vector multiplication about 6 times faster than csrMult(). I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.


Question:

How can I parallelize the outer for-loop when using Numba?

Numba used to have a prange() function that made it simple to parallelize embarrassingly parallel for-loops. Unfortunately, Numba no longer has prange() [actually, that is false; see the edit below]. So what is the correct way to parallelize this for-loop now that Numba's prange() function is gone?

When prange() was removed from Numba, what alternative did the developers of Numba have in mind?


Edit 1:
I updated to the latest version of Numba, which is 0.35, and prange() is back! It was not included in version 0.33, the version I had been using.
That is good news, but unfortunately I am getting an error message when I attempt to parallelize my for loop using prange(). Following the parallel-loop example in the Numba documentation (section 1.9.2, "Explicit Parallel Loops"), here is my new code:

import numpy as np
from numba import njit, prange

@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = np.zeros( numRowsA )

    for i in prange( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i +=    Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax

When I call this function, using the code snippet given above, I receive the following error:

AttributeError: Failed at nopython (convert to parfors) 'SetItem' object has no attribute 'get_targets'


Given that the above attempt to use prange crashes, my question stands:

What is the correct way ( using prange or an alternative method ) to parallelize this Python for-loop?

As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup running on 20 OpenMP threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).


Edit 2:
Here is my C++ version of csrMult(). Parallelizing the for() loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.

#include <vector>
#include <Eigen/Dense>          // for VectorXd

using std::vector;
using Eigen::VectorXd;

void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
    // This code assumes that the size of Ax is numRowsA.
    #pragma omp parallel num_threads(20)
    {
        #pragma omp for schedule(dynamic,590)
        for (int i = 0; i < Ax.size(); i++)
        {
            double Ax_i = 0.0;
            for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
            {
                Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
            }

            Ax[i] = Ax_i;
        }
    }
}
  • Have you tried the `parallel=True` keyword argument to the `jit` decorator? I mean annotating it with `@jit(parallel=True)`? – f0xdx Oct 25 '17 at 04:48
  • @fxx I just tried replacing `@jit` with `@jit(parallel=True)`, and when I ran my test code snippet I received the following error message: KeyError: " does not support option: 'parallel'" – littleO Oct 25 '17 at 04:53
  • Yes, this is an experimental feature (and depending on your version of numba it might not yet be available). Ok, with that option removed, the next thing I'd try is to port the implementation to `@vectorize` or `@guvectorize` (to generate ufuncs). Maybe you even have to roll the inner loop out into another function for that. – f0xdx Oct 25 '17 at 05:00
  • @littleO Let's be a bit more quantitative in the problem formulation. **How large and how sparse** is the **`A`** matrix ( rows, cols, dtype ) + a ( sparse / dense ) occupancy ratio? N.b.: Trying to compare a MATLAB code-execution with the Py3/Numba ecosystem tooling may be quite misleading. – user3666197 Oct 25 '17 at 08:49
  • @user3666197 I updated the question with some important new information. A has 56,469 rows and 39,279 columns and 142,258,302 nonzero entries (so about 6.4% of its entries are nonzero). The output of type(A[0,0]) is numpy.float32. I wrote a very similar csrMult function in C++ where it was trivial to parallelize the for loop (because C++ supports openMP natively), and my function got about 6 or 7 times faster. I would expect to achieve a similar speedup by parallelizing the for loop in Python when using Numba. – littleO Oct 25 '17 at 09:15
  • **`nopython`** used to be one of the **`numba.jit( ..., nopython = True )`** configuration options, once not willing to let GIL "gymnastics" spoil the numba vectorised-code performance. – user3666197 Oct 25 '17 at 09:43
  • Could you include a random sparse matrix that looks like the one you have? I suspect the problem is that sparse matrices just aren't supported in nopython mode, so numba can't release the GIL (because it has to call back into Python mode) and thus it can't use `prange` or `parallel=True`. – MSeifert Oct 25 '17 at 11:16
  • @MSeifert as Daniel has posted above he uses **csr** matrices from **`scipy.sparse`**, so you can immediately generate any such on your own to test your hypothesis. 56k by 40k sample is not very *practical* to post. – user3666197 Oct 25 '17 at 11:27
  • @MSeifert I'm getting the same error message when I create A using the command `A = scipy.sparse.rand(5000,4000,format='csr')`. And the non-parallel version works when A is created this way. So, you can try constructing A with that command. However, my `csrMult` function only takes numpy arrays (the CSR representation of A) as input, so it seems like sparse matrices do not need to be supported at all for `csrMult` to work. – littleO Oct 25 '17 at 11:28

2 Answers


Numba has been updated and prange() works now! (I'm answering my own question.)

The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:

Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called prange(). After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back prange would be fairly easy

Using Numba version 0.36.1, I can parallelize my embarrassingly parallel for-loop using the following simple code:

import numba
import numpy as np

@numba.jit(nopython=True, parallel=True)
def csrMult_parallel(x,Adata,Aindices,Aindptr,Ashape):

    numRowsA = Ashape[0]
    Ax = np.zeros(numRowsA)

    for i in numba.prange(numRowsA):
        Ax_i = 0.0
        for dataIdx in range(Aindptr[i],Aindptr[i+1]):

            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx]*x[j]

        Ax[i] = Ax_i

    return Ax
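
It is called exactly like the earlier version; here is a quick sanity check against scipy's result ( just a sketch, assuming A and x as defined in the question ):

x          = np.random.randn( A.shape[1] )
Ax         = A.dot( x )
AxParallel = csrMult_parallel( x, A.data, A.indices, A.indptr, A.shape )

print( np.allclose( Ax, AxParallel ) )   # should print True, up to floating-point rounding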

In my experiments, parallelizing the for-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba but was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command Ax = A.dot(x), which uses scipy's sparse matrix-vector multiplication function. Numba has crushed scipy, and I finally have a Python sparse matrix-vector multiplication routine that is as fast as MATLAB's.

  • A cool piece of news. If this works universally on any of the Intel, AMD, ARM, ... architectures, then the code re-design was indeed a brilliant move. If the trick was just to use new possibilities coming from hardware-based extended registers and vectorised-operation instructions that are not present on other processor architectures, the ARM and maybe also the AMD ports will not enjoy the performance you have observed. Anyway, enjoy the new powers available for further expanding your valued research. – user3666197 Dec 15 '17 at 11:51
  • Thank you for pointing me to this! I have forwarded a link to the Numba team for their encouragement. – Michael Grant Dec 16 '17 at 19:42
  • @MichaelGrant Awesome, thank you. One thing I didn't mention in this thread is that in Matlab, in order to get A*x to be evaluated in parallel (when A is a large sparse matrix), there is a weird trick where you have to first form ATrans = A' (a one-time cost), and then do ATrans'*x. Without using this trick, it might seem like Matlab does not parallelize sparse matrix-vector multiplication. – littleO Dec 16 '17 at 19:58
  • @MichaelGrant I have a question for you, if you don't mind. Do you know if Numba provides a way to specify the "chunk size" when using `prange()` to parallelize a `for`-loop? – littleO Dec 18 '17 at 12:44
  • I’m afraid not! – Michael Grant Dec 18 '17 at 15:04
  • Thinking about it more, it makes sense that `A * x` would be slower in MATLAB than `A' * x`. With CSC storage, `A' * x` is much easier to parallelize, because each row gets its own thread. – Michael Grant Dec 18 '17 at 16:34
  • I've tried this in a virtualenv on macOS and your function is 500 times slower :o There must be something wrong on my side... – f10w Sep 30 '19 at 15:38
  • Code: https://github.com/netw0rkf10w/python-utils/blob/master/sparse_matvec_product.py Results: `scipy csr took 0.0012090206146240234 (s)` `matvec_fast took 0.5644559860229492 (s)` cc @MichaelGrant – f10w Sep 30 '19 at 15:47
  • @Khue I just ran your code and observed similar runtimes. BUT, I think this time is including Numba's initial compilation time for the `matvec_fast` function, or something like that. Try running lines 32 - 34 more than once within the benchmark function. When I did that just now, I got the following result: `scipy csr took 0.0020189285278320312 (s)` `matvec_fast took 0.36159999999995307 (s)` `matvec_fast took 0.0006539130000646765 (s)` So I observed that Numba was about 3 times faster than scipy in the experiment you set up, if we don't include the initial compilation cost. – littleO Oct 01 '19 at 01:15
  • @littleO Interesting. I also obtained a ~1.5x speedup over Scipy starting from the second call: 1) scipy: 0.001269 (s), matvec: 0.526181 (s) 2) scipy: 0.001331 (s), matvec: 0.000853 (s) 3) scipy: 0.001678 (s), matvec: 0.000889 (s) 4) scipy: 0.001464 (s), matvec: 0.000927 (s) 5) scipy: 0.001758 (s), matvec: 0.001094 (s) (c.f. the updated code in the repo above). Thanks. – f10w Oct 01 '19 at 14:41
  • @littleO FYI I tried the Ahead-of-Time compilation; Numba becomes a bit (1.4x) slower than Scipy. – f10w Oct 03 '19 at 10:32
  • @Khue Oh interesting. I wonder if that's because of this limitation of ahead of time compilation that I read in the Numba [documentation](https://numba.pydata.org/numba-doc/dev/user/pycc.html): "AOT compilation produces generic code for your CPU’s architectural family (for example “x86-64”), while JIT compilation produces code optimized for your particular CPU model." – littleO Oct 03 '19 at 17:10
  • @littleO Yes that's also what I understood. I ended up not using AOT compilation, but instead adding `cache=True` to avoid compiling each time, which turns out to be the best solution for me. This library works like magic :D – f10w Oct 09 '19 at 08:18
  • @GeoffreyNegiar I was hesitant to accept my own answer and to undo acceptance on a different answer, but you are right. I just made this the accepted answer. – littleO Jun 18 '20 at 19:08

Thanks for your quant updates, Daniel.
The following lines might be hard to swallow, but kindly believe me, there are more things to take into account. I have worked on problems
having matrices at scales of ~ N [TB], N > 10, and their sparse companions, so some pieces of experience may be useful for your further views.

WARNING: Do not expect any dinner to be served for free

A wish to parallelise a piece of code sounds like an ever more frequently re-articulated contemporary mantra. The problem is not the code, but the cost of such a move.

The economy is the number one problem. Amdahl's Law, as it was originally formulated by Gene Amdahl, did not take into account the very costs of [PAR]-processes-setups + [PAR]-processes-finalisations & terminations, which indeed have to be paid in every real-world implementation.

The overhead-strict Amdahl's Law depicts the scale of these un-avoidable adverse effects and helps understand a few new aspects that have to be evaluated before one opts to introduce parallelisation ( at an acceptable cost of doing so, as it is very, indeed VERY EASY to pay MUCH more than one may gain from -- where a naive disappointment from a degraded processing performance is the easier part of the story ).

Feel free to read more posts on the overhead-strict Amdahl's Law re-formulation, if you are willing to better understand this topic and to pre-calculate the actual "minimum"-subProblem-"size", for which the sum of the [PAR]-overheads will at least get justified by real-world tools for introducing the parallel split of the subProblem onto N_truly_[PAR]_processes ( not any "just"-[CONCURRENT], but true-[PARALLEL] -- these two are by far not the same ).
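
A rough sketch of that idea in code ( an illustration under simplifying assumptions -- the overheads are expressed as fractions of the original, purely-[SERIAL] run-time; this is not an exact model ):

def overhead_strict_speedup( p, N, setup_overhead, termination_overhead ):
    # classical Amdahl terms ( 1 - p ) + p / N, plus the add-on [PAR]-setup and
    # [PAR]-termination overheads that every real-world implementation has to pay
    return 1.0 / ( ( 1.0 - p ) + p / N + setup_overhead + termination_overhead )

# e.g. 95% parallelisable work split across N = 20 workers, but 5% of the original
# run-time spent on spawning + joining those workers:
#   overhead-free   Amdahl promise  ~ 10.3 x
#   overhead-strict estimate        ~  6.8 x
print( overhead_strict_speedup( p = 0.95, N = 20, setup_overhead = 0.03, termination_overhead = 0.02 ) )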


Python may get a dose of steroids for increased performance:

Python is a great prototyping eco-system, whereas numba, numpy and other compiled-extensions help a lot to boost performance way farther than a native, GIL-stepped python (co-)-processing typically delivers.

Here, you try to enforce numba.jit() to arrange the job almost-for-free, just by its automated jit()-time lexical-analyser ( that you throw your code on ), which ought both "understand" your global goal ( What to do ), and also propose some vectorisation-tricks ( How best assemble a heap of CPU-instructions for maximum efficiency of such code-execution ).

This sounds easy, but it is not.

Travis Oliphant's team has made immense progress on numba tools, but let's be realistic and fair not to expect any form of automated-wizardry to get implemented inside a .jit()-lexer + code-analysis, when trying to transform a code and assemble a more efficient flow of machine instructions to implement the high-level task's goal.

@guvectorize? Here? Seriously?

Due to the [PSPACE] sizing, you may immediately forget about asking numba to somehow efficiently "stuff" the GPU-engine with data, the memory-footprint of which is way beyond the GPU-GDDR sizes ( not to speak at all about too-"shallow" GPU-kernel sizes for such mathematically-"tiny" processing, which just multiplies, potentially in [PAR], but later sums in [SEQ] ).

(Re-)-Loading GPU with data takes heaps of time. If having paid that, the In-GPU memory-latencies are not very friendly for "tiny"-GPU-kernels economy either -- your GPU-SMX code-execution will have to pay ~ 350-700 [ns] just to fetch a number ( most probably not automatically re-aligned for the best coalesced SM-cache-friendly re-use in next steps and you may notice, that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will not deliver anything under those 350~700 [ns] per matrix cell ), while a smart pure numpy-vectorised code can process matrix-vector product in less than 1 [ns] per cell on even the largest [PSPACE]-footprints.

That is a yardstick to compare to.

( Profiling would better show here the hard-facts, but the principle is well known beforehand, without testing how to move a few TB of data onto GPU-fabric just to realise this on one's own. )


The worst of the bad news:

Given the memory scales of the matrix A, the worst effect to be expected is that the sparse organisation of the matrix storage will most probably devastate most, if not all, of the possible performance gains achievable by numba-vectorised tricks on dense matrix representations: there will likely be almost zero chance for efficient re-use of fetched cache lines, and the sparsity will also break any easy way to achieve a compact mapping of vectorised operations, so these will hardly remain translatable into the advanced CPU-hardware vector-processing resources.


Inventory of solvable problems:

  • always better pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid repeatedly paying additional [PTIME,PSPACE]-costs for creating ( yet again ) new memory allocations ( all the more so if the vector is going to be used inside an externally orchestrated iterative optimisation process ) -- a sketch of this appears below, after the decorator example
  • always better specify ( to narrow the universality, for the sake of resulting code performance )
    at least the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" )-calling interface directives
  • always review all numba.jit()-options available and their respective default values ( they may change from version to version ) for your specific situation ( disabling the GIL and better aligning the goals with numba + hardware capabilities will always help in the numerically intensive parts of the code )

@jit(   [ numba.float32( numba.float32, numba.int32 ),           # a list of prepared calling-signatures, passed as the first
          numba.float64( numba.float64, numba.int64 )            # positional argument, to avoid a deferred lazy-compilation
          ],                                                      # on the first call ( the signatures here are illustrative )
        nopython = False,      # if True, forces the function to be compiled in nopython mode; if that is not possible, compilation raises an error
        nogil    = False,      # if True, tries to release the global interpreter lock inside the compiled function ( only possible in nopython mode, otherwise a compilation warning is printed )
        cache    = False,      # if True, enables a file-based cache ( kept in __pycache__ ) to shorten compilation times when the function was already compiled in a previous invocation
        forceobj = False,      # if True, forces the function to be compiled in object mode ( slower than nopython mode; mostly useful for testing )
        locals   = {}          # a mapping of local variable names to Numba types
        )                      # and test *ALL* called sub-func()-s with @jit() too
def r...(...):
     ...
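
A minimal sketch of the first bullet above ( pre-allocating Ax once and passing it into the jit()-compiled kernel; the explicit signature is illustrative only and its dtypes must match the arrays actually passed in ):

import numpy as np
from numba import jit

@jit( "void( f8[::1], f8[::1], f8[::1], i4[::1], i4[::1] )",
      nopython = True, nogil = True )
def csrMult_prealloc( Ax, x, Adata, Aindices, Aindptr ):
    # writes the result into the caller-supplied Ax, so no new allocation
    # happens inside the hot loop of an outer iterative solver
    for i in range( Ax.shape[0] ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):
            Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]]
        Ax[i] = Ax_i

# one-time allocation, then re-used on every call
# ( assuming the CSR arrays are float64 / int32, to match the signature above ):
# Ax = np.zeros( A.shape[0] )
# csrMult_prealloc( Ax, x, A.data, A.indices, A.indptr )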
  • I don't think specifying the signature is good advice: it prevents optimizations based on the contiguousness of the data (sometimes resulting in noticeably degraded performance). Also I'm not sure why you mention the GPU here; nothing in the question mentions a GPU. – MSeifert Oct 25 '17 at 11:44
  • But I like the part about the cost of parallel processing, especially the often ignored part that "it is very, indeed VERY EASY to pay MUCH more than one may gain from"! – MSeifert Oct 25 '17 at 11:45
  • Ad GPU) actually it was mentioned in comments above to try **`numba @guvectorize`** tool, so I added a few remarks on hidden extreme costs of ( *also indeed VERY OFTEN mis-used* ) GPU-latency-masking-SMX toys for this sort of problems. GPU can help for "mathematically"-large-GPU-kernels operating on very-compact-&-small-data-region + having minimum, best none, SIMT-synchronisation, but not for anything else. Parallelisation AT ANY COSTS is so, so often these days. **"Ó Tempóra, ó Mórés ..."** :o) – user3666197 Oct 25 '17 at 11:52
  • Thanks for this detailed answer. One thing to keep in mind is that I wrote a very similar csrMult function in C++, where it was trivial to parallelize the for loop (because C++ supports openMP natively), and by parallelizing the for loop I observed a 6x or 7x speedup, using the same matrix. I would expect a similar speedup here. In any case, I think it should at least be possible to parallelize my for loop using `prange()` without having the code crash. In C++, I only needed to write `#pragma omp parallel for` above the for loop to make the loop execute in parallel. – littleO Oct 25 '17 at 11:54
  • **Ad SIGNATURE(s)**) well, I prefer to avoid `jit()`-code to have multiple code-execution paths just due to having to spend many CPU-clocks on analysing the actual set of parameters once being called after `jit()`-ed. This is a part of a performance, that is lost in cases, where you know a priori what constant data-types are going to be passed. Universality is not a benefit here. **Next - data contiguousness** - it is lost even before we start to think about the best approach, as the O/P has stated he uses sparse-representations ( also was discussed above ). – user3666197 Oct 25 '17 at 11:57
  • Well, you have spent some time in academia / quantitative records of evidence for large scale ( convex ) optimisation problems -- so let me just remark, that even the automation coming from "just" a *need to write* **`#pragma omp parallel for` would hardly break the records as the performance-hurdle is in the cache-line in-efficiency coming from the sparse-representation of data.** This is where the HPC fight for ultimate performance starts. – user3666197 Oct 25 '17 at 12:03
  • @littleO Nota bene: could you confirm / clarify -- the C++ code has gained ~6-7x speedups: **[A]** on how many CPU cores, having how large L1 / L2 / L3 caches **`[MB]`**? **[B]** How long, in **`[ns]`**, did the pure-`[SEQ]` code-execution take ( with `/* #pragma omp ... */`, i.e. the pragma commented out )? **[C]** How long, in **`[ns]`**, did the `omp parallel for`-decorated code-execution take? – user3666197 Oct 25 '17 at 12:08
  • @user3666197 The number of CPU cores was 20 (40 logical cores). I'll check on the other stuff. – littleO Oct 25 '17 at 12:12
  • @user3666197 I just tested the C++ code again, using an A matrix with 47166 rows, 113954 columns, and 381,190,916 nonzero float64 entries. The serial version of `csrMult` took 1.432 seconds to compute A*x. The version using a parallel for loop took .1712 seconds. So in that case the parallel version was about 8x faster. I measured the time for a single function evaluation by using std::chrono::system_clock::now(); before and after the function call. – littleO Oct 25 '17 at 12:33
  • @littleO thanks for your prompt re-tests, showing a yield of a **speedup ~ 8.36 x** on 20 physical CPU-cores. **FYI:** there are systems that can deliver speedups better than ~ 17.5 x on the same 20 physical CPU-cores. Posted in order to show how much it is common to lose in overheads. – user3666197 Oct 25 '17 at 12:38
  • @user3666197 I appreciate your answer very much, but I still do not know how to parallelize my simple embarrassingly parallel python for loop without causing the code to crash, whereas it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup. I upvoted your answer, but I haven't accepted an answer yet because I've still been hoping someone will post an answer that shows me the correct way to parallelize my for loop when using Numba. It seems like it should be possible to parallelize my for loop using Numba's `prange` function. – littleO Nov 01 '17 at 20:02
  • **Given** the above posted update, you should realise that in the C-lang **you had manually expressed**, by a use of imperative language constructs and data layouts, **how the processing ought to take place**, and the only step left to the scheduler was how to harness a pool of omp-threads, again **explicitly instantiated by you manually**, to serve the code-execution, **whereas** your Numba beliefs expect the Just-In-Time compilation analyser to both understand your intent, set up code for entries/conversions for all possible combinations of calling parameters, and also best optimise a parallel run. – user3666197 Nov 01 '17 at 21:41
  • While it is possible to believe in some sort of similar magic, **it is not realistic to expect a Finite-State-Automaton to invent an HPC thing on par with a C-lang code**. If your primary target is performance, **yes, there are tools to get the ultimate performance from the silicon**, but these gains will come from a very different vector of attack than ( just ) an 8.36x speedup from 20 C-lang omp-threads on 20 phy-CPU-cores, the less from a JIT-compiler tasked to generate JIT-LLVM-code for a function with a generic call-signature. Sorry, such a belief would be a promise with a zero chance to deliver. – user3666197 Nov 01 '17 at 21:47
  • @user3666197 You might be interested in checking out the answer I just posted based on a recent Numba update. – littleO Dec 15 '17 at 10:27
  • If I am reading this correctly, there seems to be a mistaken assumption that guvectorize decorators imply GPU computation, but this is not correct. Indeed I use such constructs all the time on CPU targets. – Michael Grant Dec 16 '17 at 19:47