
I am trying to speed up a sparse matrix-vector product using OpenMP; the code is as follows:

void zAx(double *z, double *data, long *colind, long *row_ptr, double *x, int M){

  long i, j, ckey;
  int chunk = 1000;
  //int * counts[8]={0}
  #pragma omp parallel num_threads(8)
  {
    #pragma omp for private(ckey,j,i) schedule(static,chunk)
    for (i=0; i<M; i++){
      z[i] = 0;
      for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
        j = colind[ckey];
        z[i] += data[ckey]*x[j];
      }
    }
  }
}

Now, this code runs fine and produces the correct result; however, it only gives me a speedup of ~30%. I have checked that the threads are all getting about the same number of non-zero elements (they are), and the matrix is fairly large (300,000 x 300,000), so I'm hoping overhead isn't the only issue. I've also tried running with different chunk sizes and thread counts, and I get similar performance.

Is there something else I could try to get a bit of extra speed out of this? Or something I'm obviously doing wrong?

Cheers.

Edit: Just commented out '//int * counts[8]={0}' because it was left over from counting the work allocation; it's not needed.

Edit2 (more details):

Okay, so I timed a loop calling this 5000 times and got the average times:

  • seq: 0.0036 (seconds?)
  • 2 threads: 0.002613
  • 4 threads: 0.002308
  • 8 threads: 0.002384

The matrix is of size: 303544x303544 and has: 2122980 non-zero elements.

With a much smaller matrix, 30000x30000, I get times more like:

  • seq 0.000303
  • 8 threads 0.000078

So it seems the large size may be my issue.

GeorgeWilson
  • How many cores does the machine you are running on have? And how large is your M? – dreamcrash Nov 29 '12 at 23:30
  • 4 cores (multi-threaded, I've checked omp_max_thread_num), 8gb memory I think. Looking at 'top' the process uses ~700% cpu and 20% ram. – GeorgeWilson Nov 29 '12 at 23:43
  • Thanks. There is a little bit of sensitive code I can't copy, but I'll make a stripped down example and get back to you. – GeorgeWilson Nov 30 '12 at 00:16
  • Can you tell me how much time it takes with sequential, 2 threads, 4 threads and 8 threads? – dreamcrash Nov 30 '12 at 01:04
  • I've updated the post with some of this information - sorry about the formatting. Seems the large matrix size might be what is holding me back. – GeorgeWilson Nov 30 '12 at 03:06
  • The next step, then, is to try to figure out whether it's caching or something else. I'm not sure how you load and pre-process the matrix, but I think you should profile your code and find out which part is "stalling" the pipeline. Possibly you might need to parallelise the pre-processing step too. If you have a pre-processor (say, one that decompresses the matrix) and all the threads have to wait on it, that will slow things down significantly. You could, of course, pre-process in parallel... In any case, you need to determine where your program is spending most of its time. – jsmdnq Nov 30 '12 at 17:51
  • The matrix is actually generated earlier in the program (this cannot be avoided), and that is done in parallel. The timing I am reporting is only for the function I've put in the question. But I should do some more detailed profiling I think you're right. (Sorry for the slow reply, weekend) – GeorgeWilson Dec 02 '12 at 22:47

2 Answers

14

Welcome to the wonderful world of memory-bound problems. To relieve you from your pain, I would like to inform you that sparse matrix-vector multiplication is one of the many things that cannot be effectively parallelised, or even vectorised, on a single multi-core chip, unless all the data fits in the last-level cache or the memory bus is really, really wide.

Why? Simply because the ratio of computation to memory access is extremely low. For each iteration of the inner loop you fetch the column index into j (8 bytes), the matrix element from data (8 bytes), the value of the vector element (8 bytes), and the previous value of the result (8 bytes, since compilers rarely optimise access to shared variables). Then you perform 2 very fast floating-point operations (FLOPs) and a store (although the += operator is translated to a single instruction, it is still a "fetch-modify-write" one). In total you load 32 bytes and perform 2 FLOPs over them. That makes 1/16 FLOPs per byte.

A modern SSE-capable CPU core can perform 4 double-precision FLOPs/cycle, which often results in something like 8 GFLOPS per CPU core (assuming a base frequency of 2 GHz). With AVX that number doubles, so you get up to 16 GFLOPS per core on a 2 GHz Intel Sandy/Ivy Bridge or AMD equivalent. In order to saturate this processing power with data, given the 1/16 FLOPs/byte, you would need at least 128 GiB/s of memory bandwidth.

A high-end Nehalem-EX processor like the Xeon X7560 runs at 2.26 GHz (9.04 GFLOPS/core) and its shared L3 cache (the L1 and L2 caches are per-core) delivers about 275 GiB/s. At 9.04 GFLOPS/core, you need 144.64 GiB/s per core in order to feed the inner loop of your zAx routine. This means that in the ideal case, the L3 cache of this CPU cannot feed more than 2 fully vectorised multiplication kernels.

Without SSE vectorisation the FLOP rate is half as high for double precision, so one could expect the problem to scale up to 4 threads. Things get extremely bad once your problem becomes larger than the L3 cache, as the memory bus delivers about ten times less bandwidth.

Try the following version of the inner loop just to see if the compiler is smart enough to follow the relaxed memory view of OpenMP:

#pragma omp for private(ckey,j) schedule(static,chunk)
for (i=0; i<M; i++){
  double zi = 0.0;
  for (ckey=row_ptr[i]; ckey<row_ptr[i+1]; ckey++) {
    j = colind[ckey];
    zi += data[ckey]*x[j];
  }
  z[i] = zi;
}

Unfortunately, there is nothing more you can do. Sparse matrix-vector multiplication scales with the number of CPU sockets, not with the number of CPU cores. You would need a multi-socket system with separate memory controllers, e.g. any system with more than one (post-)Nehalem or AMD64 processor.

Edit: optimisation hint. Do you really need long to store the column index and the row pointers? With 2122980 non-zero elements you would be pretty fine using int instead. It would save loading 4 bytes per element in the inner loop and another 4 bytes per row in the outer loop.

Hristo Iliev
  • For this type of problem, would it not be preferable to perform an MPI parallelisation, since it would somewhat relieve the memory-bound issues? – dreamcrash Nov 30 '12 at 14:26
  • @dreamcrash, MPI would not help if you only have a single-chip multi-core system like the OP does. SMVM simply benefits from having more chips with more memory controllers - it doesn't really matter if you program them with OpenMP, MPI or something like OpenMP + virtual SMP. – Hristo Iliev Nov 30 '12 at 15:15
  • I forgot to mention using MPI on a cluster. – dreamcrash Nov 30 '12 at 15:16
  • Thanks Hristo - that is a very helpful exposition on the problem. I actually tried your modified inner loop earlier on (with the same hope) and saw no speed up. Will try switching away from longs (they're a leftover from the fact that python seems to use longs instead of ints, and this is in an extension module). – GeorgeWilson Dec 02 '12 at 22:53
  • On the topic of MPI - Could you provide any references for how in particular sparse matrix-vector multiplication might best be performed if I were to have access to a cluster? I would prefer not to move to this, but it could be an option – GeorgeWilson Dec 02 '12 at 22:54
    @thebigdog, with MPI you do it the same way as with OpenMP - distribute the rows of the matrix to the processes in the MPI job. The kernel remains essentially the same as the one in your OpenMP code. Then collect the pieces of the result vector with `MPI_Gather[v]`. – Hristo Iliev Dec 03 '12 at 06:53
  • @HristoIliev: Okay thanks, I'll give that a go if I get a chance. – GeorgeWilson Dec 03 '12 at 08:24
3

I can't fit this in a comment, so I'll write it as an answer. I think this is the problem, but I'm not 100% sure.

Shared variables across threads can cause issues. I do not think that is the problem here, but it might be. Usually it only matters when writing: if there are no locks, it will just result in corrupt data. I'm not sure whether OpenMP does any locking internally.

Chances are your threads are stalling due to locks, which is the only reason multi-threading would run much slower in proportion to a single thread. That, or it is not your code at all. It is best if you test it on a small data set that is in memory, with no potential for bottlenecks (so all you're doing is processing the data, and you time only the zAx function).

0.3M^2 = 90B elements. That means you are definitely going to have issues with either paging or loading the files. (And if you are using ints, that's 4 times the size.)

A better approach might be to operate on X amount of the matrix while the disk loads Y amount in parallel. By choosing X and Y correctly you will not have much reduction in speed. If you are loading 8GB, processing, then loading 8GB more you have to wait each time to load the data.

You can make the processing intelligent by choosing X and Y = (8GB - X) and monitoring how long the processing and the file loading sit idle.

To check and see if the disk access is the problem try using a smaller dataset and time only zAx and see if that helps. If it does then it's the disk. If not, it's the code.

jsmdnq
  • Thanks for your input. I'll give your suggestions a go, but I don't think that size is as big an issue as you suspect. The matrix is stored as a CSR sparse matrix, and is mostly zero. Also, I don't think I am increasing row_ptr? It being accessed almost at random, may be a problem however. – GeorgeWilson Nov 30 '12 at 00:48
  • I did like your answer, except for this: "You are incrementing row_ptr which is a shared value in memory by all threads". It is not being incremented, and row_ptr[i+1] is only read, not written, so there's no problem there. – dreamcrash Nov 30 '12 at 01:07
  • @dreamcrash right, not sure what I was seeing ;/ (tired I guess ;) – jsmdnq Nov 30 '12 at 01:11
  • @thebigdog Yeah, my mistake, blurry vision when tired and head hurts ;/ Ok, if it's relatively sparse, I imagine you simply expand it from a memory buffer into the values currently being used? If so, then maybe that is also a problem. What I would do if I were you is try to find the bottleneck: check your function itself on a smaller in-memory matrix (a few gigs) and compare it on 1 and 10 threads. The 1-thread run should be much slower (at least a factor of 2). If not, then it's the code. If it is much faster, then your code should be fine and the issue is dealing with the data. – jsmdnq Nov 30 '12 at 01:14
  • I've done similar things but on smaller sets and I've always gotten a speedup of several factors. Your function looks relatively simple so chances are it is something external to it. All I can really say is you should do some testing. Write a test program that is very simple that will only profile the function itself(and nothing else). If you have issues with that code can then use it to determine what is causing the problem. – jsmdnq Nov 30 '12 at 01:21
  • I think you're right jsmdnq - As I've said in the updated question, a much smaller matrix does give more reasonable speed ups (although still a bit less than I hoped). – GeorgeWilson Nov 30 '12 at 03:07
  • Your answer is correct in general, but unfortunately does not touch the specifics of the sparse matrix-vector multiplication problem from the question. People often underestimate the memory requirements of their codes and/or overestimate the memory capabilities of their machines... – Hristo Iliev Nov 30 '12 at 12:59
  • @HristoIliev Initially I did not know it was a sparse matrix(I was tired and probably overlooked a lot of stuff). In any case I believe there are a lot of mathematical techniques to deal with sparse matrix calculations. Probably a good way to check the performance is to do the same thing in matlab. I imagine, if the matrix is sparse enough, a divide and conquer technique can be used. Dividing the matrix up into blocks and using something like http://en.wikipedia.org/wiki/Block_matrix. – jsmdnq Nov 30 '12 at 13:50
  • No, the problem is much more fundamental - see my answer. Sparse matrix-vector multiplication is slow (relative to the compute capabilities of modern CPUs), and after some time in the HPC business one simply learns to live with that fact. Of course, it is only relatively slow, because one also has to weigh in its benefits, like the massively reduced matrix storage requirements. – Hristo Iliev Nov 30 '12 at 15:12
  • @HristoIliev I think your terminology is a bit off. You say a sparse matrix is slow YET a sparse matrix is a matrix and hence can't be any slower than general matrix multiplication, which is the slowest of them all. Sparse matrices, by their nature, have certain properties that may allow optimization. http://en.wikipedia.org/wiki/Sparse_matrix under Solving sparse matrix equations. (i.e., there are dedicated solutions to sparse matrices which should increase the solution speed) – jsmdnq Nov 30 '12 at 17:45
  • Also, block multiplication is much faster than multiplication of the corresponding full matrices. The issue, then, is whether the sparse matrix can be codified as a matrix of blocks quicker than carrying out the individual multiplications. Obviously, any matrix that is very large will run into memory issues, sparse or not. (But again, sparse matrices are special in certain ways, and the goal is to use those characteristics effectively to do the multiplication.) – jsmdnq Nov 30 '12 at 17:48