
For data simulation purposes I'm looking for an efficient way to compute a weighted sum of sparse matrices. Basically, I have a data cube of Nx x Ny x Nz double values, where Nx and Ny are on the order of 4000 and Nz is several million. All Nx x Ny sub-matrices are very sparse (a block of data on the order of 40 entries each). I now want to reduce the data cube in the Z direction by adding up all matrices with weights, i.e. image(x,y) = sum_z w(z) * M_z(x,y); the process ("weighted sum") is illustrated in the figure. For my simulations all matrices stay fixed and only the weights change, generating different Nx x Ny datasets.

This is what I tried: a naive sparse-matrix implementation in C++ and a simple weighted sum.

sparsearray3d.h

#ifndef SPARSEARRAY3D_H
#define SPARSEARRAY3D_H

#include <vector>

struct data{
    unsigned short int x;
    unsigned short int y;
    int z;
    double value;
};

class sparsearray3d
{
public:
    sparsearray3d();
    void createRandomData(int Nx, int Ny, int Nz, int bwidthX = 5, int bwidthY = 5);
    void sumData();
    int Nx,Ny,Nz;

    std::vector<data> dd;
    std::vector<std::vector<double> > image;
    std::vector<double> weights;
};

#endif // SPARSEARRAY3D_H

sparsearray3d.cpp

#include "sparsearray3d.h"
#include <stdlib.h>     /* srand, rand */
#include <stdio.h>      /* printf, scanf, puts, NULL */


sparsearray3d::sparsearray3d()
{
    this->Nx = 0;
    this->Ny = 0;
    this->Nz = 0;
}

// default arguments belong in the header declaration, not here
void sparsearray3d::createRandomData(int Nx, int Ny, int Nz, int bwidthX, int bwidthY)
{
    // create random data
    // reserve, not resize: entries are appended with push_back below.
    // resize followed by push_back would leave Nz leading zeros, so every
    // weights[z] lookup in sumData() would read 0.
    this->weights.reserve(Nz);

    this->image.resize( Nx , std::vector<double>( Ny , 0. ) );

    this->Nx = Nx;
    this->Ny = Ny;
    this->Nz = Nz;

    for(int i=0; i<Nz; ++i)
    {
        int x0 = rand() % (Nx-bwidthX);
        int y0 = rand() % (Ny-bwidthY);

        this->weights.push_back((double) rand() / (RAND_MAX));

        for(int j=0; j<bwidthX; ++j)
        {
            for(int k=0; k<bwidthY; ++k)
            {
                // explicit casts: braced initialization forbids the implicit
                // narrowing of int to unsigned short
                this->dd.push_back({(unsigned short)(x0 + j),
                                    (unsigned short)(y0 + k),
                                    i, (double) rand() / RAND_MAX});
            }
        }
    }
    printf("Vector size: %4.2f GB \n", this->dd.size()*sizeof(data) * 1E-9);

}

void sparsearray3d::sumData()
{
    #pragma omp parallel for
    for (std::size_t i = 0; i < this->dd.size(); ++i)
    {
        const data &d = this->dd[i];
        // different z-slices overlap in (x, y), so without synchronization
        // the += below is a data race between threads
        #pragma omp atomic
        this->image[d.x][d.y] += d.value * this->weights[d.z];
    }
}

main.cpp

#include <iostream>
#include "sparsearray3d.h"
#include <sys/time.h>

using namespace std;

int main()
{
    struct timeval start, end;

    sparsearray3d sa;
    gettimeofday(&start, NULL);
    sa.createRandomData(4096, 4096, 2000000, 4, 16);
    gettimeofday(&end, NULL);
    double delta = ((end.tv_sec - start.tv_sec) * 1000000u +
                    end.tv_usec - start.tv_usec) / 1.e6;

    cout << "random array generation: " << delta << endl;
    gettimeofday(&start, NULL);
    sa.sumData();
    gettimeofday(&end, NULL);
    delta = ((end.tv_sec - start.tv_sec) * 1000000u +
             end.tv_usec - start.tv_usec) / 1.e6;
    cout << "array addition: " << delta << endl;
    return 0;
}

This already does a good job; the example above runs in ~0.6 s here. The first thing I'm wondering about is why the #pragma omp parallel for gives a speed-up of only about a factor of 2, although 4 CPUs are used.
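One likely contributor to the limited scaling is that the atomic update in sumData() serializes contended cells and makes threads fight over shared cache lines. A common alternative is to give each thread a private accumulation buffer and merge once at the end. Below is a minimal sketch of that idea (my own illustration, not from the original code: the method name sumDataPrivate is hypothetical and would need a declaration in the header; it also costs one Nx*Ny buffer per thread, ~134 MB for 4096x4096 doubles):

void sparsearray3d::sumDataPrivate()
{
    #pragma omp parallel
    {
        // private, flat accumulation buffer (Nx*Ny doubles per thread)
        std::vector<double> local((std::size_t)Nx * Ny, 0.0);

        // no atomics needed: each thread writes only its own buffer
        #pragma omp for nowait
        for (std::size_t i = 0; i < dd.size(); ++i)
            local[(std::size_t)dd[i].x * Ny + dd[i].y] +=
                dd[i].value * weights[dd[i].z];

        // serialize only the merge, not the accumulation
        #pragma omp critical
        for (int x = 0; x < Nx; ++x)
            for (int y = 0; y < Ny; ++y)
                image[x][y] += local[(std::size_t)x * Ny + y];
    }
}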

The problem seems pretty well suited for massive parallelization. Could CUDA/OpenCL help here? However, I read somewhere that matrix addition is not very efficient with CUDA/OpenCL. (I don't have an NVIDIA card available, though.) Alternatively, I read a little about graphs and their relation to matrices. Could this problem be treated with some graph algorithms?

EDIT: I tried to give Eigen a shot; however, I failed to create a large number of matrices. The following code requires much more memory than my code does (and fails for N ~ 20000000, as I run out of memory). Not sure I'm doing it right, but that's how I understood it from the Eigen documentation.

#include <vector>
#include <eigen3/Eigen/Sparse>

int main()
{
    int N = 100000;
    std::vector<Eigen::SparseMatrix<double> > data;

    data.resize(N);

    for (int i = 0; i < N; ++i)
    {
        data[i].resize(4096, 4096);
        data[i].reserve(4 * 16);
    }

    return 0;
}

Also, summing up the sparse matrices in the following way was much slower than my code:

Eigen::SparseMatrix<double> sum(4096, 4096);
sum.reserve(4096 * 4096);
for (int i = 0; i < N; ++i)
    sum += data[i];
Julian S.
    Why are you not using a library (maybe Boost, Eigen, ...)? –  Jun 11 '15 at 17:46
  • see edit (Eigen eats up my memory and seems to be slow, at least in the (most probably wrong) way I use it) – Julian S. Jun 12 '15 at 03:28
  • 2
    Maybe you can create matrix of lists of items? Every element of these matrix is just list (or array or something) which contains i,j item for every sparse matrix. These approach allow you to do sum of each item independent from other and computation on GPU may give you nice acceleration. Also you resize dd "this->dd.resize" and then added new elements. Are sure? PS Sorry for my poor English – Daiver Jun 12 '15 at 05:47
  • Thanks, I corrected the code. I didn't use push_back in the first place, which is why I still had the dd.resize in there. I'll give your approach a try. – Julian S. Jun 13 '15 at 14:47
  • In addition to my general and intentionally abstract answer, you may take a look at one (among many similar) nice response by Z Boson here: http://stackoverflow.com/questions/30838959/parallelizing-matrix-multiplication-through-threading-and-simd . Note that I sort of "don't recommend" doing SIMDization via intrinsics (because of portability), and the post isn't really about your problem statement, but it is still a nice way to think about two-level parallelism and how to deal with the memory wall. – zam Jun 15 '15 at 20:34

1 Answer


What you are dealing with is a relatively general case of linear-algebra matrix-vector multiplication (relevant computer-science abbreviations to search for are "DGEMM" or "SpMV"). So your first option would be to try highly optimized parallel linear algebra libraries, like Intel MKL; see also: Parallel linear algebra for multicore system
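To make the SpMV framing concrete, here is a minimal sketch of my own (assuming Eigen, as in the question's EDIT; the function name weightedSum and the entry layout are illustrative, not from the original code). Each z-slice is flattened into one column of a single (Nx*Ny) x Nz sparse matrix, so that every new weighting costs just one sparse matrix-vector product:

#include <vector>
#include <eigen3/Eigen/Sparse>

// Each nonzero (x, y, z, value) becomes a triplet (row = x*Ny + y, col = z, value).
Eigen::VectorXd weightedSum(int Nx, int Ny, int Nz,
                            const std::vector<Eigen::Triplet<double> > &entries,
                            const Eigen::VectorXd &w)   // w.size() == Nz
{
    Eigen::SparseMatrix<double> A(Nx * Ny, Nz);         // one column per slice
    A.setFromTriplets(entries.begin(), entries.end());
    return A * w;   // the reduced image, flattened to length Nx*Ny
}

Because the matrices stay fixed and only the weights change, A would be built once and reused; every new weight vector is then a single multiply.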

Secondly, if you want to optimize and parallelize the algorithm yourself, you may first want to study some relevant articles, for example:

  • http://cscads.rice.edu/publications/pdfs/Williams-OptMultiCore-SC07.pdf
  • http://pcl.intel-research.net/publications/ics26-liuPS.pdf

Third, a few things to consider if you don't want (or don't have time) to go through books, articles, or libraries, and want to experiment with everything from scratch yourself:

  1. What you have is actually fine-grained parallelism (every iteration is too fast == too small), so the normalized "per-iteration" thread-scheduling overhead can be relatively difficult to amortize.
  2. For fine-grained parallelism it's better to try SIMD parallelization (especially since you have potential for FMA == fused multiply-add here). In OpenMP 4.x (supported by ICC and GCC 4.9+) you have #pragma omp simd, which works pretty well for SIMD reductions.
  3. But the right choice would actually be to simplify/reverse-engineer your example into an explicit loop nest, with a for loop over x (#pragma omp for) and a nested loop over y (#pragma omp simd); see the sketch after this list. Then you will have 2 levels of parallelism, which will put you in better shape.
  4. Once you are successful with all of the above, you will quickly end up bounded either by memory cache/latency, by memory bandwidth, or by irregular access patterns: the three faces of the memory wall. In your current implementation I would expect you to be bandwidth-bound; however, really optimized sparse matrix-matrix products tend to be bounded by various combinations of these three faces. This is where reading some articles becomes necessary, and the topics "loop blocking", "structure of arrays", and "prefetching and streaming" become very relevant.
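To illustrate items 2 and 3, here is a hedged sketch of the explicit loop nest (my own illustration; the CSR-like layout and all names below are hypothetical, not from the question's code). Entries are regrouped per (x, y) cell, the outer x loop is threaded, and the innermost accumulation is a SIMD-friendly FMA reduction (here the simd pragma sits on the innermost loop, where it applies most naturally in this layout):

#include <cstddef>
#include <vector>

// Hypothetical CSR-like layout: for each flattened cell x*Ny + y,
// cellStart[cell] .. cellStart[cell+1] indexes its nonzeros in vals/zIdx.
void sumDataTwoLevel(int Nx, int Ny,
                     const std::vector<int> &cellStart,   // size Nx*Ny + 1
                     const std::vector<double> &vals,     // nonzero values
                     const std::vector<int> &zIdx,        // weight index per value
                     const std::vector<double> &weights,  // size Nz
                     std::vector<double> &image)          // size Nx*Ny, zeroed
{
    #pragma omp parallel for                // thread-level parallelism over x
    for (int x = 0; x < Nx; ++x)
    {
        for (int y = 0; y < Ny; ++y)
        {
            const int cell = x * Ny + y;
            double acc = 0.0;
            // SIMD-level parallelism: a vectorizable FMA reduction
            #pragma omp simd reduction(+:acc)
            for (int k = cellStart[cell]; k < cellStart[cell + 1]; ++k)
                acc += vals[k] * weights[zIdx[k]];
            image[cell] = acc;
        }
    }
}

Because each (x, y) cell is owned by exactly one thread, no atomics are needed, unlike the triplet-list loop in the question.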

Finally (maybe this should be the first item), I don't think you really have "sparse matrix"-aware data structures here, because you didn't really try to avoid "zeros" (they still sit in your matrix-vector). I don't know if you have already adapted your data structures for this (so perhaps your example just simplifies your real code).

Parallel linear algebra is a huge topic in computer science; every year experts produce new results, and every new generation of CPUs (and sometimes co-processors) is better optimized to speed up parallelized linear algebra kernels. So there is plenty to explore.

zam