
I have written a function for matrix multiplication where matrices are defined using vectors of vectors containing double values.

#include <vector>
using std::vector;

vector<vector<double>> mat_mul(vector<vector<double>> A, vector<vector<double>> B){
    vector<vector<double>> result(A.size(), vector<double>(B[0].size(), 0));
    if(A[0].size() == B.size()){
        const int N = A[0].size();
        for(int i = 0; i < A.size(); i++){
            for(int j = 0; j < B[0].size(); j++){
                for(int k = 0; k < N; k++)
                    result[i][j] += A[i][k] * B[k][j];
            }
        }
    }
    return result;
}

The code works, but it is very slow for matrices as large as 400x400. How can we speed this up? I have seen other answers, but they discuss matrix multiplication in general, not any speed-up specifically for vectors of vectors.

Any help is highly appreciated.

Avii
  • You could take the inputs by reference so that they don't have to be copied on every call. – gerum Apr 24 '22 at 08:04
  • Don't use vectors of vectors; they are a poor representation for multidimensional data. Rather, use a single internal class-member vector and apply the appropriate index arithmetic (row and column offsets) into the contiguous array of elements. – πάντα ῥεῖ Apr 24 '22 at 08:10
  • Try using Strassen's method. Much faster than the naive approach. – Naseef Chowdhury Apr 24 '22 at 08:44
  • Please read similar past questions and [this tutorial](https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0). As previously mentioned, use flattened arrays. Please use **BLAS** libraries unless your goal is only to learn how it works (or it is homework). There is no chance you can beat the fastest BLAS implementations (e.g. BLIS, OpenBLAS, GotoBLAS, MKL, etc.). – Jérôme Richard Apr 24 '22 at 09:02
  • @JérômeRichard Can the libraries be used with vectors of vectors? Storing such a big 400x400 double array was not possible in a plain 2D array, which is why I had to use vectors of vectors. – Avii Apr 24 '22 at 09:15
  • @πάνταῥεῖ Do you have any references or articles showing this? I will try to implement it; I don't have any idea about this and have only recently started coding in C++. – Avii Apr 24 '22 at 09:20
  • @Avii No, BLAS libraries generally do not provide this (they assume a fixed stride on a contiguous flattened array). They do not implement it because it is an inefficient data structure. This is not a problem though: if you *really need* a `vector<vector<double>>` (which is probably not the case), then copy its content to a flattened array and use a BLAS `dgemm` call. The copy will take negligible time compared to the matrix multiplication on big matrices (but not on very small ones). For small matrices, the only way to get fast code is to use a flattened array. – Jérôme Richard Apr 24 '22 at 09:30
  • @Avii By the way, if you need a matrix object, then please do not reinvent the wheel: use libraries like Eigen that do the job very well. Eigen also enables you to do matrix multiplications efficiently. AFAIR Eigen is header-only, so it is easy to integrate into existing projects (see the second sketch after these comments). – Jérôme Richard Apr 24 '22 at 09:34
  • @JérômeRichard Thanks a lot for your kind suggestions! There is no compulsion to use a matrix class, vectors of vectors, or 2D arrays; my prime goal is to store a 400x400 matrix and multiply it with another 400x400 matrix. By "flatten the array", did you mean to store 401^2 elements in a 1D array and use `dgemm`? After the `dgemm` call, do I need to unflatten it back into the 400x400 vector-of-vectors structure, or do you have a better way to store such big matrices? Execution time is a big factor for me. – Avii Apr 24 '22 at 10:09
  • Naseef, Strassen gives you a factor of two if you are lucky. Blocking = factor 5, latency removal = factor 4, FMA = factor 2, small vectors = factor 4, multithreading = factor 2 to 8. – gnasher729 Apr 24 '22 at 10:28
  • @Avii Yes, you need to flatten the two input matrices into two arrays of size 400*400, then call `dgemm`, and then unflatten the resulting array (a sketch of this follows the comments). For 400x400, doing 3 copies is not great, but I am pretty sure this is much better than your current code (and simpler than trying to do all the optimizations BLAS does). – Jérôme Richard Apr 24 '22 at 10:29
  • Jerome, depending on the processor details, converting into a 400x401 or other-size array can be faster, without any code change. Powers of two are often bad. – gnasher729 Apr 24 '22 at 10:33
  • @gnasher729 Good point. Cache thrashing could make the operation a bit slower with 400 instead of 401. That being said, the operation should be compute-bound and the matrix should completely fit in the L3 cache, so the effect should be relatively small (it even fits in the L2 on new Intel processors). Still, it is certainly worth trying. Note that the L3 mitigates cache thrashing anyway thanks to hashing methods on most architectures (see [this related post](https://stackoverflow.com/questions/71818324)). – Jérôme Richard Apr 24 '22 at 11:03
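
For reference, here is a minimal sketch of the flatten-then-`dgemm` approach described in the comments. It assumes a CBLAS implementation such as OpenBLAS is installed and linked (e.g. `-lopenblas`); the function name `mat_mul_blas` is made up for illustration.

#include <cblas.h>
#include <vector>

using std::vector;

vector<vector<double>> mat_mul_blas(const vector<vector<double>>& A,
                                    const vector<vector<double>>& B) {
    const int m = (int)A.size(), k = (int)A[0].size(), n = (int)B[0].size();

    // Flatten both inputs into contiguous row-major arrays.
    vector<double> a(m * k), b(k * n), c(m * n, 0.0);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            a[i * k + j] = A[i][j];
    for (int i = 0; i < k; ++i)
        for (int j = 0; j < n; ++j)
            b[i * n + j] = B[i][j];

    // c = 1.0 * a * b + 0.0 * c
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, a.data(), k, b.data(), n, 0.0, c.data(), n);

    // Unflatten the result back into a vector of vectors.
    vector<vector<double>> C(m, vector<double>(n));
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            C[i][j] = c[i * n + j];
    return C;
}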
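
And a sketch of the Eigen route suggested above, assuming the header-only Eigen 3 headers are on the include path (e.g. `g++ -O3 -I /path/to/eigen`); the 400x400 sizes are taken from the question.

#include <Eigen/Dense>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(400, 400);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(400, 400);
    Eigen::MatrixXd C = A * B;  // dispatches to Eigen's optimized GEMM kernel
    return 0;
}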

2 Answers

#include <cstddef>
#include <tuple>    // for the usage example below
#include <utility>
#include <vector>

using std::vector;

// Matrix stored in one contiguous vector<double>; elements are addressed
// by a (column, row) pair.
struct matrix : vector<double>
{
    using base = vector<double>;
    using size_type = std::pair<std::size_t, std::size_t>;

    void resize(size_type const& dims, double const v = 0) {
        rows = dims.second;
        base::resize(dims.first * rows, v);  // forward the fill value
    }
    size_type size() const { return { cols(), rows }; }

    double& operator[](size_type const& idx) {
        base& vec = *this;
        return vec[idx.first + idx.second * cols()];
    }
    double operator[](size_type const& idx) const {
        base const& vec = *this;
        return vec[idx.first + idx.second * cols()];
    }
private:
    std::size_t cols() const { return base::size() / rows; }
    std::size_t rows = 0;
};

// ...

// matrix::size() returns {cols, rows}; concatenate A's dimensions with B's
// row count to get the three loop bounds.
auto const size = std::tuple_cat(A.size(), std::make_tuple(B.size().second));

matrix result;
result.resize({std::get<0>(size), std::get<2>(size)});

for (std::size_t i = 0; i < std::get<0>(size); ++i)
    for (std::size_t j = 0; j < std::get<1>(size); ++j)
        for (std::size_t k = 0; k < std::get<2>(size); ++k)
            result[{i, k}] += A[{i, j}] * B[{j, k}];

I skipped lots of details, such as the non-default constructors that are needed if you want a pretty initialization syntax. Moreover, as a matrix, this type will need lots of arithmetic operators. Another approach would be a type-erased 2D raw array, but that would require defining an assignment operator as well as copy and move constructors. So this std::vector-based solution seems to be the simplest implementation. Also, if the dimensions are fixed at compile time, a template alias will do:

#include <array>
#include <cstddef>

template<typename T, std::size_t cols, std::size_t rows>
using array_2d = std::array<std::array<T, rows>, cols>;

array_2d<double, icount, kcount> result{};  // icount/kcount: compile-time dimensions
Red.Wave

You are using the naive algorithm for matrix multiplication. It's extremely cache-unfriendly, and you are hit by the full memory latency.

First, split up the operations so that your inner loop repeatedly accesses data that fits into your cache.
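
A minimal sketch of this idea, assuming the matrices have been flattened into row-major n x n arrays as discussed in the comments; the tile size `BS = 64` is only a guess and should be tuned so that three tiles fit in cache.

#include <algorithm>
#include <cstddef>
#include <vector>

// C must be zero-initialized before the call.
void mat_mul_blocked(const std::vector<double>& A, const std::vector<double>& B,
                     std::vector<double>& C, std::size_t n) {
    constexpr std::size_t BS = 64;  // tile size: tune per cache size
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // Work on one BS x BS tile so the inner loops stay in cache.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}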

Second, calculate four sums simultaneously to avoid latency penalties.
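
A sketch of the idea on a plain dot product: the four partial sums are independent, so the floating-point additions can overlap instead of forming one long dependency chain (for brevity this assumes `len` is a multiple of 4; a real version would handle the remainder).

#include <cstddef>

double dot4(const double* a, const double* b, std::size_t len) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (std::size_t i = 0; i < len; i += 4) {
        s0 += a[i + 0] * b[i + 0];  // four independent accumulators
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}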

Third, use FMA instructions (fused multiply-add), which calculate a product and a sum in the same time as a product alone.
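
In portable C++ this can be expressed with `std::fma` from `<cmath>`; note that the compiler only lowers it to a single hardware instruction when FMA is enabled at compile time (e.g. `-mfma` or `-march=native`).

#include <cmath>
#include <cstddef>

double fma_dot(const double* a, const double* b, std::size_t len) {
    double s = 0.0;
    for (std::size_t i = 0; i < len; ++i)
        s = std::fma(a[i], b[i], s);  // a[i]*b[i] + s in one rounded operation
    return s;
}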

Fourth, use vector registers.
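
A sketch of one row update written directly with AVX2 intrinsics; it assumes an x86-64 target built with `-mavx2 -mfma`, `n` a multiple of 8, and flattened row-major storage (the function name is made up).

#include <immintrin.h>
#include <cstddef>

// Adds a * B[k][0..n) into C[i][0..n), eight doubles per iteration.
void row_update_avx2(const double* Brow, double a, double* Crow, std::size_t n) {
    const __m256d av = _mm256_set1_pd(a);  // broadcast A[i][k] to all lanes
    for (std::size_t j = 0; j < n; j += 8) {
        __m256d c0 = _mm256_loadu_pd(Crow + j);
        __m256d c1 = _mm256_loadu_pd(Crow + j + 4);
        c0 = _mm256_fmadd_pd(av, _mm256_loadu_pd(Brow + j), c0);
        c1 = _mm256_fmadd_pd(av, _mm256_loadu_pd(Brow + j + 4), c1);
        _mm256_storeu_pd(Crow + j, c0);
        _mm256_storeu_pd(Crow + j + 4, c1);
    }
}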

Fifth, use multiple threads.
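
A sketch with OpenMP (build with `-fopenmp`): the rows of the result are distributed across threads, and since each thread writes a disjoint set of rows, no synchronization is needed.

#include <cstddef>
#include <vector>

// C must be zero-initialized; matrices are flattened row-major n x n.
void mat_mul_omp(const std::vector<double>& A, const std::vector<double>& B,
                 std::vector<double>& C, std::size_t n) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}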

Or just use a package like LINPACK that optimises things for you.

gnasher729
  • Note that LINPACK is a FORTRAN linear algebra library and a benchmark; it is not meant to compute matrix multiplications. BLAS does that. Note also that the LINPACK library was supplanted by LAPACK several decades ago. The benchmark has also been replaced recently by HPCG (still made by Jack Dongarra's team), which is much better. Despite being obsolete, LINPACK is still widely used as an HPC benchmark today since it favours hardware vendors with bad HPCG results (HPCG shows that most HPC machines are inefficient, and LINPACK was clearly not representative of applications, i.e. a bad benchmark). – Jérôme Richard Apr 24 '22 at 10:43