22

I need frequent usage of matrix_vector_mult() which multiplies matrix with vector, and below is its implementation.

Question: Is there a simple way to make it significantly, at least twice, faster?

Remarks: 1) The size of the matrix is about 300x50. It doesn't change during the run. 2) It must work on both Windows and Linux.

double vectors_dot_prod(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i;
    for (i = 0; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

void matrix_vector_mult(const double **mat, const double *vec, double *result, int rows, int cols)
{ // in matrix form: result = mat * vec;
    int i;
    for (i = 0; i < rows; i++)
    {
        result[i] = vectors_dot_prod(mat[i], vec, cols);
    }
}
Serg
  • 13,470
  • 8
  • 36
  • 47
  • 4
    I believe there are SIMD instructions specifically designed for doing dot products but I may be mistaken. – Sean Bright Sep 05 '12 at 20:29
  • @SeanBright, dot products is indeed, the bottleneck here – Serg Sep 05 '12 at 20:30
  • 6
    typically, it's hard to beat an optimized `BLAS` implementation, and `dgemv` seems to be what you're looking for – ev-br Sep 05 '12 at 20:35
  • 2
    If you do not want to change your code by using optimized libraries or hand-coded SSE/AVX instructions, then compiler switches and pragmas could help you in tuning the code. Perhaps `Table 4: Compiler Hints for Intra-Register Vectorization` would be helpful - http://software.intel.com/en-us/articles/vectorization-with-the-intel-compilers-part-i/ – Sayan Sep 05 '12 at 20:42
  • 1
    You could try to use multiple CPUs by computing more then one result at the same time. – Fozi Sep 05 '12 at 20:54
  • Possible duplicate of [Speed up matrix multiplication](http://stackoverflow.com/questions/4300663/speed-up-matrix-multiplication) for C++ | http://stackoverflow.com/questions/1907557/optimized-matrix-multiplication-in-c for C – Ciro Santilli OurBigBook.com Mar 13 '17 at 09:29

3 Answers3

25

This is something that in theory a good compiler should do by itself, however I made a try with my system (g++ 4.6.3) and got about twice the speed on a 300x50 matrix by hand unrolling 4 multiplications (about 18us per matrix instead of 34us per matrix):

double vectors_dot_prod2(const double *x, const double *y, int n)
{
    double res = 0.0;
    int i = 0;
    for (; i <= n-4; i+=4)
    {
        res += (x[i] * y[i] +
                x[i+1] * y[i+1] +
                x[i+2] * y[i+2] +
                x[i+3] * y[i+3]);
    }
    for (; i < n; i++)
    {
        res += x[i] * y[i];
    }
    return res;
}

I expect however the results of this level of micro-optimization to vary wildly between systems.

6502
  • 112,025
  • 15
  • 165
  • 265
  • Can you post the code on ideone please? I tried this myself and didn't get any difference whatsoever (even a bit slower with the unrolled loops) – Luchian Grigore Sep 05 '12 at 20:52
  • 2
    @LuchianGrigore: http://ideone.com/JXXtn , and there too seems to run about twice as fast – 6502 Sep 05 '12 at 20:54
  • I guess MSVS does it automatically, that's why there was no difference for me. +1 – Luchian Grigore Sep 05 '12 at 21:08
  • That depends on the optimization level. Using gcc 4.4 on Ubuntu Lucid and -O3 brings the difference down to some 20% (from 14us to 10us). – ev-br Sep 05 '12 at 21:12
  • @Zhenya still something - though I compiled with full optimizations. – Luchian Grigore Sep 05 '12 at 21:12
  • 1
    @LuchianGrigore using `g++ -O3 -funroll-loops` already brings it down to 11.6us vs 10.6us. There sure is more room for playing with compile line options :-) – ev-br Sep 05 '12 at 21:19
  • My problem: - simple: 11.5 sec - 4: 7.2 sec - 8: 6.5 sec - 16: 5.6 sec with parentheses around blocks multiple of 2. NICE!!! – F. Privé Dec 28 '16 at 10:47
  • Yeah, for some really dumb reason gcc has a hard time vectorizing the inner loop. – Benjamin Feb 01 '18 at 23:08
5

As Zhenya says, just use a good BLAS or matrix math library.

If for some reason you can't do that, see if your compiler can unroll and/or vectorize your loops; making sure rows and cols are both constants at the call site may help, assuming the functions you posted are available for inlining

If you still can't get the speedup you need, you're looking at manual unrolling, and vectorizing using extensions or inline assembler.

Useless
  • 64,155
  • 6
  • 88
  • 132
0

If the size is constant and known in advance, pass it in as a precompiler variable, which will permit the compiler to optimize more fully.

djechlin
  • 59,258
  • 35
  • 162
  • 290
  • The size is achieved from parameters, during initialization. I can't make it hard-coded. – Serg Sep 05 '12 at 20:58
  • Poor man's memoization: Write your own that works for your expected code, first line of your production one will direct to your function if the parameters just happen to be the ones you are expecting. – djechlin Sep 05 '12 at 21:03