I am facing the problem of calculating matrix-vector products as fast as possible, where the matrix is strictly real and the vector is complex. To be more precise, the real matrix is (anti-)symmetric and has a sparse structure of non-zero, dense blocks.

My strategy so far has been to split the vector into its real and imaginary parts and to compute, for each dense block, the matrix-vector products for both parts. As the matrix is (anti-)symmetric, I intend to simultaneously calculate the products for a block and its transpose, so I can re-use the cache lines the matrix already occupies. In the antisymmetric case, the mirrored block is just the negated transpose of the block, which is where the minus signs below come from. So for each block, I compute 4 matrix-vector products: the block and its transpose, each applied to the real and the imaginary part.

My code for calculating these 4 products for a single block ultimately looks like this:

#define no_alias __restrict__

template <typename VecType, typename MatType>
void trans_mul( const VecType* const no_alias re_in,
                VecType* const no_alias re_out, 
                const VecType* const no_alias im_in,
                VecType* const no_alias im_out, 
                const VecType* const no_alias re_tin,
                VecType* const no_alias re_tout,
                const VecType* const no_alias im_tin,
                VecType* const no_alias im_tout,
                const MatType* no_alias mat, // the matrix block
                const int rows, 
                const int cols)
{
    for(int j = 0; j < cols; ++j) {
        for(int i = 0; i < rows; ++i) {
            const auto m = *mat++;        // mat(i, j), assuming column-major storage
            re_tout[j]  += m * re_tin[i]; // transposed
            im_tout[j]  += m * im_tin[i]; // transposed
            re_out[i]   -= m * re_in[j];
            im_out[i]   -= m * im_in[j];
        }
    }
}

The typical matrix size is of order 10^2. I compile my code using GCC 9.2.1 with `-Ofast -march=native`. From the assembly output, I can see that the compiler is auto-vectorizing and using SIMD instructions.
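
For reference, GCC's vectorization report can confirm this without digging through the assembly (`-fopt-info-vec-optimized` is a standard GCC flag; the file name is just a placeholder):

g++ -Ofast -march=native -fopt-info-vec-optimized -c trans_mul.cpp
# prints "optimized: loop vectorized ..." notes for each vectorized loop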

I am competing with similar code written in Fortran, which still runs about 25% faster. Of course, my code is extremely naive, but I could not come up with anything faster than that, as the aggressive optimization seems to be highly effective. I also tried using four `cblas_dgemv` calls, which however turned out significantly slower than my naive approach. Is there still something left I can do, or is there any useful BLAS routine that could suit my case?
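
For reference, a BLAS version roughly equivalent to the loop above would look something like this (a sketch assuming double precision and column-major block storage):

#include <cblas.h>

// Sketch of the four dgemv calls: each computes y := alpha*op(A)*x + beta*y
void trans_mul_blas(const double* re_in,  double* re_out,
                    const double* im_in,  double* im_out,
                    const double* re_tin, double* re_tout,
                    const double* im_tin, double* im_tout,
                    const double* mat, const int rows, const int cols)
{
    // Transposed products: tout[j] += sum_i mat(i,j) * tin[i]
    cblas_dgemv(CblasColMajor, CblasTrans,   rows, cols,  1.0, mat, rows, re_tin, 1, 1.0, re_tout, 1);
    cblas_dgemv(CblasColMajor, CblasTrans,   rows, cols,  1.0, mat, rows, im_tin, 1, 1.0, im_tout, 1);
    // Forward products with flipped sign: out[i] -= sum_j mat(i,j) * in[j]
    cblas_dgemv(CblasColMajor, CblasNoTrans, rows, cols, -1.0, mat, rows, re_in,  1, 1.0, re_out,  1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, rows, cols, -1.0, mat, rows, im_in,  1, 1.0, im_out,  1);
}

For blocks of this size (~100), I suspect the fixed per-call overhead of four separate BLAS calls dominates, which would explain why this was slower.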

Jodocus
  • Be *really careful* with `-Ofast`: it breaks some language rules to give you speed, so some language guarantees no longer hold, and in corner cases you may get weird bugs. Personally, I avoid it and stick to benchmarking `-O2` vs `-O3` performance for my code. I'd rather have predictable behaviour than some tiny (usually insignificant or even nonexistent) speedup. – Jesper Juhl Mar 23 '20 at 18:13
  • @JesperJuhl Thanks for the remark, but I am aiming for portability and speed, even if I lose some accuracy in the result (which I do). The speedup from `-O3` to `-Ofast`, however, proved highly significant in my case: about 125% faster. – Jodocus Mar 23 '20 at 18:18
  • 1
    the results of `-Ofast` are anything *but* portable. As I said, it discards standard mandated behaviour for speed, so you cannot guarantee the same results with other compilers or the same compiler on other platforms. – Jesper Juhl Mar 23 '20 at 18:21
  • What is the typical size of the computed matrices/vectors? Please read [this](https://stackoverflow.com/questions/27315941/gcc-ofast-complete-list-of-limitations) to check whether using `-Ofast` here is OK for you. – Jérôme Richard Mar 23 '20 at 18:41
  • @JérômeRichard They may range from 100 to 10k. – Jodocus Mar 23 '20 at 19:01

1 Answer


For quite big matrices (e.g. >= 1k), you can use register blocking to improve performance (it reduces the number of memory loads/stores relative to arithmetic operations). For small matrices, it is hard to do better than the original code.

Here is the resulting code with register blocking:

#define no_alias __restrict__
#define mini(a, b) (((a)<(b)) ? (a) : (b))

template <typename VecType, typename MatType>
void trans_mul_v2( const VecType* const no_alias re_in,
                VecType* const no_alias re_out, 
                const VecType* const no_alias im_in,
                VecType* const no_alias im_out, 
                const VecType* const no_alias re_tin,
                VecType* const no_alias re_tout,
                const VecType* const no_alias im_tin,
                VecType* const no_alias im_tout,
                const MatType* no_alias mat, // the matrix block
                const int rows, 
                const int cols)
{
    // Block size (tuned for Clang/GCC on Intel Skylake processors)
    const int si = 16;
    const int sj = 8;

    for(int bj = 0; bj < cols; bj+=sj) {
        for(int bi = 0; bi < rows; bi+=si) {
            if(bi+si <= rows && bj+sj <= cols)
            {
                // The underlying loops are expected to be unrolled by the compiler
                for(int j = bj; j < bj+sj; ++j) {
                    for(int i = bi; i < bi+si; ++i) {
                        const auto m = mat[j*rows+i]; // Assume a column major ordering
                        re_tout[j]  += m * re_tin[i];
                        im_tout[j]  += m * im_tin[i];
                        re_out[i]   -= m * re_in[j];
                        im_out[i]   -= m * im_in[j];
                    }
                }
            }
            else
            {
                // General case (borders)
                for(int j = bj; j < mini(bj+sj,cols); ++j) {
                    for(int i = bi; i < mini(bi+si,rows); ++i) {
                        const auto m = mat[j*rows+i];
                        re_tout[j]  += m * re_tin[i];
                        im_tout[j]  += m * im_tin[i];
                        re_out[i]   -= m * re_in[j];
                        im_out[i]   -= m * im_in[j];
                    }
                }
            }
        }
    }
}

Note that the values of `si` and `sj` have a strong impact on the execution time. The optimal values depend on the compiler and the underlying architecture. You should probably tune them for the target machine (keep them small if you want performance portability, although the performance may not be optimal).
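
As a starting point, a minimal timing harness along these lines can be used to compare block sizes (a sketch assuming double precision and that `trans_mul_v2` is declared above; the buffer setup and `bench` helper are illustrative, not the original benchmark):

#include <chrono>
#include <cstdio>
#include <vector>

// Times f() over `reps` repetitions and returns the average in microseconds.
template <typename F>
double bench(F&& f, int reps)
{
    const auto t0 = std::chrono::steady_clock::now();
    for(int r = 0; r < reps; ++r)
        f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
}

int main()
{
    const int rows = 1000, cols = 1000;
    // Arbitrary fill values; only the timing matters here.
    std::vector<double> mat(static_cast<size_t>(rows) * cols, 0.5);
    std::vector<double> re_in(cols, 1.0),  re_out(rows, 0.0);
    std::vector<double> im_in(cols, 1.0),  im_out(rows, 0.0);
    std::vector<double> re_tin(rows, 1.0), re_tout(cols, 0.0);
    std::vector<double> im_tin(rows, 1.0), im_tout(cols, 0.0);

    const double us = bench([&]{
        trans_mul_v2(re_in.data(), re_out.data(), im_in.data(), im_out.data(),
                     re_tin.data(), re_tout.data(), im_tin.data(), im_tout.data(),
                     mat.data(), rows, cols);
    }, 100);
    std::printf("trans_mul_v2: %.3f us\n", us);
}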

Here are the results (GCC 9, double-precision types):

With rows=100 and cols=100:
trans_mul_v1: 2.438 us
trans_mul_v2: 2.842 us

With rows=1k and cols=1k:
trans_mul_v1: 452 us
trans_mul_v2: 296 us

With rows=10k and cols=10k:
trans_mul_v1: 71.2 ms
trans_mul_v2: 35.9 ms
Jérôme Richard