I am facing the problem of calculating matrix-vector products as fast as possible, where the matrix is strictly real and the vector is complex. To be more precise, the real matrix is (anti-)symmetric and has a sparse structure of non-zero, dense blocks.
My strategy so far has been to split the vector into its real and imaginary parts and, for each dense block, to compute the matrix-vector products for both parts. Since the matrix is (anti-)symmetric, I calculate the products for a block and its transpose simultaneously, so I can re-use the cache lines the block already occupies. For each block, this means 4 matrix-vector products: the block and its transpose, each applied to the real and the imaginary part.
My code for calculating these 4 products for a single block ultimately looks like this:
#define no_alias __restrict__
template <typename VecType, typename MatType>
void trans_mul( const VecType* const no_alias re_in,
                VecType* const no_alias re_out,
                const VecType* const no_alias im_in,
                VecType* const no_alias im_out,
                const VecType* const no_alias re_tin,
                VecType* const no_alias re_tout,
                const VecType* const no_alias im_tin,
                VecType* const no_alias im_tout,
                const MatType* no_alias mat, // the matrix block, stored column-major
                const int rows,
                const int cols)
{
    for (int j = 0; j < cols; ++j) {
        for (int i = 0; i < rows; ++i) {
            const auto m = *mat++;        // this is mat[i, j]
            re_tout[j] += m * re_tin[i];  // transposed product
            im_tout[j] += m * im_tin[i];  // transposed product
            re_out[i]  -= m * re_in[j];   // direct product; minus sign per the anti-symmetric convention
            im_out[i]  -= m * im_in[j];   // direct product; minus sign per the anti-symmetric convention
        }
    }
}
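For context, the routine is driven over the block structure roughly as follows. The Block descriptor and the blocks container are placeholders for illustration, not my actual data layout; the point is only that each stored off-diagonal block updates the output slice of its rows via the direct products and the slice of its columns via the transposed products, so the no_alias promise holds because the two slices do not overlap:

#include <vector>

struct Block {            // hypothetical descriptor of one dense non-zero block
    int row_offset;       // first global row covered by the block
    int col_offset;       // first global column covered by the block
    int rows, cols;       // block dimensions
    const double* data;   // column-major entries of the block
};

void apply(const std::vector<Block>& blocks,
           const double* re_in,  const double* im_in,
           double* re_out,       double* im_out)
{
    for (const Block& b : blocks) {
        // Direct products touch the block's rows, transposed products its columns;
        // only off-diagonal blocks are assumed here, so the output slices do not alias.
        trans_mul(re_in + b.col_offset,  re_out + b.row_offset,
                  im_in + b.col_offset,  im_out + b.row_offset,
                  re_in + b.row_offset,  re_out + b.col_offset,
                  im_in + b.row_offset,  im_out + b.col_offset,
                  b.data, b.rows, b.cols);
    }
}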
The typical matrix size is of order 10^2.
I compile my code using GCC 9.2.1 with -Ofast -march=native. From the assembly output, I can see that the compiler is automatically vectorizing and using SIMD instructions.
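For completeness, the invocation is nothing fancier than the following; the -fopt-info-vec-optimized flag is only an optional way to make GCC report which loops it vectorized, and the file name is a placeholder:

g++ -Ofast -march=native -fopt-info-vec-optimized -c trans_mul.cpp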
I am competing with similar code written in Fortran, which still runs about 25% faster. Of course, my code is extremely naive, but I have not been able to come up with anything faster, as the aggressive optimization seems to be highly effective. I also tried using four cblas_dgemv calls (essentially the variant sketched below), which, however, was significantly slower than my naive approach. Is there still something left I can do, or is there a BLAS routine that would better suit my case?
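For reference, the dgemv variant I benchmarked was essentially the following. This sketch assumes MatType is double and that each block is stored column-major with leading dimension rows, and it uses alpha/beta to reproduce the -=/+= accumulation of trans_mul:

#include <cblas.h>

void trans_mul_blas(const double* re_in,  double* re_out,
                    const double* im_in,  double* im_out,
                    const double* re_tin, double* re_tout,
                    const double* im_tin, double* im_tout,
                    const double* mat, const int rows, const int cols)
{
    // re_out -= B * re_in  and  im_out -= B * im_in
    cblas_dgemv(CblasColMajor, CblasNoTrans, rows, cols, -1.0, mat, rows, re_in, 1, 1.0, re_out, 1);
    cblas_dgemv(CblasColMajor, CblasNoTrans, rows, cols, -1.0, mat, rows, im_in, 1, 1.0, im_out, 1);
    // re_tout += B^T * re_tin  and  im_tout += B^T * im_tin
    cblas_dgemv(CblasColMajor, CblasTrans, rows, cols, 1.0, mat, rows, re_tin, 1, 1.0, re_tout, 1);
    cblas_dgemv(CblasColMajor, CblasTrans, rows, cols, 1.0, mat, rows, im_tin, 1, 1.0, im_tout, 1);
}

My guess is that, for blocks of this size, the per-call overhead and the fact that each dgemv streams the block from memory separately outweigh any benefit, but that is exactly why I am asking whether there is a better BLAS-based formulation.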