I'm trying to figure out a suitable way to apply a row-wise permutation to a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX-512).
The problem is basically computing R = PX, where P is a sparse permutation matrix with only one nonzero element per column. This allows P to be represented as a vector p, where p[i] is the row index of the nonzero value in column i. The code below shows a simple loop that does this:
// R and X are 2D row-major matrices with the same shape (m, n)
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        // row i of X is accumulated into row p[i] of R
        R[p[i] * n + j] += X[i * n + j];
    }
}
I assume it all boils down to a gather, but before spending a long time trying to implement various approaches, I would love to know what you folks think about this and which approach is the most suitable for tackling it.
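For concreteness, here is a minimal sketch of the loop as a self-contained function. It rests on assumptions that are mine: the element type is float, both matrices are stored row-major in flat arrays, and the helper name permute_rows_add is made up for illustration. In this layout the indirection through p happens once per row, so the inner loop seems to need only contiguous unaligned loads and stores rather than a gather. The intrinsics path is compiled only when AVX is enabled; otherwise the scalar loop handles everything.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (name and signature are assumptions):
// R[p[i], :] += X[i, :] for row-major float matrices of shape (m, n).
void permute_rows_add(float* R, const float* X, const std::size_t* p,
                      std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i) {
        assert(p[i] < m);              // p must hold valid row indices
        const float* src = X + i * n;  // source row i of X
        float* dst = R + p[i] * n;     // destination row p[i] of R
        std::size_t j = 0;
#if defined(__AVX__)
        // 8 floats per iteration with unaligned loads/stores
        for (; j + 8 <= n; j += 8) {
            __m256 x = _mm256_loadu_ps(src + j);
            __m256 r = _mm256_loadu_ps(dst + j);
            _mm256_storeu_ps(dst + j, _mm256_add_ps(r, x));
        }
#endif
        // scalar tail (or the whole row when AVX is not enabled)
        for (; j < n; ++j) {
            dst[j] += src[j];
        }
    }
}
```

If R is known to start zeroed, the add could presumably be replaced by a plain store, which would drop one load per vector.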
Isn't it strange that none of the compilers use AVX-512 for this? https://godbolt.org/z/ox9nfjh8d
Why is it that gcc doesn't do register blocking? I see clang does a better job. Is this common?