
I'm trying to figure out a suitable way to apply a row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX-512).

The problem is basically calculating R = PX where P is a sparse permutation matrix with only one nonzero element per column. This allows representing P as a vector p, where p[i] is the row index of the nonzero value in column i. The code below shows a simple loop that achieves this:

// R and X are row-major 2D matrices with shape (m, n),
// stored contiguously, so element (i, j) is at linear index i*n + j
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]*n + j] += X[i*n + j];
    }
}

I assume it all boils down to gather, but before spending a long time implementing various approaches, I would love to know what you folks think about this and which approach is most suitable for tackling it.
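For reference, here is a compilable sketch of that loop (the function names and the `restrict` factoring are mine, assuming row-major contiguous float storage). Splitting out the per-row accumulate with `__restrict`-qualified pointers is the kind of shape GCC/clang tend to auto-vectorize into packed adds:

```cpp
#include <cstddef>
#include <cassert>

// Accumulate one source row into one destination row.
// __restrict tells the compiler the rows don't overlap, which
// lets GCC/clang auto-vectorize this inner loop into packed adds.
static void row_accumulate(float* __restrict dst,
                           const float* __restrict src,
                           std::size_t n) {
    for (std::size_t j = 0; j < n; ++j)
        dst[j] += src[j];
}

// R = PX expressed through the index vector p:
// row i of X is accumulated into row p[i] of R.
void apply_p(float* R, const float* X,
             const std::size_t* p,
             std::size_t m, std::size_t n) {
    for (std::size_t i = 0; i < m; ++i)
        row_accumulate(R + p[i] * n, X + i * n, n);
}
```

Note that `__restrict` is a compiler extension in C++ (standard `restrict` exists only in C), but GCC, clang, and MSVC all accept it.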

Isn't it strange that none of the compilers use AVX-512 for this? https://godbolt.org/z/ox9nfjh8d

Why is it that gcc doesn't do register blocking? I see that clang does a better job; is this common?

  • If you have variable indexing in the *destination* side, that's more like a scatter, so AVX-512 only and slower than gathers, unfortunately. (Or for smaller row sizes where a whole row fits in one SIMD vector, shuffles work as a vector of *source* indices, so this wouldn't work at all for a SIMD shuffle within one vector, or vpermt2ps taking from 2 vectors. There's AFAIK no easy way to transform destination indices to source indices.) – Peter Cordes Aug 22 '21 at 04:25
  • Thanks for mentioning this, I updated the Q with more detail, take a look. – NULL Aug 22 '21 at 04:35
  • Oh, I think I was misunderstanding this. You're copying whole rows unchanged, so you're not shuffling *within* each row, you're shuffling whole rows which are larger than a single SIMD vector, so no SIMD gather, scatter, or shuffle instructions needed. – Peter Cordes Aug 22 '21 at 04:42
  • Are you wanting to do this as an in-place re-arrangement? Or only as a copy to a separate destination? As a copy, it's just a series of memcpy operations, which you (or the libc implementation) can speed up with SIMD instructions. – Peter Cordes Aug 22 '21 at 04:45
  • yes, exactly, the whole row. The matrix is often thin, like 10-100 cols, but I'm writing this as part of a fundamental operation (step 1 of a pipeline) that many people including myself need to do over and over again, so I can't really make an assumption on col size. Yet, it's still worth the time and effort to optimize this as much as possible, since this is the main bottleneck (96% of time spent on this) – NULL Aug 22 '21 at 04:45
  • @PeterCordes it can be either, no specific requirement. But yes, that memcpy that you mention is basically it. I'm just trying to figure out the best approach for the memcpy. Do you think memcpy itself is best? I remember people mentioning on SO that a custom SIMD memcpy can beat std memcpy. – NULL Aug 22 '21 at 04:48
  • 96% of time spent on this?!??!? That sounds insane. Are you sure it's the actual copying work, and not page faults on touching new pages if you're freeing and allocating new memory every time? What language, and what compiler / hardware? – Peter Cordes Aug 22 '21 at 04:49
  • If you really do need to do huge amounts of row permutes and don't actually read the whole matrix afterwards, consider adding a layer of indirection to everything that reads matrices so you *just* need to re-arrange a vector of row-start pointers or indices, not the data. – Peter Cordes Aug 22 '21 at 04:49
  • @PeterCordes The number of rows that it needs to deal with is at least 10K and often far larger than that. Actually, myself as a user, would need this for X of size 1 million and 10-100 cols. – NULL Aug 22 '21 at 04:50
  • glibc memcpy already uses AVX fairly efficiently, including for small copies. What you'd be possibly speeding up is startup overhead. By default it doesn't use AVX-512, though. Doing this in-place would be a whole different kettle of fish because you need to not overwrite data you haven't read yet. – Peter Cordes Aug 22 '21 at 04:51
  • Yeah, I assumed you had many rows, but that doesn't explain why 96% of your time would be spent just copying around rows of those matrices, with only <= 4% spent on number crunching to *do* anything with the data once it's re-arranged. – Peter Cordes Aug 22 '21 at 04:54
  • @PeterCordes Take a look at this Peter: https://edoliberty.github.io/papers/mailmanAlgorithm.pdf This is the method, and in genomics matrices have limited alphabet (3) so in theory one should be able to beat BLAS by about 3x. This step that consumes most of operation is applying the P matrix. – NULL Aug 22 '21 at 04:54
  • @PeterCordes I've used various implementations, from a scalar loop in C++ (letting GCC take care of stuff with opt flags), to handwritten intrinsics (naive though), to libraries like https://xtensor.readthedocs.io/en/latest/ The thing is, in mailman, applying U to the result of PX is nothing, we're basically talking about 100 microseconds, but PX is like 4 milliseconds. – NULL Aug 22 '21 at 05:07
  • You have an actual `P` matrix of zeros and ones that you're using with an actual matmul? Not just directly applying the `p[i]` row rearrangement vector with memcpy? (Or using that as a level of indirection so you can just modify it instead of actually copying around whole rows of data). – Peter Cordes Aug 22 '21 at 05:16
  • @PeterCordes Essentially for a matrix `M` with limited alphabet (possible values) we have a cheap decomposition, `M=UP` where U is a universal matrix containing all possible combinations of column values, and because of this it has a very special structure and never needs to be stored explicitly. P is a sparse permutation matrix with only 1 nonzero value in each column. – NULL Aug 22 '21 at 05:23
  • @PeterCordes Then performing the multiplication `R=MX` is equivalent to `R=U(PX)`, where `A = PX` is basically rearranging X with a possible decrease in dimension (row i of A may be the sum of multiple rows of X). Remember, P could have multiple nonzero elements in each row. Now p is simply an efficient way to store P. Sorry, I had a mistake in the pseudocode, let me fix it in the question. – NULL Aug 22 '21 at 05:27
  • @PeterCordes Now, `A=PX` is taking about 4 ms while `R=UA` is only 100 microseconds thanks to the special structure of `U`. That is indeed where the saving in complexity is coming from. I just want to find a way to speed `A=PX` as it is really the major bottleneck and assuming it as a general sparse matrix is an overkill. – NULL Aug 22 '21 at 05:36
  • Ok, adding into the destination instead of copying is quite different, and isn't just a permute. Now the left-hand side is also an input, and you can't just use memcpy. You can still use SIMD loops over whole rows, though, just based on the `p[]` vector, not a `P[]` matrix. GCC/clang should even auto-vectorize a loop that does `a[i] += b[i]`, for two `int *restrict` non-overlapping args, although you might want to customize that to do better handling of start/end, especially with AVX-512 masking. – Peter Cordes Aug 22 '21 at 14:46
  • @PeterCordes Thanks Peter, I wonder why none of the compilers even bother using avx512? Take a look at the godbolt link in the Q. – NULL Aug 22 '21 at 21:14
  • `-mprefer-vector-width=256` is the default because of [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812). It would be nice if compilers used AVX-512 masking with YMM registers for the cleanup, though, instead of 128-bit and scalar after doing the last full YMM vector. – Peter Cordes Aug 22 '21 at 21:55
  • The compilers did nothing interesting because your code is incorrect. It only compiles because the language has an exotic feature called “comma operator”. You are probably expecting your `X[i,j]` to index into a 2D array, but that's not what your code is doing. – Soonts Aug 24 '21 at 12:58
  • @Soonts Thanks for pointing this out; in the actual code I did use linear indexing but forgot to do it here. Updated the code and the godbolt link. – NULL Aug 26 '21 at 15:14
  • @Soonts Interesting that register blocking doesn't happen in gcc. Of course, that's if I'm using the term register blocking correctly; it's been years since I last did this. What I mean is the use of multiple ymm registers, etc. – NULL Aug 26 '21 at 15:18
  • @NULL On your updated link, both clang and intel are doing that for large enough matrices. They load 4 AVX vectors (16 scalars) into 4 registers, add another 16 scalars using memory source operands, then store them back to memory. Clang even does that twice per iteration to save scalar instructions; one iteration handles 32 scalars = 256 bytes of data. – Soonts Aug 26 '21 at 17:35
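Picking up the indirection idea raised earlier in the thread: for the pure-permutation case (one nonzero per row as well as per column, so no accumulation), re-arranging a vector of row pointers or indices avoids touching the data at all. A hypothetical sketch (the struct and field names are mine):

```cpp
#include <cstddef>
#include <vector>
#include <cassert>

// View a row-major (m x n) matrix through an index vector instead of
// physically moving rows: logical row i of the permuted matrix is just
// storage row perm[i]. Applying a new permutation touches m indices,
// not m*n elements. This only works for a pure permutation; once P can
// sum several source rows into one destination row, an accumulate loop
// over the data itself is needed instead.
struct PermutedView {
    const float* data;              // row-major storage, never copied
    std::size_t n;                  // columns per row
    std::vector<std::size_t> perm;  // perm[i] = storage row backing logical row i

    const float* row(std::size_t i) const {
        return data + perm[i] * n;
    }
};
```

Downstream code then reads rows through `row(i)`, so "applying P" is just writing a new `perm` vector.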

0 Answers