How to calculate 2x2 matrix multiplied by 2D vector using SSE intrinsics (32 bit floating points)? (C++, Mac and Windows)

Question

I need to multiply a 2x2 matrix by a 2D vector. Both use 32 bit floats. I'm hoping to do this using SSE (any version really) for speed optimization purposes, as I'm going to be using it for realtime audio processing.

So the formula I would need is the following:

left  = L*A + R*B
right = L*C + R*D

I was thinking of reading the whole matrix from memory as a 128 bit floating point SIMD (4 x 32 bit floating points) if it makes sense. But if it's a better idea to process this in smaller pieces, then that's fine too.

The L & R variables will be in their own floats when the processing begins, so they would need to be moved into the SIMD register/variable and, when the calculation is done, moved back into regular variables.

The IDEs I'm hoping to get it compiled on are Xcode and Visual Studio. So I guess the code needs to compile and run properly with Clang and Microsoft's own compiler.

All help is welcome. Thank you in advance!

I already tried reading SSE instruction sets, but there seems to be so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.

ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:

The L & R data come from their own arrays. I have pointers to each of the two arrays (L & R) and then go through them at the same time. So the left/right audio channel data is not interleaved; each channel has its own pointer. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.
Some really good points have been made in the comments about the modern compilers being able to optimize the code really well. This is especially true when multiplication is quite fast and shuffling data inside the SIMD registers might be needed: using more multiplications might still be faster than having to shuffle the data multiple times. I didn't realise that modern compilers can be that good these days. I have to experiment with Godbolt using std::array and seeing what kind of results I'll get for my particular case.
The data needs to be in 32 bit floats, as that is used all over the application. So 16 bit doesn't work for my case.

MORE INFORMATION BASED ON MY TESTS:

I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don't get optimal code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[1];
    result[1] = v[0]*m[2] + v[1]*m[3];

    return result;
}

But if I do the following, I do get quite nice code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];

    return result;
}

Meaning that if I transpose the 2D matrix, the compiler seems to output pretty good results as is. I believe I should go with this method since the compiler seems to be able to handle the code nicely.

Are `L` and `R` required to come from independent variables or are you able to store them consecutively? Did you check what code clang generates with `-O3 -march=native`? — chtz, Jan 18 '23 at 14:28
In general do not try to outsmart the compiler with handwritten code, see: https://godbolt.org/z/qqzvYvMT5 it already uses those kinds of instructions. You really should watch this: [CppCon 2017: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”](https://www.youtube.com/watch?v=bSkpMdDe4g4). What the compiler will do much better than you is predict the pipelining behavior of the CPU — Pepijn Kramer, Jan 18 '23 at 14:40
@PepijnKramer: That's way too optimistic; compilers often don't spot the best (or even good) ways to shuffle things together (or not) for small problems, if there's any gain to be had vs. scalar. The question is talking about using intrinsics like `_mm_shuffle_ps` and `_mm_mul_ps`, not hand-written asm, so the compiler can still schedule instructions and choose addressing modes, etc. Your choice of intrinsics will influence code-gen, although compilers (especially clang) can optimize your shuffles. But your choice of scalar code will also determine dependency chains without -ffast-math. — Peter Cordes, Jan 18 '23 at 14:53
Note that the benefit of letting the compiler do the job is that it can use AVX/AVX2 when available on the target machine and if the computation can benefit from it, not to mention advanced SSE versions (e.g. SSSE3 & SSE4.1). It also makes the code cleaner and more portable (e.g. to ARM). In general, writing SSE/AVX code is a good idea only if you see that the compiler generates bad code and you cannot help it do better. Otherwise, this is a *premature optimization*. — Jérôme Richard, Jan 18 '23 at 15:00
e.g. you might broadcast L and R and do `[A B C D] * L`, `[A B C D] * R`, and shuffle to get `[B*R B*R D*R D*R]` with SSE3 `movshdup` which can copy-and-shuffle. Then `_mm_add_ps` to get `[L*A+R*B (garbage) L*C+R*D (garbage) ]`, which has one of your scalar outputs in the low element (i.e. this XMM register already is a valid scalar float). With one more `movhps`, you can get the LC+RD value to the bottom of another register, as another scalar float. — Peter Cordes, Jan 18 '23 at 15:02
Now is that better than 4x `mulss` + 2x `addss`? Maybe not, especially on Intel CPUs where shuffle throughput is likely to be a bottleneck. Especially if you can't reuse the broadcasted L and R. Maybe one broadcast, e.g. of L, and one mulps + movhps to get the LA and LC values, but scalar for RB and RD? Can you preprocess the matrix? `movshdup` on `shufps` on it would allow two parallel mulps to leave values in the right places. Or maybe unpack L and R together with `shufps` to make `[L L R R]` to use with a pre-shuffled matrix? But then you need multiple shuffles. — Peter Cordes, Jan 18 '23 at 15:02
@PeterCordes I guess you are right, but still I would not revert to hand coding (like Jerome says) unless I had a real performance issue. And then I would still risk optimizing for one specific machine configuration. — Pepijn Kramer, Jan 18 '23 at 15:08
Do you need 32-bit floats for this? If 16-bit fixed-point is OK, that enables a technique based on PMADDWD which is nice. — harold, Jan 18 '23 at 15:17
I just noticed you said this was for audio processing. Are you doing this for a large number of Left and Right samples with the same matrix? Are they stored interleaved in an array of LRLRLR etc.? Or "planar" like LLLL and RRRR? If the latter, you can do 4 at once very efficiently, multiplying with `_mm_set1_ps(A)`, `_mm_set1_ps(B)`, etc. Or if available, `_mm256_fmadd_ps` to remove the need for the separate adds. If they're interleaved, you could `_mm_shuffle_ps` 2 vectors 2 different ways to get LLLL and RRRR. — Peter Cordes, Jan 18 '23 at 15:23
You might want to check out [this post](https://lucisqr.substack.com/p/icelakeavx2-skylakeavx-or-genericx86) on why perhaps you don't want to hardcode your assembly. — Something Something, Jan 18 '23 at 16:25
I've added new information based on your questions and comments. — user19179144, Jan 19 '23 at 09:06
@PeterCordes: `LRLRLR` is common for stereo. In general, for multi-channel audio such as 5.1 and 7.1 all channels are stored interleaved. That of course argues for a transformation to independent channels when reading these representations. — MSalters, Jan 19 '23 at 10:59
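For readers whose input is interleaved, the `_mm_shuffle_ps` deinterleaving step mentioned in the comments can be sketched roughly as follows. This is only an illustration under assumptions: the function name `Deinterleave4` and its parameters are hypothetical, it handles exactly 4 stereo frames (8 floats) per call, and it uses unaligned loads and stores.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: splits 4 interleaved stereo frames
// (L0 R0 L1 R1 L2 R2 L3 R3) into 4 left and 4 right samples.
inline void Deinterleave4(const float* interleaved, float* left, float* right)
{
    assert(interleaved != nullptr && left != nullptr && right != nullptr);

    const __m128 lo = _mm_loadu_ps(interleaved);      // L0 R0 L1 R1
    const __m128 hi = _mm_loadu_ps(interleaved + 4);  // L2 R2 L3 R3

    // Pick elements 0 and 2 of each input: L0 L1 L2 L3
    _mm_storeu_ps(left, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
    // Pick elements 1 and 3 of each input: R0 R1 R2 R3
    _mm_storeu_ps(right, _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3, 1, 3, 1)));
}
```

In a real loop you would probably keep the two shuffled vectors in registers and feed them straight into the multiplies instead of storing them first.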

Something Something · Answer 1 · 2023-01-18T17:35:23.283

2

You are better off leaving the assembly generation to your compiler. It is doing a great job from what I could gather.

Additionally, GCC and Clang (not sure about the others) have extension attributes that let you compile the same function for several different instruction-set targets; the appropriate version is picked at run time.

For example, consider the following code:

using Vector = std::array<float, 2>;

using Matrix = std::array<float, 4>;

namespace detail {
Vector multiply(const Matrix& m, const Vector& v) {
    Vector r;
    r[0] = v[0] * m[0] + v[1] * m[2];
    r[1] = v[0] * m[1] + v[1] * m[3];
    return r;
}
}  // namespace detail

__attribute__((target("default"))) 
Vector multiply(const Matrix& m, const Vector& v) {
    return detail::multiply(m, v);
}

__attribute__((target("avx"))) 
Vector multiply(const Matrix& m, const Vector& v) {
    return detail::multiply(m, v);
}

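Assume that you compile it with the following options (the listings below come from clang trunk; GCC produces similar, though not identical, code):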

g++ -O3 -march=x86-64 main.cpp -o main

For the AVX clone it generates compact AVX SIMD code with no shuffle instructions:

multiply(Matrix const&, Vector const&) [clone .avx]:      # @multiply(Matrix const&, Vector const&) [clone .avx]
        vmovsd  (%rdi), %xmm0                   # xmm0 = mem[0],zero
        vmovsd  8(%rdi), %xmm1                  # xmm1 = mem[0],zero
        vbroadcastss    4(%rsi), %xmm2
        vmulps  %xmm1, %xmm2, %xmm1
        vbroadcastss    (%rsi), %xmm2
        vmulps  %xmm0, %xmm2, %xmm0
        vaddps  %xmm1, %xmm0, %xmm0
        retq

While the default implementation uses SSE instructions only:

multiply(Matrix const&, Vector const&):          # @multiply(Matrix const&, Vector const&)
        movsd   (%rdi), %xmm1                   # xmm1 = mem[0],zero
        movsd   8(%rdi), %xmm2                  # xmm2 = mem[0],zero
        movss   (%rsi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
        movss   4(%rsi), %xmm3                  # xmm3 = mem[0],zero,zero,zero
        shufps  $0, %xmm3, %xmm3                # xmm3 = xmm3[0,0,0,0]
        mulps   %xmm2, %xmm3
        shufps  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]
        mulps   %xmm1, %xmm0
        addps   %xmm3, %xmm0
        retq
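If you do want intrinsics, the SSE listing above maps fairly directly onto them. The following is a minimal sketch, not what the compiler literally emits: the function name `Multiply2D_SSE` is made up for illustration, the matrix is assumed to be stored transposed as in the source above (so `r[0] = v[0]*m[0] + v[1]*m[2]`), and it relies on `_mm_loadl_pi` / `_mm_storel_pi` to move two floats at a time.

```cpp
#include <array>
#include <cassert>
#include <xmmintrin.h>  // SSE1 intrinsics

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;  // assumed transposed layout, as above

// Hypothetical intrinsics version of detail::multiply():
//   r[0] = v[0]*m[0] + v[1]*m[2]
//   r[1] = v[0]*m[1] + v[1]*m[3]
inline Vec2 Multiply2D_SSE(const Mat2& m, const Vec2& v)
{
    const __m128 zero = _mm_setzero_ps();

    // Load each half of the matrix into the low two lanes:
    // [m0 m1 0 0] and [m2 m3 0 0].
    const __m128 half0 = _mm_loadl_pi(zero, reinterpret_cast<const __m64*>(m.data()));
    const __m128 half1 = _mm_loadl_pi(zero, reinterpret_cast<const __m64*>(m.data() + 2));

    // Broadcast each vector element to all lanes.
    const __m128 v0 = _mm_set1_ps(v[0]);
    const __m128 v1 = _mm_set1_ps(v[1]);

    // Low two lanes now hold the two results; the upper lanes are ignored.
    const __m128 sum = _mm_add_ps(_mm_mul_ps(half0, v0), _mm_mul_ps(half1, v1));

    Vec2 result;
    _mm_storel_pi(reinterpret_cast<__m64*>(result.data()), sum);
    return result;
}
```

Like the compiler output, this spends two `mulps` on half-empty vectors in exchange for not shuffling the matrix. Whether it beats the plain scalar source is something to measure, since the compiler may already produce equivalent code.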

You might want to check out this post

Godbolt link: https://godbolt.org/z/fcKvchvcb

edited Jan 18 '23 at 17:35

answered Jan 18 '23 at 16:37


Your second version is also using AVX1 SIMD instructions, since you told it `target("sse,avx")`. Note the `vbroadcastss` load which isn't available as a single instruction with SSE; it would need a `movss` load and a `shufps $0, %xmm0, %xmm0` or similar. – Peter Cordes Jan 18 '23 at 17:18
What compiler is this from? I'm guessing clang; it's usually the best at inventing shuffles to auto-vectorize non-looping code. (vs. GCC or MSVC) – Peter Cordes Jan 18 '23 at 17:21
The entire batch is compiled with "-march=x86-64" so the "default" is SSE-only. The "avx,sse" version does use AVX. – Something Something Jan 18 '23 at 17:22
Yes, clang trunk but gcc produces similar instructions. Apparently clang deals better with port pressure in this case though. – Something Something Jan 18 '23 at 17:24
Oh interesting, this avoids any shuffles by just loading the matrix in two halves. Duh, that makes sense. It costs one extra `mulps` but saves a ton of shuffling. I knew none of the ideas I was coming up with in comments under the question seemed that good. This does only do one L,R pair at once; you'd want different asm if you have an array of L and R values. (The compiler might do a good job with the same C++ source, but you'd have to check.) – Peter Cordes Jan 18 '23 at 17:24
https://godbolt.org/z/46MKGfabs – Something Something Jan 18 '23 at 17:26
If the "default" you're talking about isn't one of the two code blocks you show, don't use a : in the text to introduce it like that's what the text is talking about. *While the default implementation uses SSE instructions only: `code`* strongly associates that text with that code block, as a statement about it. – Peter Cordes Jan 18 '23 at 17:26
But it does. The "default" in the comments relate to the "default" in the attribute. – Something Something Jan 18 '23 at 17:28
Also, your first block of asm isn't just AVX, it's using AVX+FMA3, e.g. at least `-march=bdver2` or `-march=haswell`. – Peter Cordes Jan 18 '23 at 17:28
*But it does. The "default" in the comments relate to the "default" in the attribute.* Then like I said, the text is wrong. The code block following (and associated with) the sentence that says is uses only SSE is actually using AVX1 instructions. – Peter Cordes Jan 18 '23 at 17:29
I just noticed those both had [clone .skylake] or [clone .avx]; so you were using function-multiversioning with ifunc dispatching, I think. Seems fixed now with your edit. – Peter Cordes Jan 18 '23 at 17:31
hmm no, both movss, shufps, mulps are all SSE. They have AVX extended versions though but they are not used. – Something Something Jan 18 '23 at 17:32
Yep, it took a while, I made a lot of typos. – Something Something Jan 18 '23 at 17:34
"hmm no, both movss, shufps, mulps are all SSE" - yes, and you only edited your answer to use them instead of `vmovss` / `vmulps` etc. *after* my comments, before posting that comment. Anyway, fixed now like I said (in an edit to the comment before that, which maybe you didn't see until after posting.) – Peter Cordes Jan 18 '23 at 17:37
Don't assume malice. I was fixing things while you were commenting. We just got out of sync. My intention is always to fix the post wrt good comments. – Something Something Jan 18 '23 at 17:38
Godbolt confirms that GCC doesn't do as well as clang, using `vmovshdup` / `sldup` instead of broadcast-loads. Your answer is now claiming that GCC generated the code in your code blocks, when in fact it's clang. (GCC uses `movq` for 64-bit XMM loads, clang uses `movsd` on Godbolt; https://godbolt.org/z/j6eMKfe3x shows both). That would only be true on a Mac with the default setup where `g++` is actually `clang++` in disguise, and even then it's not a useful way to describe what compiler you used. – Peter Cordes Jan 18 '23 at 17:40

score 1 · Answer 2 · answered Jan 19 '23 at 11:09

It's much better to let compilers vectorize over the whole arrays of LLLL and RRRR samples, not for one left, right sample pair at once.

With the same mixing matrix for a whole array of audio samples, you get nice asm with no shuffles. Borrowing code from Nole's answer just to illustrate the auto-vectorization (you might want to simplify the

struct Vector {
    std::array<float,2> coef;
};

struct Matrix {
    std::array<float,4> coef;
};

static Vector multiply(const Matrix& m, const Vector& v) {
    Vector r;
    r.coef[0] = v.coef[0]*m.coef[0] + v.coef[1]*m.coef[2];
    r.coef[1] = v.coef[0]*m.coef[1] + v.coef[1]*m.coef[3];
    return r;
}


// The per-element functions need to inline into the loop,
// so target attributes need to match or be a superset.  
// Or better, just don't use target options on the per-sample function
__attribute__ ((target ("avx,fma")))
void intermix(float *__restrict left, float *__restrict right, const Matrix &m)
{
    for (int i=0 ; i<10240 ; i++){
        Vector v = {left[i], right[i]};
        v = multiply(m, v);
        left[i] = v.coef[0];
        right[i] = v.coef[1];
    }
}

GCC -O3 (without any command-line target options) compiles this to nice AVX1 + FMA code, as requested by the __attribute__((target("avx,fma"))), which is similar to -march=x86-64-v3. (Godbolt)

# GCC (trunk) -O3
intermix(float*, float*, Matrix const&):
        vbroadcastss    (%rdx), %ymm5
        vbroadcastss    8(%rdx), %ymm4
        xorl    %eax, %eax
        vbroadcastss    4(%rdx), %ymm3
        vbroadcastss    12(%rdx), %ymm2       # broadcast each matrix element separately
.L2:
        vmulps  (%rsi,%rax), %ymm4, %ymm1     # a whole vector of 8 R*B
        vmulps  (%rsi,%rax), %ymm2, %ymm0
        vfmadd231ps     (%rdi,%rax), %ymm5, %ymm1
        vfmadd231ps     (%rdi,%rax), %ymm3, %ymm0
        vmovups %ymm1, (%rdi,%rax)
        vmovups %ymm0, (%rsi,%rax)
        addq    $32, %rax
        cmpq    $40960, %rax
        jne     .L2
        vzeroupper
        ret
main:
        movl    $13, %eax
        ret

Note how there are zero shuffle instructions because the matrix coefficients each get broadcast to a separate vector, so one vmulps can do R*B for 8 R samples in parallel, and so on.
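If you'd rather not depend on auto-vectorization (MSVC, for example, may not do as well here), the same broadcast-then-multiply structure can be written by hand. Below is a rough SSE1-only sketch processing 4 samples per iteration. The name `MixPlanarSSE`, the explicit length parameter `n` and passing the coefficients as four floats are my own assumptions for illustration; the coefficient naming follows the question's formula (`left = L*a + R*b`, `right = L*c + R*d`).

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical helper: mixes planar stereo buffers in place.
//   left[i]  = L*a + R*b
//   right[i] = L*c + R*d
inline void MixPlanarSSE(float* left, float* right, std::size_t n,
                         float a, float b, float c, float d)
{
    assert(left != nullptr && right != nullptr);

    // Broadcast each coefficient once, outside the loop.
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vd = _mm_set1_ps(d);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // Load each input once and reuse it for both outputs.
        const __m128 l = _mm_loadu_ps(left + i);
        const __m128 r = _mm_loadu_ps(right + i);
        _mm_storeu_ps(left + i,  _mm_add_ps(_mm_mul_ps(l, va), _mm_mul_ps(r, vb)));
        _mm_storeu_ps(right + i, _mm_add_ps(_mm_mul_ps(l, vc), _mm_mul_ps(r, vd)));
    }

    // Scalar tail for the remaining 0..3 samples.
    for (; i < n; ++i) {
        const float l = left[i];
        const float r = right[i];
        left[i]  = l * a + r * b;
        right[i] = l * c + r * d;
    }
}
```

This uses separate multiply and add rather than FMA so that it runs on any x86-64 CPU. An AVX/FMA variant would follow the same shape with 8 floats per vector.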

Unfortunately GCC and clang both use indexed addressing modes, so the memory source vmulps and vfma instructions un-laminate into 2 uops for the back-end on Intel CPUs. And the stores can't use the port 7 AGU on HSW/SKL. -march=skylake or any other specific Intel SnB-family uarch doesn't fix that for either of them. Clang unrolls by default, so the extra pointer increments to avoid indexed addressing modes would be amortized. (It would actually just be 1 extra add instruction, since we're modifying L and R in-place. You could of course change the function to copy-and-mix.)

If the data is hot in L1d cache, it'll bottleneck on the front-end rather than load+FP throughput, but it still comes relatively close to 2 loads and 2 FMAs per clock.

Hmm, GCC is saving instructions but costing extra loads by loading the same L and R data twice, as memory-source operands for vmulps and vfmadd...ps. With -march=skylake it doesn't do that, instead using a separate vmovups (which has no problem with an indexed addressing mode, but the later store still does.)

I haven't looked at tuning choices from other GCC versions.

# GCC (trunk) -O3 -march=skylake   (which implies -mtune=skylake)
.L2:
        vmovups (%rsi,%rax), %ymm1
        vmovups (%rdi,%rax), %ymm0     # separate loads
        vmulps  %ymm1, %ymm5, %ymm2    # FP instructions using only registers
        vmulps  %ymm1, %ymm3, %ymm1
        vfmadd231ps     %ymm0, %ymm6, %ymm2
        vfmadd132ps     %ymm4, %ymm1, %ymm0
        vmovups %ymm2, (%rdi,%rax)
        vmovups %ymm0, (%rsi,%rax)
        addq    $32, %rax
        cmpq    $40960, %rax
        jne     .L2

This is 10 uops, so it can issue at 2 cycles per iteration on Ice Lake (5-wide front-end) and 2.5 cycles on Skylake (4-wide). On Ice Lake, it will sustain 2x 256-bit mul/FMA per clock cycle.

On Skylake, it doesn't bottleneck on AGU throughput since it's 4 uops for ports 2,3 every 2.5 cycles. So that's fine. No need to avoid indexed addressing modes here.
