
I have a simple loop that takes the product of n complex numbers. As I perform this loop millions of times I want it to be as fast as possible. I understand that it's possible to do this quickly using SSE3 and gcc intrinsics like _mm_addsub_ps, but I'm interested in whether it's possible to get gcc to auto-vectorize code like this, a product of complex numbers:

#include <complex.h>
complex float f(complex float x[], int n ) {
  complex float p = 1.0;
  for (int i = 0; i < n; i++)
    p *= x[i];
  return p;
}
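
For reference, here is roughly what I believe the hand-written SSE3 intrinsics version would look like (an untested sketch; the helper name is my own). It multiplies two interleaved complex floats per 128-bit vector:

#include <pmmintrin.h>  // SSE3

// a = [re0, im0, re1, im1], b likewise; returns the two complex products
static inline __m128 cmul2_sse3(__m128 a, __m128 b) {
  __m128 re = _mm_moveldup_ps(b);   // [c0, c0, c1, c1]
  __m128 im = _mm_movehdup_ps(b);   // [d0, d0, d1, d1]
  __m128 t1 = _mm_mul_ps(a, re);    // [a*c, b*c, ...]
  __m128 sw = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 3, 0, 1)); // [im0, re0, im1, re1]
  __m128 t2 = _mm_mul_ps(sw, im);   // [b*d, a*d, ...]
  return _mm_addsub_ps(t1, t2);     // [ac-bd, bc+ad, ...]
}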

The assembly you get from gcc -S -O3 -ffast-math is:

        .file   "test.c"
        .section        .text.unlikely,"ax",@progbits
.LCOLDB2:
        .text
.LHOTB2:
        .p2align 4,,15
        .globl  f
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
        testl   %esi, %esi
        jle     .L4
        leal    -1(%rsi), %eax
        pxor    %xmm2, %xmm2
        movss   .LC1(%rip), %xmm3
        leaq    8(%rdi,%rax,8), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        movaps  %xmm3, %xmm5
        movaps  %xmm3, %xmm4
        movss   (%rdi), %xmm0
        addq    $8, %rdi
        movss   -4(%rdi), %xmm1
        mulss   %xmm0, %xmm5
        mulss   %xmm1, %xmm4
        cmpq    %rdi, %rax
        mulss   %xmm2, %xmm0
        mulss   %xmm2, %xmm1
        movaps  %xmm5, %xmm3
        movaps  %xmm4, %xmm2
        subss   %xmm1, %xmm3
        addss   %xmm0, %xmm2
        jne     .L3
        movaps  %xmm2, %xmm1
.L2:
        movss   %xmm3, -8(%rsp)
        movss   %xmm1, -4(%rsp)
        movq    -8(%rsp), %xmm0
        ret
.L4:
        movss   .LC1(%rip), %xmm3
        pxor    %xmm1, %xmm1
        jmp     .L2
        .cfi_endproc
.LFE0:
        .size   f, .-f
        .section        .text.unlikely
.LCOLDE2:
        .text
.LHOTE2:
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4
.LC1:
        .long   1065353216
        .ident  "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609"
        .section        .note.GNU-stack,"",@progbits
– Simd

2 Answers


The problem is that the complex type is not SIMD friendly. I have never been a fan of the complex type because it's a composite object that usually does not map to a primitive type or single operation in hardware (certainly not with x86 hardware).

In order to make complex arithmetic SIMD friendly you need to operate on multiple complex numbers simultaneously. For SSE you need to operate on four complex numbers at once.

We can use GCC's vector extensions to make the syntax easier.

typedef float v4sf __attribute__ ((vector_size (16)));

Then we can declare a union of an array and the vector extension:

typedef union {
  v4sf v;
  float e[4];
} float4;

And lastly we define a block of four complex numbers like this

typedef struct {
  float4 x;
  float4 y;
} complex4;

where x is four real parts and y is four imaginary components.
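
If your data starts out as an ordinary interleaved complex float array, a small packing helper (a hypothetical illustration of mine, not part of the scheme above) can deinterleave four elements at a time into this layout:

#include <complex.h>

// Hypothetical helper: deinterleave 4 ordinary complex floats into one complex4.
static complex4 complex4_pack(const complex float c[4]) {
  complex4 r;
  for (int i = 0; i < 4; i++) {
    r.x.e[i] = crealf(c[i]);  // gather the four real parts
    r.y.e[i] = cimagf(c[i]);  // gather the four imaginary parts
  }
  return r;
}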

Once we have this we can multiply 4 complex numbers at once like this

static complex4 complex4_mul(complex4 a, complex4 b) {
  return (complex4){a.x.v*b.x.v - a.y.v*b.y.v, a.y.v*b.x.v + a.x.v*b.y.v};
}

and finally we get to your function modified to operate on four complex numbers at a time.

complex4 f4(complex4 x[], int n) {
  v4sf one = {1,1,1,1};
  complex4 p = {one,one};
  for (int i = 0; i < n; i++) p = complex4_mul(p, x[i]);
  return p;
}
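
One detail not spelled out above: f4 returns four lane-wise partial products, so a short scalar reduction is still needed at the end to combine them into a single complex result. A minimal sketch (the helper name is mine, using I from complex.h):

static complex float complex4_reduce(complex4 p) {
  complex float r = 1.0f;
  for (int i = 0; i < 4; i++)
    r *= p.x.e[i] + p.y.e[i]*I;  // rebuild lane i as an ordinary complex float
  return r;
}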

Let's look at the assembly (Intel syntax) to see if it's optimal

.L3:
    movaps  xmm4, XMMWORD PTR [rsi]
    add     rsi, 32
    movaps  xmm1, XMMWORD PTR -16[rsi]
    cmp     rdx, rsi
    movaps  xmm2, xmm4
    movaps  xmm5, xmm1
    mulps   xmm1, xmm3
    mulps   xmm2, xmm3
    mulps   xmm5, xmm0
    mulps   xmm0, xmm4
    subps   xmm2, xmm5
    addps   xmm0, xmm1
    movaps  xmm3, xmm2
    jne     .L3

That's exactly four 4-wide multiplications, one 4-wide addition, and one 4-wide subtraction. The variable p stays in registers and only the array x is loaded from memory, just as we want.

Let's look at the algebra for the product of complex numbers

(a + bi)*(c + di) = (ac - bd) + (bc + ad)i

That's exactly four multiplications, one addition, and one subtraction.

As I explained in this answer, efficient SIMD is often algebraically identical to the scalar arithmetic. So we have replaced four 1-wide multiplications, one 1-wide addition, and one 1-wide subtraction with four 4-wide multiplications, one 4-wide addition, and one 4-wide subtraction. That's the best you can do with 4-wide SIMD: four for the price of one.

Note that this does not need any instructions beyond SSE, and no additional SSE instructions (except for FMA4) will do any better. So on a 64-bit system you can compile with just -O3.

It is trivial to extend this for 8-wide SIMD with AVX.
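
For example, under the same pattern the 8-wide types and multiply might look like this (a sketch of mine; compile with -mavx):

typedef float v8sf __attribute__ ((vector_size (32)));

typedef struct {
  v8sf x;  // eight real parts
  v8sf y;  // eight imaginary parts
} complex8;

static complex8 complex8_mul(complex8 a, complex8 b) {
  return (complex8){a.x*b.x - a.y*b.y, a.y*b.x + a.x*b.y};
}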

One major advantage of using GCC's vector extensions is that you get FMA without any additional effort. E.g., if you compile with -O3 -mfma4, the main loop is

.L3:
    vmovaps xmm0, XMMWORD PTR 16[rsi]
    add     rsi, 32
    vmulps  xmm1, xmm0, xmm2
    vmulps  xmm0, xmm0, xmm3
    vfmsubps        xmm1, xmm3, XMMWORD PTR -32[rsi], xmm1
    vmovaps xmm3, xmm1
    vfmaddps        xmm2, xmm2, XMMWORD PTR -32[rsi], xmm0
    cmp     rdx, rsi
    jne     .L3
– Z boson
  • Thank you for this. Is it possible to exploit the fact that you only need three multiplications to multiply two complex numbers? See e.g. http://math.stackexchange.com/q/465070/66307 – Simd Jan 03 '17 at 10:15
  • @eleanora, you could do that but I don't think you would want to. It uses one less multiplication but three more additions/subtractions. SSE and AVX multiplication is as fast as addition/subtraction more or less (and on Haswell addition has half the throughput). I think it's also less amenable to FMA operations. A long time ago floating point multiplication was much slower than addition/subtraction. – Z boson Jan 03 '17 at 10:52
  • Thanks again. As soon as I have time in a few days I will time your answer versus what `gcc -march=native -O3 -ffast-math` gives for the code in the question. – Simd Jan 03 '17 at 10:58
  • @eleanora, if you're running over large arrays then your operation is probably memory bandwidth bound anyway. – Z boson Jan 03 '17 at 11:09
  • That's a good point. In practice I am running over arrays of length <40 but just 2^35 times. – Simd Jan 03 '17 at 11:15
  • @eleanora, when you use my method make sure the array lengths are a multiple of 4. You can pad the array with a few extra elements if you need to. – Z boson Jan 03 '17 at 11:46
  • Could this be made even faster by using AVX? I wasn't sure if that was what you meant by "It is trivial to extend this for 8-wide SIMD with AVX. " – Simd Jan 08 '17 at 20:33
  • @eleanora, yes, it could be two times faster with AVX than with SSE if it's not memory bandwidth bound. BTW there is a small error. In `complex4 p = {one,one};` it sets both the real and imaginary component to 1 but in your function you only set the real part to one. Create a zero vector `v4sf zero = {0,0,0,0};` and then do `complex4 p = {one,zero};` to fix it. – Z boson Jan 08 '17 at 23:12
  • Note that FMA4 is AMD-only, from Bulldozer through Excavator, and undocumented but works on Zen1. Normally you'd use FMA3; `-march=haswell` or `-march=bdver2` or later imply `-mfma`. If you do only care about those generations of AMD CPUs, you can sometimes save some `vmovaps` register-copying with FMA4, though. – Peter Cordes Jan 08 '23 at 07:47

I am not an assembly expert, but I have managed the following. I would have commented, but it is too big:

cat test.s
    .file   "test.c"
    .text
    .p2align 4,,15
    .globl  f
    .type   f, @function
f:
.LFB0:
    .cfi_startproc
    testl   %esi, %esi
    jle     .L4
    leal    -1(%rsi), %eax
    pxor    %xmm0, %xmm0
    movss   .LC1(%rip), %xmm1
    leaq    8(%rdi,%rax,8), %rax
    .p2align 4,,10
    .p2align 3
.L3:
    movaps  %xmm1, %xmm4
    movss   (%rdi), %xmm3
    movss   4(%rdi), %xmm2
    mulss   %xmm3, %xmm1
    mulss   %xmm2, %xmm4
    addq    $8, %rdi
    mulss   %xmm0, %xmm2
    cmpq    %rdi, %rax
    mulss   %xmm3, %xmm0
    subss   %xmm2, %xmm1
    addss   %xmm4, %xmm0
    jne     .L3
.L1:
    movss   %xmm1, -8(%rsp)
    movss   %xmm0, -4(%rsp)
    movq    -8(%rsp), %xmm0
    ret
.L4:
    movss   .LC1(%rip), %xmm1
    pxor    %xmm0, %xmm0
    jmp     .L1
    .cfi_endproc
.LFE0:
    .size   f, .-f
    .section        .rodata.cst4,"aM",@progbits,4
    .align 4
.LC1:
    .long   1065353216
    .ident  "GCC: (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005"
    .section        .note.GNU-stack,"",@progbits

My compilation command was gcc -S -O3 -ffast-math -ftree-vectorizer-verbose=3 -ftree-slp-vectorize -ftree-vectorize -msse3 test.c. You do not need all of these flags, as a few of them are enabled at -O3. Refer to https://gcc.gnu.org/projects/tree-ssa/vectorization.html

While I do not have a full answer, I have tried to help. When I specify my CPU architecture (the build target) as well, I get the following:

    .file   "test.c"
    .text
    .p2align 4,,15
    .globl  f
    .type   f, @function
f:
.LFB0:
    .cfi_startproc
    testl   %esi, %esi
    jle     .L4
    vmovss  .LC1(%rip), %xmm1
    leal    -1(%rsi), %eax
    vxorps  %xmm0, %xmm0, %xmm0
    leaq    8(%rdi,%rax,8), %rax
    .p2align 4,,10
    .p2align 3
.L3:
    vmovss  (%rdi), %xmm2
    vmovss  4(%rdi), %xmm3
    addq    $8, %rdi
    vmulss  %xmm3, %xmm0, %xmm4
    vmulss  %xmm2, %xmm0, %xmm0
    vfmadd231ss     %xmm3, %xmm1, %xmm0
    vfmsub132ss     %xmm2, %xmm4, %xmm1
    cmpq    %rdi, %rax
    jne     .L3
.L1:
    vmovss  %xmm1, -8(%rsp)
    vmovss  %xmm0, -4(%rsp)
    vmovq   -8(%rsp), %xmm0
    ret
.L4:
    vmovss  .LC1(%rip), %xmm1
    vxorps  %xmm0, %xmm0, %xmm0
    jmp     .L1
    .cfi_endproc
.LFE0:
    .size   f, .-f
    .section        .rodata.cst4,"aM",@progbits,4
    .align 4
.LC1:
    .long   1065353216
    .ident  "GCC: (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005"
    .section        .note.GNU-stack,"",@progbits

The command now is gcc -S -O3 -ffast-math -msse4 -march=haswell test.c, where haswell is my i7 4770HQ CPU. Refer to this for your CPU.

So, as you can see, AVX instructions come into the picture in the second version.

A sample benchmark for the following code:

$time ./a.out 
0.000000
real    0m0.684s
user    0m0.620s
sys     0m0.060s


#include <stdio.h>
#include <complex.h>
complex float f(complex float x[], long n ) {
  complex float p = 1.0;
  for (long i = 0; i < n; i++)
    p *= x[i];
  return p;
}

int main()
{
    static complex float x[200000000] = {0.0, 1.0, 2.0, 4.0, 5.0, 6.0};
    complex float p = f(x, 200000000);

    printf("%f", creal(p));

    return 0;
}

The array is static, so most of it sits in the binary on disk (an SSD in my case). You can allocate it in memory instead for even faster processing. This is 200M iterations. The binary is 1.5 GB, so most of the time is I/O. The CPU blazes through it even without -msse3 and -march; all you really need is -ffast-math, and that makes a big difference.
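
As a sketch of the allocate-it-in-memory alternative (my own illustration, not part of the original test): calloc builds the mostly-zero array at run time, keeping the 1.5 GB out of the binary and off the disk.

#include <stdio.h>
#include <stdlib.h>
#include <complex.h>

complex float f(complex float x[], long n);  /* same function as above */

int main(void)
{
    long n = 200000000;
    complex float *x = calloc(n, sizeof *x);  /* zero-filled at run time */
    if (!x) return 1;
    x[1] = 1.0f; x[2] = 2.0f; x[3] = 4.0f; x[4] = 5.0f; x[5] = 6.0f;
    printf("%f\n", crealf(f(x, n)));
    free(x);
    return 0;
}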

I changed the program to the following:

#include <stdio.h>
#include <complex.h>
float f(float x[], long n ) {
    float p = 1.0;
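    // note: the loop bound is the constant 8 (the parameter n is unused),
    // so gcc can see the trip count at compile time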
    for (long i = 0; i < 8; i++) {
        p = p * x[i];
    }
    return p;
}

int main() {
    float x[8] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};

    printf("%f\n", f(x, 8));

    return 0;
}

and compiled with gcc -S -O3 -ffast-math -msse3 -mfpmath=sse -mavx -march=haswell test.c, which results in:

f:
.LFB23:
    .cfi_startproc
    vmovups (%rdi), %ymm2
    vxorps  %xmm1, %xmm1, %xmm1
    vperm2f128      $33, %ymm1, %ymm2, %ymm0
    vmulps  %ymm2, %ymm0, %ymm0
    vperm2f128      $33, %ymm1, %ymm0, %ymm2
    vshufps $78, %ymm2, %ymm0, %ymm2
    vmulps  %ymm2, %ymm0, %ymm0
    vperm2f128      $33, %ymm1, %ymm0, %ymm1
    vpalignr        $4, %ymm0, %ymm1, %ymm1
    vmulps  %ymm1, %ymm0, %ymm0
    vzeroupper
    ret
    .cfi_endproc

So it appears to me that to force gcc to use SSE3 you need to code in a certain way. http://sci.tuomastonteri.fi/programming/sse will be useful to you.

Final notes: If you experiment with different values of the upper limit for i, you will see that different instructions are produced. I think the reason for this is that gcc cannot evaluate a variable trip count at compile time, so you might want to use C++ templates, which are capable of compile-time calculation, and do it that way.

– Shiv
  • The output with -march=haswell is interesting. I believe the canonical SSE3 code for this problem should use mulps, shufps and addsubps. – Simd Jan 01 '17 at 17:50
  • These were introduced in the Haswell microarchitecture. https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions – Shiv Jan 01 '17 at 17:51
  • @eleanora you have what you wanted. – Shiv Jan 01 '17 at 18:26
  • If the comparison to i in the for loop is against a variable then it will not produce SSE3 instructions. I do not know why. Also, if it is not a multiple of 4, it will produce AVX instructions and not SSE3. Hope it helps. – Shiv Jan 01 '17 at 18:28
  • @eleanora I am sorry but I am a noob and took such a long time. An expert would have told you straight up about it. – Shiv Jan 01 '17 at 18:29
  • I don't see any evidence of vectorization in the assembly you posted. It's all scalar arithmetic. – Z boson Jan 03 '17 at 09:59
  • vmulps is an instruction from the SSE3 instruction set. – Shiv Jan 03 '17 at 10:13
  • `vmulps` is the AVX VEX encoded version of the SSE instruction `mulps` (not even SSE2). The vectorized code at the end of your answer has nothing to do with `complex`. It's simply `float` which is trivial to vectorize i.e. it does not address the OP's question which was about vectorizing `complex`. – Z boson Jan 04 '17 at 08:19
  • `addsubps` is SSE3. `vaddsubps` is AVX1, new in Sandybridge. GCC's bad auto-vectorization of the `float` reduction with `vpalignr ymm` uses AVX2, new in Haswell. It's not `complex float`, though. FMA instructions like `vfmsubss` (scalar mul-sub) were new in Haswell, and are quite useful for scalar complex multiply. – Peter Cordes Jan 08 '23 at 07:43