Vectorization of sin and cos

Question

I was playing around with Compiler Explorer and ran into an anomaly (I think). If I want to make the compiler vectorize a sin calculation using libmvec, I would write:

#include <cmath>

#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;

inline T s(const T x)
{
  return sinf(x);
}

void func(AT* __restrict x, AT* __restrict y, int length)
{
  if (length & NN-1) __builtin_unreachable();
  for (int i = 0; i < length; i++)
  {
    y[i] = s(x[i]);
  }
}

compile with gcc 6.2 and -O3 -march=native -ffast-math and get

func(float*, float*, int):
        testl   %edx, %edx
        jle     .L10
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r14
        xorl    %r14d, %r14d
        pushq   %r13
        leal    -8(%rdx), %r13d
        pushq   %r12
        shrl    $3, %r13d
        movq    %rsi, %r12
        pushq   %r10
        addl    $1, %r13d
        pushq   %rbx
        movq    %rdi, %rbx
        subq    $8, %rsp
.L4:
        vmovaps (%rbx), %ymm0
        addl    $1, %r14d
        addq    $32, %r12
        addq    $32, %rbx
        call    _ZGVcN8v_sinf      // YAY! Vectorized trig!
        vmovaps %ymm0, -32(%r12)
        cmpl    %r13d, %r14d
        jb      .L4
        vzeroupper
        addq    $8, %rsp
        popq    %rbx
        popq    %r10
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %rbp
        leaq    -8(%r10), %rsp
.L10:
        ret

But when I add a cosine to the function, there is no vectorization:

#include <cmath>

#define NN 512
typedef float T;
typedef T __attribute__((aligned(NN))) AT;

inline T f(const T x)
{
  return cosf(x)+sinf(x);
}

void func(AT* __restrict x, AT* __restrict y, int length)
{
  if (length & NN-1) __builtin_unreachable();
  for (int i = 0; i < length; i++)
  {
    y[i] = f(x[i]);
  }
}

which gives:

func(float*, float*, int):
        testl   %edx, %edx
        jle     .L10
        pushq   %r12
        leal    -1(%rdx), %eax
        pushq   %rbp
        leaq    4(%rdi,%rax,4), %r12
        movq    %rsi, %rbp
        pushq   %rbx
        movq    %rdi, %rbx
        subq    $16, %rsp
.L4:
        vmovss  (%rbx), %xmm0
        leaq    8(%rsp), %rsi
        addq    $4, %rbx
        addq    $4, %rbp
        leaq    12(%rsp), %rdi
        call    sincosf               // No vectorization
        vmovss  12(%rsp), %xmm0
        vaddss  8(%rsp), %xmm0, %xmm0
        vmovss  %xmm0, -4(%rbp)
        cmpq    %rbx, %r12
        jne     .L4
        addq    $16, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
.L10:
        ret

I see two good alternatives. Either call a vectorized version of sincosf or call the vectorized sin and cos sequentially. I tried adding -fno-builtin-sincos to no avail. -fopt-info-vec-missed complains about complex float, which there is none.

Is this a known issue with gcc? Either way, is there a way I can convince gcc to vectorize the latter example?

(As an aside, is there any way to get gcc < 6 to vectorize trigonometric functions automatically?)

If you have the function for sin, could you not write the one for cos by changing x to pi/2 -x, expanding pi/2 to a vector if needed? — Jeremy Kahan, Sep 20 '16 at 11:39
Also if you are actually after sin plus cos, you could use sqrt(2)sin(x plus pi/4) — Jeremy Kahan, Sep 20 '16 at 11:42
I'm not actually interested in sin+cos. I am kind of interested in exp(complex). But the question was meant to be more general than a single use case. In the end, I would like to be able to rely on multiple compilers (nvcc, gcc, cl) to compile the same code efficiently without writing the optimizations manually. — Avi Ginsburg, Sep 20 '16 at 11:47
@JeremyKahan And yes, I could write `sin(x+c)`, I just don't want to. The code remains easier to read when we say what we mean. In the end, that's the real crux of the issue. — Avi Ginsburg, Sep 20 '16 at 11:54
hello, I always wondered what was the name of "__" in code, need to know the name to search for that. — Giorgi Gvimradze, Apr 23 '19 at 19:29
how do you know that `_ZGVcN8v_sinf` is a vectorized function and `sincosf` is not? — phuclv, May 03 '19 at 16:46
@phuclv: because `sincosf` is the standard libm function, and `_ZGVcN8v_sinf` is how libmvec mangles SIMD type names: https://sourceware.org/glibc/wiki/libmvec mentions `_ZGVdN4v_cos` as AVX2 `cos()` (of 4x `double`). — Peter Cordes, May 03 '19 at 19:14
@phuclv: more obviously, look at the surrounding asm using `ymm0` as a return value vs. the low element of `xmm0`. Anyway yeah, still a missed optimization with current gcc9.1. Using either `sinf()` or `cosf()` alone vectorizes, so it would certainly be possible to spill a vector of `sinf` results across a call to a vectorized `cosf`. IDK if maybe there's no vectorized `sincosf` and that optimization (of combining sin and cos into sincos) happens first, defeating vectorization. — Peter Cordes, May 04 '19 at 03:27
And `sincosf` returning by reference is a stupid limitation imposed by C only allowing functions to return 1 value; a vectorized `sincosf` (if/when one exists) should hopefully just return in ymm0 and ymm1. There should be a scalar one that returns in registers, too. ISO C's `sincos()`'s design of returning *both* values by reference is also dumb; at least return one of them by value. — Peter Cordes, May 04 '19 at 03:29
ICC vectorizes this with SVML `__svml_sincosf8_l9`. https://gcc.godbolt.org/z/lvKJsJ. IIRC, you can tell gcc that SVML is available, if you have it installed. Clang vectorizes around unpacking to scalar for `sincosf`, which is terrible. (And also doesn't properly vectorize plain `sinf()`, so maybe it needs options to tell it that libmvec is available.) Probably worth investigating. — Peter Cordes, May 04 '19 at 03:34
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70901 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40770) — Marc Glisse, May 04 '19 at 21:22

Vectorization of sin and cos

0 Answers0

Linked