5

I'm trying to write a C++ program that calls a function I wrote in x64 assembly. I'd like to speed things up a little (and play with CPU features), so I chose to use vector operations.

The problem is, I have to multiply sines by an integer, so I have to calculate the sines first. Is it possible to do this in SSE/AVX? I'm aware of the fsin instruction, but not only is it an x87 FPU instruction, it also calculates only one sine at a time. So I'd have to push the argument onto the FPU stack, call fsin, pop the result from the FPU to memory, and then load it into an AVX register. It seems to me it's not worth the hassle.

Peter Cordes
Chmielok

4 Answers

4

Yes, there is a vector version using SSE/AVX! The catch is that the Intel C++ compiler must be used.

This is called the Intel Short Vector Math Library (SVML), exposed as intrinsics:

for 128-bit SSE (double precision): _mm_sin_pd

for 256-bit AVX (double precision): _mm256_sin_pd

The two intrinsics are actually small functions consisting of hand-written SSE/AVX assembly, and with AVX you can process 4 sine calculations at once. :=) The latency is about ~10 clock cycles (if I remember correctly) on a Haswell CPU.

By the way, the CPU needs to execute about 100 such intrinsics to warm up and reach its peak performance; if only a few sines need to be evaluated, it's better to use plain sin() instead.
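
A minimal sketch of how these intrinsics are used (assuming the Intel compiler, which declares the SVML intrinsics in immintrin.h; the loop structure and the integer scale factor are just an illustration of the question's use case):

    #include <immintrin.h>

    // Multiply the sine of each element by an integer factor, 4 doubles per iteration.
    // Assumes n is a multiple of 4 and the pointers are 32-byte aligned.
    void scaled_sines_svml(const double* in, double* out, int n, int factor)
    {
        const __m256d scale = _mm256_set1_pd(static_cast<double>(factor));
        for (int i = 0; i < n; i += 4)
        {
            __m256d x = _mm256_load_pd(in + i);
            __m256d s = _mm256_sin_pd(x);                      // SVML: 4 sines at once
            _mm256_store_pd(out + i, _mm256_mul_pd(s, scale));
        }
    }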

Good luck!!

PhD AP EcE
2

Since the vectorized sin/cos functions are required by OpenMP 4.0, gcc/glibc offers them in libmvec as well.

For a list of other SVML alternatives, see https://stackoverflow.com/a/36637424.
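
A minimal sketch of what that looks like in practice (assuming gcc on x86-64 with a libmvec-enabled glibc; the exact flags needed vary by gcc/glibc version, so check the generated assembly for calls such as _ZGVdN4v_sin):

    #include <math.h>

    // Compiled with something like: g++ -O3 -ffast-math -march=haswell
    // gcc can then vectorize the sin() calls below into libmvec's SIMD entry points.
    void scaled_sines_mvec(const double* in, double* out, int n, int factor)
    {
        for (int i = 0; i < n; ++i)
            out[i] = factor * sin(in[i]);
    }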

Mingye Wang
1

There is no sine instruction in SSE/AVX; however, depending on the precision you require, you can write an approximation to the sine function either as a polynomial using a Taylor/Madhava series or as the quotient of two polynomials using a Padé approximant. There are of course many more polynomial approximation techniques.

Whether this yields the precision you want, and how fast this method is, depends on your exact problem. Generally speaking, polynomial approximation is very fast, as one can evaluate an n-th degree polynomial using n FMA instructions (the Padé approximant also requires one division) by writing it in Horner form:

a+x*(b+x*(c+x*(...))).

However, sines are notoriously ill-behaved when approximated using polynomials, so the use cases are limited.
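
As an illustration of this Horner/FMA pattern (not a production-quality sine; the coefficients below are just the first Taylor terms of sin(x) and are only reasonable very close to zero), an AVX2+FMA evaluation could look like this:

    #include <immintrin.h>

    // p(x) = x * (c0 + x^2*(c1 + x^2*c2)) evaluated with FMA, 8 floats per call.
    // Taylor coefficients of sin(x): 1, -1/6, 1/120.
    static inline __m256 sin_taylor_avx2(__m256 x)
    {
        const __m256 c0 = _mm256_set1_ps( 1.0f);
        const __m256 c1 = _mm256_set1_ps(-1.0f / 6.0f);
        const __m256 c2 = _mm256_set1_ps( 1.0f / 120.0f);
        __m256 x2 = _mm256_mul_ps(x, x);
        __m256 p  = _mm256_fmadd_ps(c2, x2, c1);   // c1 + x^2*c2
        p = _mm256_fmadd_ps(p, x2, c0);            // c0 + x^2*(c1 + x^2*c2)
        return _mm256_mul_ps(p, x);                // x * p(x^2)
    }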

Deleted
  • Please check intel SVML – PhD AP EcE Jan 19 '15 at 08:36
  • 2
    Yes, SVML is one way of doing things, but it is just a library masquerading as an instruction set. I decided to answer the question based on the assumption that the OP wanted to solve this problem using just assembly instructions, be it for fun or educational purposes :) – Deleted Jan 19 '15 at 14:49
  • 1
    Yes, I agree that educational purposes is fun and important, but more importantly, the author is looking for a solution: "Is it possible to get multiple sines in AVX/SSE?". – PhD AP EcE Jan 19 '15 at 16:27
  • But I agree that there is no direct AVX hardware implementation of sin(). SVML provides a way to utilize AVX instructions to parallel process many common math functions in a SIMD fashion. =) – PhD AP EcE Jan 19 '15 at 16:36
  • "sines are notoriously ill behaved when approximated using polynomials" any source? Many libraries do just that, e.g. https://github.com/microsoft/DirectXMath/blob/549b51d9cc328903739d5fbe7f348aec00c83339/Inc/DirectXMathMisc.inl#L2170-L2200 – Soonts Dec 10 '19 at 07:43
1

Sample cosine approximation:

  • 0.15 ULP average error (1 ULP max)
  • 12x speedup with AVX512, 8-9x with AVX2, for the cosine approximation over the [-1,1] range

The polynomial coefficients were found by a genetic algorithm. The chain of multiplications looks like Chebyshev polynomials, but it is not; Chebyshev polynomials are only needed when you need a wider input range than just [-1,1].

    // only optimized for [-1,1] input range!!
    template<typename Type, int Simd>
    inline
    void cosFast(
       Type * const __restrict__ data,
       Type * const __restrict__ result) noexcept
    {
        alignas(64)
        Type xSqr[Simd];

        alignas(64)
        Type xSqrSqr[Simd];

        alignas(64)
        Type xSqrSqrSqr[Simd];

        alignas(64)
        Type xSqrSqrSqrSqr[Simd];

        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            xSqr[i] =   data[i]*data[i];
        }

        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            xSqrSqr[i] =    xSqr[i]*xSqr[i];
        }

        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            xSqrSqrSqr[i] =     xSqrSqr[i]*xSqr[i];
        }


        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            xSqrSqrSqrSqr[i] =  xSqrSqr[i]*xSqrSqr[i];
        }

        #pragma GCC ivdep
        for(int i=0;i<Simd;i++)
        {
            result[i] =     Type(2.37711074060342753000441e-05)*xSqrSqrSqrSqr[i] +
                                Type(-0.001387712893937020908197155)*xSqrSqrSqr[i] +
                                Type(0.04166611039514833692010143)*xSqrSqr[i] +
                                Type(-0.4999998698566363586337502)*xSqr[i] +
                                Type(0.9999999941252593060880827);
        }
    }
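
A hypothetical call site (the float type, the SIMD width of 8, and the test values are just for illustration; inputs must already be within [-1,1]):

    alignas(64) float x[8], c[8];
    for (int i = 0; i < 8; i++)
        x[i] = -1.0f + 0.25f * i;   // sample inputs in [-1,1]
    cosFast<float, 8>(x, c);        // c[i] ~= cos(x[i])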

If you need to use a wider range, then you should do range reduction at high precision and use something like this (a sketch of the last step follows the list):

  • range-reduce to [-pi, pi] at high precision
  • divide by 4 to put the argument in the [-1, 1] range
  • compute the same series as above => tmp
  • compute the Chebyshev polynomial T_4(tmp) = result
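
A minimal sketch of that recombination step, assuming the identity cos(x) = T_4(cos(x/4)) with T_4(y) = 8y^4 - 8y^2 + 1 (shown scalar for clarity; the vector version is the same arithmetic per lane):

    // Recombination: given y ~= cos(x/4) from the series above,
    // T_4(y) = 8y^4 - 8y^2 + 1 yields cos(x), because cos(x) = T_4(cos(x/4)).
    inline float chebyshevT4(float y)
    {
        float y2 = y * y;
        return 8.0f * y2 * y2 - 8.0f * y2 + 1.0f;
    }

For reference, the assembly below is the kind of AVX2+FMA inner loop GCC generates for the [-1,1] polynomial, processing two vectors per iteration: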

 .L29:
        vmovups ymm7, YMMWORD PTR [r14+rax]
        vmulps  ymm1, ymm7, ymm7
        vmovups ymm7, YMMWORD PTR [r14+32+rax]
        vmulps  ymm3, ymm1, ymm1
        vmulps  ymm6, ymm3, ymm3
        vmulps  ymm6, ymm6, YMMWORD PTR .LC11[rip]
        vmulps  ymm0, ymm7, ymm7
        vmulps  ymm5, ymm3, ymm1
        vfmadd132ps     ymm5, ymm6, YMMWORD PTR .LC12[rip]
        vmulps  ymm2, ymm0, ymm0
        vmulps  ymm6, ymm2, ymm2
        vmulps  ymm6, ymm6, YMMWORD PTR .LC11[rip]
        vfmadd132ps     ymm3, ymm5, YMMWORD PTR .LC13[rip]
        vmulps  ymm4, ymm2, ymm0
        vfmadd132ps     ymm4, ymm6, YMMWORD PTR .LC12[rip]
        vfmadd132ps     ymm1, ymm3, YMMWORD PTR .LC14[rip]
        vfmadd132ps     ymm2, ymm4, YMMWORD PTR .LC13[rip]
        vaddps  ymm1, ymm1, YMMWORD PTR .LC15[rip]
        vfmadd132ps     ymm0, ymm2, YMMWORD PTR .LC14[rip]
        vaddps  ymm0, ymm0, YMMWORD PTR .LC15[rip]
        vmovups YMMWORD PTR [r15+rax], ymm1
        vmovups YMMWORD PTR [r15+32+rax], ymm0
        add     rax, 64
        cmp     rax, 16777216
        jne     .L29
huseyin tugrul buyukisik