0

I programmed a function that rotates four 3d points around it own arbitrary 3d axis using SIMD. It works but the performance is terrible! Only 15% better than calling a function four times that rotates only one point at a time. I was hoping for 200%-300% improvement. One issue is, the compiler doesn't recognize the SIMD trig functions. Even if I was able to use them I don't know if my CPU's supports them (Ryzen 3 1200 and Ryzen 7 3700).
I'm hoping to use SIMD to calculate elliptical orbits since there's a lot of trig involved. I've been searching online but really can't find anything helpful. Here's the code.

vec12d MathFunction::Rotate_SIMD(const vec4d& angle, const vec12d& axis, const vec12d& point)
{
    vec4d get_cosine = vec4d(cos(angle.v[0]), cos(angle.v[1]), cos(angle.v[2]), cos(angle.v[3]));
    vec4d get_sine = vec4d(sin(angle.v[0]), sin(angle.v[1]), sin(angle.v[2]), sin(angle.v[3]));

    __m256d cosTheta = _mm256_loadu_pd(get_cosine.v);
    __m256d sinTheta = _mm256_loadu_pd(get_sine.v);
    __m256d one = _mm256_set1_pd(1.0);
    __m256d minus_cosTheta = _mm256_sub_pd(one, cosTheta);

    __m256d Xaxis = _mm256_loadu_pd(axis.v);
    __m256d Yaxis = _mm256_loadu_pd(&axis.v[4]);
    __m256d Zaxis = _mm256_loadu_pd(&axis.v[8]);

    __m256d Xpoint = _mm256_loadu_pd(point.v);
    __m256d Ypoint = _mm256_loadu_pd(&point.v[4]);
    __m256d Zpoint = _mm256_loadu_pd(&point.v[8]);

    // Get new Xaxis points
    __m256d Xnew = _mm256_mul_pd(Xaxis, Xaxis);
    Xnew = _mm256_mul_pd(Xnew, minus_cosTheta);
    Xnew = _mm256_add_pd(Xnew, cosTheta);
    Xnew = _mm256_mul_pd(Xnew, Xpoint);

    __m256d temp = _mm256_mul_pd(minus_cosTheta, Xaxis);
    temp = _mm256_mul_pd(temp, Yaxis);
    __m256d temp2 = _mm256_mul_pd(Zaxis, sinTheta);
    temp = _mm256_sub_pd(temp, temp2);
    temp = _mm256_mul_pd(temp, Ypoint);
    Xnew = _mm256_add_pd(Xnew, temp);

    temp = _mm256_mul_pd(minus_cosTheta, Xaxis);
    temp = _mm256_mul_pd(temp, Zaxis);
    temp2 = _mm256_mul_pd(Yaxis, sinTheta);
    temp = _mm256_add_pd(temp, temp2);
    temp = _mm256_mul_pd(temp, Zpoint);
    Xnew = _mm256_add_pd(Xnew, temp);

    // Get new Yaxis points
    __m256d Ynew = _mm256_mul_pd(minus_cosTheta, Xaxis);
    Ynew = _mm256_mul_pd(Ynew, Yaxis);
    temp2 = _mm256_mul_pd(Zaxis, sinTheta);
    Ynew = _mm256_add_pd(Ynew, temp2);
    Ynew = _mm256_mul_pd(Ynew, Xpoint);

    temp = _mm256_mul_pd(Yaxis, Yaxis);
    temp = _mm256_mul_pd(temp, minus_cosTheta);
    temp = _mm256_add_pd(temp, cosTheta);
    temp = _mm256_mul_pd(temp, Ypoint);
    Ynew = _mm256_add_pd(Ynew, temp);

    temp = _mm256_mul_pd(minus_cosTheta, Yaxis);
    temp = _mm256_mul_pd(temp, Zaxis);
    temp2 = _mm256_mul_pd(Xaxis, sinTheta);
    temp = _mm256_sub_pd(temp, temp2);
    temp = _mm256_mul_pd(temp, Zpoint);
    Ynew = _mm256_add_pd(Ynew, temp);

    // Get new Zaxis points
    __m256d Znew = _mm256_mul_pd(minus_cosTheta, Xaxis);
    Znew = _mm256_mul_pd(Znew, Zaxis);
    temp2 = _mm256_mul_pd(Xaxis, sinTheta);
    Znew = _mm256_sub_pd(Ynew, temp2);
    Znew = _mm256_mul_pd(Ynew, Xpoint);

    temp = _mm256_mul_pd(minus_cosTheta, Yaxis);
    temp = _mm256_mul_pd(temp, Zaxis);
    temp2 = _mm256_mul_pd(Xaxis, sinTheta);
    temp = _mm256_add_pd(temp, temp2);
    temp = _mm256_mul_pd(temp, Ypoint);
    Znew = _mm256_add_pd(Znew, temp);

    temp = _mm256_mul_pd(minus_cosTheta, Zaxis);
    temp = _mm256_mul_pd(temp, Zaxis);
    temp = _mm256_add_pd(temp, cosTheta);
    temp = _mm256_mul_pd(temp, Zpoint);
    Znew = _mm256_add_pd(Znew, temp);

    vec12d Npoint __attribute__ ((aligned(32)));  // Four points rotated goes here.
    _mm256_store_pd(Npoint.v, Xnew);
    _mm256_store_pd(&Npoint.v[4], Ynew);
    _mm256_store_pd(&Npoint.v[8], Znew);

    return Npoint;
}
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Joel P
  • 9
  • 2
  • If the program is working, that must mean that your compiler recognized the SIMD trig functions and your CPU supports them, no? Otherwise it wouldn't have compiled or ran? – Jeremy Friesner Nov 01 '21 at 01:27
  • 1
    If you are referring to intrinsics like `_mm256_sincos_pd()`, those are part of *Intel*'s SVML (short vector math library) and marked as such in Intel's SIMD Intrinsic Guide. If you use Intel's toolchain, they will be available. – njuffa Nov 01 '21 at 01:32
  • @JeremyFriesner: This code *isn't* using SIMD trig functions, it's breaking out the elements for scalar `cos` and `sin` (separately, not even using `sincos`) and putting them back together with a wrapper for `_mm256_setr_pd`. That's why it's slow. Unlike the usual questions like this, they aren't usi a SIMD math function they found in Intel's intrinsics guide (and getting a build error from it because that includes SVML library functions as well as real intrinsics). – Peter Cordes Nov 01 '21 at 01:32
  • Old pre-AVX duplicate with some library links: [Vectorized Trig functions in C?](https://stackoverflow.com/q/5109864) Newer Q&A about *auto*-vectorizing `sin()`: [Vectorization of sin and cos](https://stackoverflow.com/q/39591020) – Peter Cordes Nov 01 '21 at 01:35
  • Yeah, I'm referring to the _mm256_sincos_pd(). My compiler didn't recognize those so I'm just using the standard math functions. I'm not familiar with Intel toolchain. Can I use it with the gcc compiler? – Joel P Nov 01 '21 at 01:36
  • [Is it possible to get multiple sines in AVX/SSE?](https://stackoverflow.com/q/27933355) has some info about using GCC's libmvec functions. I think it is also possible to install and use Intel SVML with GCC if you have a copy of the binary library. – Peter Cordes Nov 01 '21 at 01:38
  • Thanks for help. I found a SIMD library on Github that has trig functions. Now I'm getting a 5x speedup. Their trig functions are faster than the scalar ones and almost as accurate. – Joel P Nov 02 '21 at 20:12
  • CTRL+F in this document and search for 'svm' (without quotes) and you'll get the infro you need to use SVML on gcc: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html – dave_thenerd Nov 15 '21 at 02:10

0 Answers0