8

I'm looking to calculate highly parallelized trig functions (in block of like 1024), and I'd like to take advantage of at least some of the parallelism that modern architectures have.

When I compile a block

for(int i=0; i<SIZE; i++) {
   arr[i]=sin((float)i/1024);
}

GCC won't vectorize it, and says

not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);

Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.

With just a simple taylor series up the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers, or with 9th order series, only a single bit off for the last two out of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mentions of any libraries or the like.

A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?

EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.

Kevin Reid
  • 37,492
  • 13
  • 80
  • 108
Jeremy Salwen
  • 8,061
  • 5
  • 50
  • 73

5 Answers5

4

Since you said you were using GCC it looks like there are some options:

That said, I'd probably look into GPGPU for a solution. Maybe writing it in CUDA or OpenCL (If I remember correctly CUDA supports the sine function). Here are some libraries that look like they might make it easier.

Joe
  • 3,059
  • 2
  • 22
  • 28
  • 1
    OpenCL provides vector overloads for all of the math functions. I believe that CUDA does as well. – Stephen Canon Feb 24 '11 at 21:32
  • CUDA / OpenCL solution will likely perform worse due to the latency of sending data to/from the PCI-e link. You could check/estimate this with a back of an envelope calculation, but 1024 floats is only 4k of data. Not a lot. The overhead of calling the function to send data to / get data back out from the GPU will likely take longer than transfering this data, or having the CPU process it. If I'm not correct feel free to show a few lines of arithmatic explaining why. – FreelanceConsultant Jun 06 '22 at 08:51
2

Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.

Seth
  • 2,683
  • 18
  • 18
  • Nice, but I'd like something a little more portable than assembly. – Jeremy Salwen Feb 24 '11 at 20:40
  • In that case, I would highly recommend NETLIB. You can find a good implementation in AMD's framewave library: http://framewave.svn.sourceforge.net/viewvc/framewave/trunk/Framewave/domain/common/include/Trigonometric_NETLIB.h?revision=HEAD&view=markup – Seth Feb 24 '11 at 20:48
1

What platform are you using? Many libraries of this sort already exist:

  • Intel's provides the Vector Math Library (VML) with icc.
  • Apple provides the vForce library as part of the Accelerate framework.
  • HP provides their own Vector Math Library for Itanium (and may other architectures, too).
  • Sun provided libmvec with their compiler tools.
  • ...
Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
  • I am on Debian/Linux x86_64, GCC. Ideally I'd like something that works on most of the platforms GCC supports vectorization for. – Jeremy Salwen Feb 24 '11 at 20:29
1

Instead of the taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • fdlibm just uses a polynomial approximation: http://www.netlib.org/fdlibm/k_sin.c – Jeremy Salwen Mar 24 '11 at 00:57
  • 2
    @JeremySalwen … but the coefficients of these polynomials are not taken from the Taylor series; they were chosen using more subtle methods: http://en.wikipedia.org/wiki/Minimax_approximation_algorithm . And if you look at k_cos.c, which is harder than k_sin.c because larger arguments makes one compute with large ULPs for a small result expected to be precise to a small ULP, you'll notice an interesting compensation technique, which makes vectorization trickier. – Pascal Cuoq Sep 09 '13 at 13:50
0

My answer was to create my own library to do exactly this called vectrig: https://github.com/jeremysalwen/vectrig

Jeremy Salwen
  • 8,061
  • 5
  • 50
  • 73