Vectorized Trig functions in C?

Question

I'm looking to calculate highly parallelized trig functions (in block of like 1024), and I'd like to take advantage of at least some of the parallelism that modern architectures have.

When I compile a block

for(int i=0; i<SIZE; i++) {
   arr[i]=sin((float)i/1024);
}

GCC won't vectorize it, and says

not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);

Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.

With just a simple taylor series up the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers, or with 9th order series, only a single bit off for the last two out of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mentions of any libraries or the like.

A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?

EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.

score 4 · Answer 1 · answered Feb 24 '11 at 20:36

4

Since you said you were using GCC it looks like there are some options:

http://gruntthepeon.free.fr/ssemath/
- This uses SSE and SSE2 instructions to implement it.
http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php
- This has an alternate implementation. Some of the comments are pretty good.

That said, I'd probably look into GPGPU for a solution. Maybe writing it in CUDA or OpenCL (If I remember correctly CUDA supports the sine function). Here are some libraries that look like they might make it easier.

answered Feb 24 '11 at 20:36

Joe

3,059
2
22
28

1

OpenCL provides vector overloads for all of the math functions. I believe that CUDA does as well. – Stephen Canon Feb 24 '11 at 21:32
CUDA / OpenCL solution will likely perform worse due to the latency of sending data to/from the PCI-e link. You could check/estimate this with a back of an envelope calculation, but 1024 floats is only 4k of data. Not a lot. The overhead of calling the function to send data to / get data back out from the GPU will likely take longer than transfering this data, or having the CPU process it. If I'm not correct feel free to show a few lines of arithmatic explaining why. – FreelanceConsultant Jun 06 '22 at 08:51

score 2 · Answer 2 · answered Feb 24 '11 at 20:28

2

Since you are looking to calculate harmonics here, I have some code that addressed a similar problem. It is vectorized already and faster than anything else I have found. As a side benefit, you get the cosine for free.

answered Feb 24 '11 at 20:28

Seth

2,683
18
18

Nice, but I'd like something a little more portable than assembly. – Jeremy Salwen Feb 24 '11 at 20:40
In that case, I would highly recommend NETLIB. You can find a good implementation in AMD's framewave library: http://framewave.svn.sourceforge.net/viewvc/framewave/trunk/Framewave/domain/common/include/Trigonometric_NETLIB.h?revision=HEAD&view=markup – Seth Feb 24 '11 at 20:48

score 1 · Answer 3 · answered Feb 24 '11 at 20:24

1

What platform are you using? Many libraries of this sort already exist:

Intel's provides the Vector Math Library (VML) with icc.
Apple provides the vForce library as part of the Accelerate framework.
HP provides their own Vector Math Library for Itanium (and may other architectures, too).
Sun provided libmvec with their compiler tools.
...

answered Feb 24 '11 at 20:24

Stephen Canon

103,815
19
183
269

I am on Debian/Linux x86_64, GCC. Ideally I'd like something that works on most of the platforms GCC supports vectorization for. – Jeremy Salwen Feb 24 '11 at 20:29

score 1 · Answer 4 · answered Feb 24 '11 at 21:20

1

Instead of the taylor series, I would look at the algorithms fdlibm uses. They should get you as much precision with fewer steps.

answered Feb 24 '11 at 21:20

R.. GitHub STOP HELPING ICE

208,859
35
376
711

fdlibm just uses a polynomial approximation: http://www.netlib.org/fdlibm/k_sin.c – Jeremy Salwen Mar 24 '11 at 00:57
2

@JeremySalwen … but the coefficients of these polynomials are not taken from the Taylor series; they were chosen using more subtle methods: http://en.wikipedia.org/wiki/Minimax_approximation_algorithm . And if you look at k_cos.c, which is harder than k_sin.c because larger arguments makes one compute with large ULPs for a small result expected to be precise to a small ULP, you'll notice an interesting compensation technique, which makes vectorization trickier. – Pascal Cuoq Sep 09 '13 at 13:50

score 0 · Accepted Answer · answered Jun 08 '11 at 20:52

0

My answer was to create my own library to do exactly this called vectrig: https://github.com/jeremysalwen/vectrig

answered Jun 08 '11 at 20:52

Jeremy Salwen

8,061
5
50
73

Vectorized Trig functions in C?

5 Answers5

Linked