
I'm writing a program in CUDA that makes a huge number of calls to the sincos() function in double precision. I'm afraid this is one of the biggest bottlenecks in the code, and I cannot reduce the number of calls to the function.

Is there any decent approximation to sincos in CUDA or in a library I can import? I am also quite concerned with the accuracy, so the better the approximation is, the happier my code will be.

I've also thought about building a lookup table or approximating the values with a Taylor series, but I want some opinions before going down that road.

asked by Alejandro · edited by talonmies

1 Answer


A pretty fast and accurate sincos function is available in the CUDA math API; just include math.h. Or use sincosf (here) if that will work for you. (I'm aware that you said double precision in your question. Just pointing some things out.)
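For reference, a minimal kernel using the built-in double-precision `sincos` might look like this (array names and launch configuration are illustrative):

```cuda
__global__ void compute_sincos(const double *theta, double *s, double *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double si, ci;
        sincos(theta[i], &si, &ci);   // CUDA math API, double precision
        s[i] = si;
        c[i] = ci;
    }
}
// launch, e.g.: compute_sincos<<<(n + 127) / 128, 128>>>(d_theta, d_s, d_c, n);
```

The single-precision variant is `sincosf`; compiling with `-use_fast_math` additionally maps `sincosf` to the faster, less accurate `__sincosf` intrinsic. There is no fast-math equivalent for the double-precision `sincos`.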

If you can use sincospif instead of sincosf, @njuffa has worked his magic here, which may interest you.
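The point of the sincospi family, sketched as a device-code fragment (variable names illustrative): when the angle is naturally of the form θ = π·x, passing x directly avoids rounding π·x before argument reduction, which helps both speed and accuracy.

```cuda
// If theta = PI * x, instead of
//   sincos(CUDART_PI * x, &s, &c);   // PI*x is rounded before reduction
// prefer:
double s, c;
sincospi(x, &s, &c);                  // sin(pi*x), cos(pi*x), double precision
// single precision: sincospif(x, &sf, &cf);
```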

This question and this question may also interest you.

– Robert Crovella
  • I'm already using `sincos` from `math.h`, and I don't know if I'll lose too much accuracy using `sincosf` — what do you think? Although my angles are not `pi*expr`, thank you for pointing out `sincospif`! – Alejandro Aug 27 '16 at 02:04
  • 1
    certainly there is a lot of difference in precision (bits) between `sincosf` and `sincos`. I don't know how important it would be to your particular algorithm. Since you're interested in performance, and usually performance and precision are a tradeoff, it seems logical to investigate the sensitivity of your algorithm to the extra bits of precision, given the stipulations in your question. When njuffa comes by, he will be able to answer all your questions. – Robert Crovella Aug 27 '16 at 02:09
  • 2
    @Alejandro In addition to applicability of `sincos` there could be other special usage patterns. Some codes use sine and cosine in regular angle increments, which allows those values to be computed without calls to `sincos`. Other codes use sine and cosine in conjunction with inverse trig functions, such uses can often be replaced with potentially cheaper algebraic computation. You might want to consider asking a question on how sine and cosine calls can be reduced for your particular use case. – njuffa Aug 27 '16 at 02:14
  • 2
    If this is related to your [previous question](http://stackoverflow.com/questions/39171823/cuda-parallelized-raytracer-very-low-speedup), I think you may have your priorities mixed up. I can't imagine an optimized `sincos` providing more than 10% benefit. On the other hand, launching blocks of 5 threads in CUDA is borderline silly. You're leaving **more** than 27/32 of the available performance of your GPU on the table, meaning fixing that could lead to 6-10x speedup. You should pay attention to the advice given to you by @tera in the comments to that question. Try to use 128 threads per block – Robert Crovella Aug 27 '16 at 02:44
  • Yes, @RobertCrovella, it is related, but I did change that in my code and I'm trying to optimize it further. My algorithm is now 15x-20x faster than the sequential code. I haven't responded to those comments yet because I'm still testing that advice. **This** question is still relevant, though, as I profiled my code **after** changing the grid/block dimensions and saw that `sincos` is a huge bottleneck. If you want to answer my other question, let's discuss it _there_, thank you! – Alejandro Aug 27 '16 at 10:16
  • I am a bit puzzled how the profiler would attribute time to `sincos` since the function should be inlined in a release build (except for the reduction code for very large arguments, which is a called subroutine, which however should "never" be invoked). What happened to execution time when you replaced `sincos` with `sincosf`, as had been suggested? – njuffa Aug 27 '16 at 16:11