
I am not clear on the best way to implement sincos(). I've looked everywhere, but the consensus seems to be only that it is faster than computing sin and cos separately. Below is essentially what my kernel does using sincos(). However, when I time it against separate sin() and cos() calls it comes out slower. I suspect it has to do with how I'm using my cPtr and sPtr. Is there a better way?

int idx = blockIdx.x * blockDim.x + threadIdx.x;

if (idx < dataSize)
{
    idx += lower;
    double f = ((double) idx) * deltaF;
    double cosValue;
    double sinValue;
    double *sPtr = &sinValue;
    double *cPtr = &cosValue;
    sincos(twopit * f, sPtr, cPtr);

    d_re[idx - lower] = cosValue;
    d_im[idx - lower] = - sinValue;

    //d_re[idx - lower] = cos(twopit * f);
    //d_im[idx - lower] = - sin(twopit * f);
}
harmonickey
  • Last I checked on a C2050 with CUDA 4.2, sincos() could produce function values at about 1.3x the rate of separate sin() and cos() calls. This is from memory; I am not in front of a CUDA-enabled machine at the moment to double-check. What GPU do you use, and what are the actual times measured for the two variants? It seems the code is simply filling a (lengthy?) array with sine and cosine values? If so, your code could be memory bandwidth bound, and may not be limited by the rate at which you can evaluate these trig functions. – njuffa Jul 20 '12 at 07:41
    Have you checked the register footprint of the `sincos()` kernel versus the alternative using `sin()` and `cos()`? It could be an occupancy issue changing the performance of the kernel. – talonmies Jul 20 '12 at 08:19
  • Thanks talonmies you are right about that. I fixed it and it does increase performance as it should. – harmonickey Jul 20 '12 at 15:13
    If the variable twopit includes the factor π, you might also want to try sinpi(), cospi(), and sincospi() [the last available in CUDA 5.0]. Due to a simplified argument reduction, these functions have a lower register footprint than the regular trig functions. For example, if the variable twopit = 2*π*t, sin(twopit * f) would become sinpi(twot * f), where twot = 2*t. – njuffa Jul 20 '12 at 16:02
  • And check out the [single precision intrinsics](https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html) for a real speed boost. Beware the [inaccuracies](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions) – Tyson Hilmer May 14 '19 at 10:56

1 Answer


The pointers are redundant - you can get rid of them, e.g.

double cosValue;
double sinValue;
sincos(twopit * f, &sinValue, &cosValue);

but I'm not sure this will have much effect on performance (worth a try though).

Also consider using float rather than double where precision requirements permit, and use the corresponding single precision functions (sincosf in this case).

Paul R