I'm not clear on the best way to use sincos(). I've looked everywhere, but the consensus seems to be simply that it is better than computing sin and cos separately. Below is essentially what I have in my kernel for using sincos. However, when I time it against calling sin and cos separately, it comes out slower. I suspect it has to do with how I'm using cPtr and sPtr. Is there a better way?
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < dataSize)
{
    idx += lower;
    double f = ((double) idx) * deltaF;
    double cosValue;
    double sinValue;
    double *sPtr = &sinValue;
    double *cPtr = &cosValue;
    sincos(twopit * f, sPtr, cPtr);
    d_re[idx - lower] = cosValue;
    d_im[idx - lower] = - sinValue;
    //d_re[idx - lower] = cos(twopit * f);
    //d_im[idx - lower] = - sin(twopit * f);
}
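
For comparing the two variants in isolation, a minimal self-contained sketch might look like the following. It is not the real kernel: the kernel names, the `runVariant` helper, the problem size, and the values of `lower`, `deltaF` and `twopit` are all placeholder assumptions. It times each variant with CUDA events after a warm-up launch and checks that both produce the same output. The sincos variant passes `&sinValue` and `&cosValue` directly instead of going through separate pointer variables.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Variant A: a single sincos() call; the addresses of the locals are passed
// directly, so no separate pointer variables are needed.
__global__ void fillSincos(double *d_re, double *d_im, int dataSize,
                           int lower, double deltaF, double twopit)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < dataSize)
    {
        double f = (double)(idx + lower) * deltaF;
        double sinValue, cosValue;
        sincos(twopit * f, &sinValue, &cosValue);
        d_re[idx] = cosValue;
        d_im[idx] = -sinValue;
    }
}

// Variant B: separate cos() and sin() calls on the same argument.
__global__ void fillSeparate(double *d_re, double *d_im, int dataSize,
                             int lower, double deltaF, double twopit)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < dataSize)
    {
        double x = twopit * (double)(idx + lower) * deltaF;
        d_re[idx] = cos(x);
        d_im[idx] = -sin(x);
    }
}

// Hypothetical helper: runs one variant, copies the results into re and im,
// and returns the kernel time in milliseconds. Error checking is omitted.
float runVariant(bool useSincos, int dataSize,
                 std::vector<double> &re, std::vector<double> &im)
{
    const int lower = 0;                            // placeholder values,
    const double deltaF = 1.0e-3;                   // not from the real code
    const double twopit = 6.283185307179586 * 0.5;  // 2*pi*t with t = 0.5
    const int threads = 256;
    const int blocks = (dataSize + threads - 1) / threads;
    const size_t bytes = dataSize * sizeof(double);

    double *d_re, *d_im;
    cudaMalloc(&d_re, bytes);
    cudaMalloc(&d_im, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float ms = 0.0f;
    // Launch twice: the first launch is a warm-up, only the second time is kept.
    for (int rep = 0; rep < 2; ++rep)
    {
        cudaEventRecord(start);
        if (useSincos)
            fillSincos<<<blocks, threads>>>(d_re, d_im, dataSize, lower, deltaF, twopit);
        else
            fillSeparate<<<blocks, threads>>>(d_re, d_im, dataSize, lower, deltaF, twopit);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
    }

    re.resize(dataSize);
    im.resize(dataSize);
    cudaMemcpy(re.data(), d_re, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(im.data(), d_im, bytes, cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_re);
    cudaFree(d_im);
    return ms;
}

int main()
{
    const int dataSize = 1 << 20;  // placeholder problem size
    std::vector<double> reA, imA, reB, imB;
    float msSincos = runVariant(true, dataSize, reA, imA);
    float msSeparate = runVariant(false, dataSize, reB, imB);

    // Sanity check: largest difference between the two variants' outputs.
    double maxDiff = 0.0;
    for (int i = 0; i < dataSize; ++i)
    {
        maxDiff = fmax(maxDiff, fabs(reA[i] - reB[i]));
        maxDiff = fmax(maxDiff, fabs(imA[i] - imB[i]));
    }
    printf("sincos:   %.3f ms\n", msSincos);
    printf("separate: %.3f ms\n", msSeparate);
    printf("max abs difference: %g\n", maxDiff);
    return 0;
}
```

This should build with something like `nvcc -O3 sincos_test.cu` (the file name is arbitrary). Each thread does very little arithmetic and writes two doubles to global memory, so the measured gap between the variants may be small or dominated by the stores rather than by the trig calls themselves.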