Basically I have to calculate N2 sums of exp functions over N1 * N2 vectors; for simplicity I set the dot product always to 1. On the CPU this would simply be:
static void calculate_exp(float *realpart, float *imaginarypart)
{
    float qdotx;
    for (int p = 0; p < N1; p++)
    {
        for (int k = 0; k < N2; k++)
        {
            qdotx = 1.f; // normally the dot product of the input vectors
            float sinvalue, cosvalue;
            sincosf(qdotx, &sinvalue, &cosvalue);
            realpart[k] += cosvalue;
            imaginarypart[k] += sinvalue;
        }
    }
}
Because of this easily parallelizable structure, this is a perfect example to implement in CUDA and run on a GPU. I did so and wrote a program which works quite well and is much faster than the CPU calculation. But the more threads I use per block, the more the resulting sum differs from the original result. My CUDA code is:
__global__ void calculate_exp(float *output_sin, float *output_cos, float *qdotx)
{
    int index;
    __shared__ float cache_sin[threadsPerBlock];
    __shared__ float cache_cos[threadsPerBlock];
    for (int ql = 0; ql < N2; ql++)
    {
        int cacheIndex = threadIdx.x;
        float tmp_sin = 0.f, tmp_cos = 0.f;
        cache_cos[cacheIndex] = 0.f;
        cache_sin[cacheIndex] = 0.f;
        // grid-stride loop over the N1 terms
        for (index = threadIdx.x + blockDim.x * blockIdx.x; index < N1; index += blockDim.x * gridDim.x)
        {
            qdotx[index] = 1.f;
            float sinvalue, cosvalue;
            sincosf(qdotx[index], &sinvalue, &cosvalue);
            tmp_cos += cosvalue;
            tmp_sin += sinvalue;
        }
        cache_cos[cacheIndex] = tmp_cos;
        cache_sin[cacheIndex] = tmp_sin;
        __syncthreads();

        // block-wide tree reduction in shared memory
        int i = blockDim.x / 2;
        while (i != 0) {
            if (cacheIndex < i) {
                cache_cos[cacheIndex] += cache_cos[cacheIndex + i];
                cache_sin[cacheIndex] += cache_sin[cacheIndex + i];
            }
            __syncthreads();
            i /= 2;
        }
        __syncthreads();

        if (cacheIndex == 0) {
            atomicAdd(&output_cos[ql], cache_cos[0]);
            atomicAdd(&output_sin[ql], cache_sin[0]);
        }
        __syncthreads();
    }
}
My kernel call is:
int N_p = int(N1/threadsPerBlock) +1;
calculate_exp <<<N_p,threadsPerBlock>>>(dev_output_sin,dev_output_cos,dev_qdotx);
It is clear to me that the sincos calculation on the GPU might be slightly different from the standard GNU math library. I have already figured out that for very large qdotx (e.g. qdotx = 1000.) the inaccuracy increases, which is not surprising, but what I don't understand is: why does the inaccuracy depend on threadsPerBlock?
Example

Take N1=50000, N2=1000 and vary threadsPerBlock. The accumulated realpart value on the CPU is always 27015.113831 (i.e. 50000*cos(1)). The difference between the resulting output_cos of the GPU and the realpart value of the CPU is as follows:
threadsPerBlock   sum on GPU     difference
1                 27026.736328   11.623047
16                27014.515625    0.597656
32                27014.869141    0.244141
64                27015.251953    0.138672
>64               varies with each ql, differences from 9 to 32
I tried double precision instead of single precision as well, but the result stays the same.
I searched a lot for this question, but I didn't find an answer. Everything I found deals with the implementation of sincos itself (e.g. here and here), not with the kernel call.
So I have two questions:

1. Why does the accuracy depend on threadsPerBlock even though the calculation should always be the same?
2. Is there a way to improve the accuracy even for a large value of threadsPerBlock (like threadsPerBlock=256)?