
Basically I have to calculate N2 sums of exp functions over N1 * N2 vectors; for simplicity I set the dot product to 1 everywhere. On the CPU this would simply be:

#define _GNU_SOURCE            // sincosf() is a GNU extension
#include <math.h>

// N1 and N2 are compile-time constants (50000 and 1000 in the example below)
static void calculate_exp(float *realpart, float *imaginarypart)
{
  for (int p = 0; p < N1; p++)
  {
    for (int k = 0; k < N2; k++)
    {
        float qdotx = 1.f;     // normally the dot product of the input vectors
        float sinvalue, cosvalue;
        sincosf(qdotx, &sinvalue, &cosvalue);
        realpart[k]      += cosvalue;
        imaginarypart[k] += sinvalue;
    }
  }
}
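
For completeness, a minimal host-side harness around this function could look like the sketch below. It is not my exact program: the zero-initialization via calloc and the N1/N2 values (taken from the example further down) are only there to make the snippet self-contained.

#include <stdio.h>
#include <stdlib.h>

#define N1 50000
#define N2 1000

/* plus the <math.h> include and calculate_exp() from above */

int main(void)
{
    /* the accumulators must start at zero, since calculate_exp() only adds to them */
    float *realpart      = calloc(N2, sizeof *realpart);
    float *imaginarypart = calloc(N2, sizeof *imaginarypart);

    calculate_exp(realpart, imaginarypart);

    printf("realpart[0] = %f (ideal value %f)\n", realpart[0], N1 * cos(1.));

    free(realpart);
    free(imaginarypart);
    return 0;
}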

Because of this easily parallelizable structure, this is a perfect example to implement in CUDA and run on a GPU. I did so and wrote a program that works quite well and is much faster than the CPU calculation. But the more threads I use per block, the more the resulting sum differs from the original result. My CUDA code is:

__global__ void calculate_exp(float *output_sin, float *output_cos, float *qdotx)
{
  __shared__ float cache_sin[threadsPerBlock];
  __shared__ float cache_cos[threadsPerBlock];

  for (int ql = 0; ql < N2; ql++)
  {
    int cacheIndex = threadIdx.x;

    float tmp_sin = 0.f, tmp_cos = 0.f;
    cache_cos[cacheIndex] = 0.f;
    cache_sin[cacheIndex] = 0.f;

    // grid-stride loop: each thread accumulates its share of the N1 terms
    for (int index = threadIdx.x + blockDim.x * blockIdx.x; index < N1;
         index += blockDim.x * gridDim.x)
    {
        qdotx[index] = 1.f;            // normally the dot product of the input vectors
        float sinvalue, cosvalue;
        sincosf(qdotx[index], &sinvalue, &cosvalue);
        tmp_cos += cosvalue;
        tmp_sin += sinvalue;
    }
    cache_cos[cacheIndex] = tmp_cos;
    cache_sin[cacheIndex] = tmp_sin;
    __syncthreads();

    // block-wide reduction in shared memory
    // (the repeated halving assumes blockDim.x is a power of two)
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i) {
            cache_cos[cacheIndex] += cache_cos[cacheIndex + i];
            cache_sin[cacheIndex] += cache_sin[cacheIndex + i];
        }
        __syncthreads();
        i /= 2;
    }
    __syncthreads();

    // one thread per block adds the block result to the global sum
    if (cacheIndex == 0) {
        atomicAdd(&output_cos[ql], cache_cos[0]);
        atomicAdd(&output_sin[ql], cache_sin[0]);
    }
    __syncthreads();
  }
}

My kernel call is

int N_p = int(N1/threadsPerBlock) +1;
calculate_exp <<<N_p,threadsPerBlock>>>(dev_output_sin,dev_output_cos,dev_qdotx);
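
For context, the host-side setup around this call is roughly the following sketch. It is not my exact code: the h_output_cos / h_output_sin buffers are placeholders, and the rounded-up grid-size formula is an alternative to the int(N1/threadsPerBlock) + 1 above (the two only differ when N1 is an exact multiple of threadsPerBlock).

float *dev_output_sin, *dev_output_cos, *dev_qdotx;
cudaMalloc(&dev_output_sin, N2 * sizeof(float));
cudaMalloc(&dev_output_cos, N2 * sizeof(float));
cudaMalloc(&dev_qdotx,      N1 * sizeof(float));

// the kernel accumulates with atomicAdd, so the outputs must start at zero
cudaMemset(dev_output_sin, 0, N2 * sizeof(float));
cudaMemset(dev_output_cos, 0, N2 * sizeof(float));

// round up so that every index < N1 is covered by at least one thread
int N_p = (N1 + threadsPerBlock - 1) / threadsPerBlock;
calculate_exp<<<N_p, threadsPerBlock>>>(dev_output_sin, dev_output_cos, dev_qdotx);
cudaDeviceSynchronize();

// copy the per-ql sums back for comparison with the CPU results
float h_output_sin[N2], h_output_cos[N2];
cudaMemcpy(h_output_cos, dev_output_cos, N2 * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_output_sin, dev_output_sin, N2 * sizeof(float), cudaMemcpyDeviceToHost);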

It is clear to me that the sincos calculation on the GPU might differ slightly from the standard GNU math library. I have already found that the inaccuracy grows for very large qdotx (e.g. qdotx = 1000.), which is not surprising, but what I don't understand is why the inaccuracy depends on threadsPerBlock.
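
To separate a possible math-library difference from everything else, one can compare a single sincosf() call on the device with the host result. This little test is not part of my actual program; it only checks how large the per-call difference is at all:

#include <cstdio>
#include <cmath>

__global__ void single_sincos(float x, float *s, float *c)
{
    sincosf(x, s, c);               // device version of sincosf
}

int main()
{
    float x = 1.0f;                 // same argument as in the example below
    float *d_s, *d_c;
    cudaMalloc(&d_s, sizeof(float));
    cudaMalloc(&d_c, sizeof(float));
    single_sincos<<<1, 1>>>(x, d_s, d_c);

    float gpu_s, gpu_c;
    cudaMemcpy(&gpu_s, d_s, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&gpu_c, d_c, sizeof(float), cudaMemcpyDeviceToHost);

    float cpu_s = sinf(x), cpu_c = cosf(x);     // host math library
    printf("sin: gpu=%.9f cpu=%.9f diff=%g\n", gpu_s, cpu_s, gpu_s - cpu_s);
    printf("cos: gpu=%.9f cpu=%.9f diff=%g\n", gpu_c, cpu_c, gpu_c - cpu_c);

    cudaFree(d_s);
    cudaFree(d_c);
    return 0;
}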

Example

Take N1 = 50000, N2 = 1000 and vary threadsPerBlock.

The sum of the realpart values on the CPU is always 27015.113831 (i.e. 50000 * cos(1)). The difference between the resulting output_cos on the GPU and the realpart values on the CPU is as follows:

 ThreadsPerBlock          sum on GPU      difference
      1                   27026.736328      11.623047
      16                  27014.515625      0.597656
      32                  27014.869141      0.244141
      64                  27015.251953      0.138672              
      >64                 differs with each ql, differences from 9 to 32
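
The per-ql behaviour in the last row can be checked with a small loop like the following (h_output_cos is a placeholder for the copied-back GPU array, realpart is the CPU array from the first listing):

// print the per-ql deviation of the GPU sums from the CPU sums
void report_differences(const float *h_output_cos, const float *realpart)
{
    for (int ql = 0; ql < N2; ql++)
        printf("ql = %4d   gpu = %f   cpu = %f   difference = %f\n",
               ql, h_output_cos[ql], realpart[ql],
               h_output_cos[ql] - realpart[ql]);
}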

I tried double precision instead of single float precision as well, but the result stayed the same.

I searched a lot for this question but didn't find an answer. Everything I found deals with the implementation of sincos (e.g. here and here), not with the kernel call.

So I have two questions:

  1. Why does the accuracy depend on threadsPerBlock even though the calculation should always be the same?

  2. Is there a way to improve the accuracy even for large values of threadsPerBlock (e.g. threadsPerBlock = 256)?
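
Regarding question 2, one direction that comes to mind is compensated (Kahan) summation in the per-thread loop, so that the accumulated rounding error grows more slowly no matter how the work is split across threads. This is only an illustrative sketch, not code from my program, and I have not verified whether it survives aggressive floating-point optimization:

// Kahan-compensated add of 'value' into the running 'sum'; 'comp' carries
// the low-order bits that were rounded away in previous additions.
__device__ __forceinline__ void kahan_add(float value, float &sum, float &comp)
{
    float y = value - comp;
    float t = sum + y;
    comp = (t - sum) - y;       // what got lost in computing sum + y
    sum  = t;
}

// Inside the per-thread loop of the kernel the two accumulations would become
//     kahan_add(cosvalue, tmp_cos, c_cos);
//     kahan_add(sinvalue, tmp_sin, c_sin);
// with c_cos and c_sin initialized to 0.f next to tmp_cos and tmp_sin.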

Farino
  • 1) You have posted the code that does the parallel calculation, but only pseudo-code for the one-thread calculation. To get better help, post the true code used in the single thread version. (types used, what is `COS,SIN`?) Try using `sincosf()` for the single-thread version too. 2) "I tried double precision instead of single float precision as well but the result stays the same." --> did you change the function call `sincosf()` --> `sincos()`? 3) to be clear: do we have a radians/degrees issue. ( I assume all is in radians). – chux - Reinstate Monica Oct 11 '16 at 13:10
  • Thanks for answering. 1.) Yes sorry I was too unclear with that, I added an edited version. 2.) Yes this is exactly what I meant. 3.) Yes we are dealing with radians. – Farino Oct 11 '16 at 13:47
  • Are `realpart, imaginarypart`, `output_sin, output_cos` properly initialized (in unposted code)? – chux - Reinstate Monica Oct 11 '16 at 13:54
  • Yes they are. Do I have to post this as well? I did not want to make the question too long. Maybe to make it clear, if I use <<>> kernel call I get no significant inaccuracies. – Farino Oct 11 '16 at 14:00
  • I have my doubts that the calculation is parceled out correctly. Perhaps using a simple `sincosf (qdotx,&sinvalue,&cosvalue); realpart[k] += cosvalue; imaginarypart[k]+= sinvalue;` --> `realpart[k] += qdotx; imaginarypart[k]+= qdotx;` as a test to detect if both methods generate the same result (IOWS, a coding problem, not an inaccuracy one.) (I have concerns when `i` is odd in `while (i != 0) {`) – chux - Reinstate Monica Oct 11 '16 at 14:08
  • Thanks for the hint. The first advice I already did before and it gives the same result. But if I choose i.g. `qdotx=100` resp `qdotx[index] =100.`(so without paying attention to the vectors) I get differences in sum for `output_cos, output_sin` resp. `realpart, imaginarypart` depending on the value `threadsPerBlock`. What do you mean with the second statement " have concerns when `i` is odd in `while (i != 0)`" ? – Farino Oct 11 '16 at 14:30
  • Are you compiling with `-use_fast_math` by any chance? Also, can you give a *specific* example of a `sincosf()` call you believe delivers a result with large relative error (i.e. specify the exact input, outputs observed, vs outputs expected)? – njuffa Oct 11 '16 at 17:55
  • I notice that the computation of `qdotx` involves a mul-add sequence, for which the CUDA compiler will likely use a fused multiply-add (FMA), which tends to improve accuracy, but can also cause numerical differences with *less accurate* CPU computations without FMA. You can check whether this applies here by turning off FMA generation with `-fmad=false`. Note that this tends to have a negative effect on accuracy and performance. – njuffa Oct 11 '16 at 19:23
  • I think you're more likely to get a definitive answer if you provide a [mcve]. I'm not convinced that the two calculations are even doing the same arithmetic thing in the case where there are multiple threads per block, but it's difficult to say without a complete example. – Robert Crovella Oct 11 '16 at 21:11
  • Thank you for your response. I'm sorry for my confusing first post, I hope the edited version is more clear. @njuffa Yes I tried `-use_fast_math`, no differences and thanks for the hint with `-fmad=false` I didn't know about that, but this was also not changing anything significantly ... – Farino Oct 12 '16 at 08:32
  • @Farino: Since `-use_fast_math` has a definite impact on the accuracy of `sincosf()` (it reduces accuracy) this suggests the issue here is not with `sincosf()` and the question title is quite possibly misleading. The fact that `-fmad=false` also make no difference further suggests that there is a functional bug in the code. – njuffa Oct 12 '16 at 08:39
  • @njuffa Yes you were right, it was a bug, I mixed something with float and double precision, anyway thanks, the comments at least helped me to take a deeper look at the code, because I was quite sure there was something obvious wrong with my code I posted above... For my next question I'll be more accurate – Farino Oct 17 '16 at 08:28
  • I have voted to close this question as it was caused by an error not described in the question and there is no answer to provide to this 3.5 year old question. I would encourage others to do the same – talonmies Jan 01 '21 at 11:59

0 Answers