
I wrote a simple kernel that manipulates elements of a one-dimensional array:

    #include "stdio.h"

__global__ void Loop(double *X, int CellsNum, int VarNum,const double constant1)
{

int idx = threadIdx.x+blockDim.x*blockIdx.x;
int i = (idx+1)*VarNum ;
double exp1,exp2,exp3,exp4 ;

if(idx<CellsNum-2) {

exp1=double(0.5)*(X[i+6+VarNum]+X[i+6])+X[i+10] ;
exp2=double(0.5)*(X[i+8+VarNum]+X[i+8]) ;

if(i==0) {
printf("%e %e",exp1,exp2) ;
}

exp3=X[i+11]-constant1*(exp1*exp2)/X[i+5] ;

exp4=constant1*(X[i+9]*exp1-X[i+9-VarNum]*exp2)/X[i+5] ;

X[i+12]=exp3+exp4;
}
}

extern "C" void cudacalc_(double *a, int* N1, int* N2, double* N3)
{
int Cells_Num = *N1;
int Var_Num = *N2;
double constant1 = *N3;

Loop<<<1,Cells_Num>>>(a,Cells_Num,Var_Num,constant1);

}

But it doesn't work if I comment out this piece of code:

    if (i == 0) {
        printf("%e %e", exp1, exp2);
    }

even though the variable i is always greater than zero. When I comment out these lines, the code produces NaNs in the X array. I am running this code compiled with the -arch sm_20 flag on a Tesla GPU. Maybe somebody can help me with this issue?

  • You're doing no [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) in this code. Please add cuda error checking. I suspect your kernel is not running properly. When you have trouble with a particular piece of CUDA code, you should start by making sure you are doing proper cuda error checking on all cuda API calls and all kernel calls. Furthermore, SO expects you to provide a complete, compilable code that demonstrates the issue, as well as the observed results and expected results. Refer to SSCCE.org. – Robert Crovella Aug 23 '13 at 22:57 (a sketch of this kind of error checking appears after these comments)
  • It's possible that when you comment out the printf, the compiler can more aggressively optimize around `exp1` and `exp2`, but the more aggressive optimization may require more register usage. If the number of registers per thread times the number of threads per block exceeds the limit, your kernel will fail to launch (and return an error code). – Robert Crovella Aug 23 '13 at 23:01
  • Thank you for your answers! I will add proper cuda error checking. But if the kernel fails to launch there should be no change in array X, yet I see that NaNs are produced, and this behaviour persists when I compile with the -O0 flag. – Yakovenko Ivan Aug 23 '13 at 23:06
  • OK, then add the error checking and provide a complete reproducer. SO expects: "Questions concerning problems with code you've written must describe the specific problem — and include valid code to reproduce it — in the question itself. See SSCCE.org for guidance. " – Robert Crovella Aug 23 '13 at 23:15
  • By the way you have plenty of opportunity for race conditions in this code. Each thread writes to `X[i+12]`. But each thread also reads from several different locations in `X`, so threads are reading values that are being written by other threads. That looks like trouble if `Cells_Num` is larger than 32. Adding or removing the `printf` statement could be affecting the outcome of the various races. See if the problem goes away by having the code write to a separate output array, like `Xo[i+12]` or something like that. – Robert Crovella Aug 23 '13 at 23:22
  • Thank you again! You are absolutely right, there was a data race. Not in this kernel in particular, but between sequential executions of this kernel in the program. Proper synchronization resolved the issue. – Yakovenko Ivan Aug 24 '13 at 00:56
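
For reference, a minimal sketch of the kind of error checking suggested in the comments above, applied to the wrapper from the question. Only cudacalc_ and Loop come from the original code; the error handling itself is illustrative:

    #include <stdio.h>

    extern "C" void cudacalc_(double *a, int *N1, int *N2, double *N3)
    {
        int Cells_Num = *N1;
        int Var_Num = *N2;
        double constant1 = *N3;

        Loop<<<1, Cells_Num>>>(a, Cells_Num, Var_Num, constant1);

        // check that the launch itself was accepted
        // (e.g. not rejected for too many threads or registers per block)
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        }

        // check for errors that occur while the kernel is running
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
        }
    }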

1 Answer


This kernel has the opportunity for a race condition, because the kernel code is both reading from X and writing to X with no synchronization or protection.

The simplest way to fix this is probably to have the output statement write to a different array:

    Xo[i+12] = exp3 + exp4;
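
For illustration, here is a sketch of how the kernel from the question might look with a separate output buffer. Xo is a hypothetical second device array of the same size as X, allocated and passed in by the caller:

    __global__ void Loop(const double *X, double *Xo, int CellsNum, int VarNum, const double constant1)
    {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        int i = (idx + 1) * VarNum;

        if (idx < CellsNum - 2) {
            double exp1 = double(0.5) * (X[i+6+VarNum] + X[i+6]) + X[i+10];
            double exp2 = double(0.5) * (X[i+8+VarNum] + X[i+8]);
            double exp3 = X[i+11] - constant1 * (exp1 * exp2) / X[i+5];
            double exp4 = constant1 * (X[i+9] * exp1 - X[i+9-VarNum] * exp2) / X[i+5];

            // all reads come from X and the only write goes to Xo,
            // so no thread can read a value another thread is writing
            Xo[i+12] = exp3 + exp4;
        }
    }

The host would then allocate Xo separately, pass both pointers to the kernel, and copy or swap the buffers between successive launches.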

cuda-memcheck can help check for race conditions within a kernel. Use cuda-memcheck --help to find the specific racecheck options.

Robert Crovella