
I'm having a weird problem with my code. If I try to print the value of a certain variable inside a thread, nothing gets written to the screen and all the threads stop at that point. Here is the code:

    #define WINSIZE 1
    const int nebsize=(WINSIZE*2+1)*(WINSIZE*2+1);

    __global__ void loop(double *img, int *consts, int w, int h, double epsilon){

        int ind=blockIdx.x*blockDim.x+threadIdx.x;
        if(ind<w*h && !consts[ind] && ind%w>=WINSIZE && ind%w<w-WINSIZE && ind/w>=WINSIZE && ind/w<h-WINSIZE){
            int win_inds[nebsize];
            double winI[3*(2*WINSIZE+1)*(2*WINSIZE+1)];
            double winI_re_aux[3*nebsize];
            double pre_win_var[9];
            double win_var[9];
            double win_mu[3];
            double tvals[nebsize*nebsize];
            double detwin;
            int min_i=ind%w-WINSIZE;
            int max_i=ind%w+WINSIZE;
            int min_j=ind/w-WINSIZE;
            int max_j=ind/w+WINSIZE;
            int k;
            int l;
            k=0;        
            for(int i=min_i; i<=max_i; i++){
                for(int j=min_j; j<=max_j; j++){
                    win_inds[k]=h*i+j;
                    k++;
                }
            }
            k=0;
            for(int j=min_j; j<=max_j; j++){        
                l=0;
                for(int i=min_i; i<=max_i; i++){
                    winI[3*(l*(2*WINSIZE+1)+k)]=img[3*(j*w+i)];
                    winI[3*(l*(2*WINSIZE+1)+k)+1]=img[3*(j*w+i)+1];
                    winI[3*(l*(2*WINSIZE+1)+k)+2]=img[3*(j*w+i)+2];
                    l++;
                }
                k++;
            }

            win_mu[0]=0;
            win_mu[1]=0;
            win_mu[2]=0;    
            for(int i=0; i<nebsize; i++){
                win_mu[0]+=winI[3*i];
                win_mu[1]+=winI[3*i+1];
                win_mu[2]+=winI[3*i+2];
            }
            win_mu[0]=win_mu[0]/(double)nebsize;
            win_mu[1]=win_mu[1]/(double)nebsize;
            win_mu[2]=win_mu[2]/(double)nebsize;
            //all ok here

            //this works here
            if(ind==200){   
                    printf("%f\n", win_var[8]);
            }

            for(int i=0; i<3; i++){
                for(int j=0; j<3; j++){
                    pre_win_var[3*i+j]=0;
                    for(int n=0; n<nebsize; n++){
                        pre_win_var[3*i+j]+=winI[3*n+i]*winI[3*n+j];
                    }
                    pre_win_var[3*i+j]=pre_win_var[3*i+j]/(double)nebsize;
                    pre_win_var[3*i+j]+=(i==j)*epsilon/(double)nebsize-win_mu[j]*win_mu[i];
                }
            }
            //this kills all threads          
            if(ind==200){   
                    printf("%f\n", win_var[8]);
            }
            detwin=pre_win_var[0]*pre_win_var[4]*pre_win_var[8]+pre_win_var[2]*pre_win_var[3]*pre_win_var[7]+pre_win_var[1]*pre_win_var[5]*pre_win_var[6];
            detwin-=pre_win_var[6]*pre_win_var[4]*pre_win_var[2]+pre_win_var[3]*pre_win_var[1]*pre_win_var[8]+pre_win_var[7]*pre_win_var[5]*pre_win_var[0];

            win_var[0]=(pre_win_var[4]*pre_win_var[8]-pre_win_var[5]*pre_win_var[7])/detwin;
            win_var[3]=-(pre_win_var[3]*pre_win_var[8]-pre_win_var[5]*pre_win_var[6])/detwin;
            win_var[6]=(pre_win_var[3]*pre_win_var[7]-pre_win_var[4]*pre_win_var[6])/detwin;
            win_var[1]=-(pre_win_var[1]*pre_win_var[8]-pre_win_var[2]*pre_win_var[7])/detwin;
            win_var[4]=(pre_win_var[0]*pre_win_var[8]-pre_win_var[2]*pre_win_var[6])/detwin;
            win_var[7]=-(pre_win_var[0]*pre_win_var[7]-pre_win_var[1]*pre_win_var[6])/detwin;
            win_var[2]=(pre_win_var[1]*pre_win_var[5]-pre_win_var[2]*pre_win_var[4])/detwin;
            win_var[5]=-(pre_win_var[0]*pre_win_var[5]-pre_win_var[2]*pre_win_var[3])/detwin;
            win_var[8]=(pre_win_var[0]*pre_win_var[4]-pre_win_var[1]*pre_win_var[3])/detwin;                

            //this line gets executed in all threads if I printf nothing
            consts[ind]=666;

        }
    }

Printing the values of win_var or pre_win_var works only before the values are calculated; if I try to print them after that, it seems to kill all the threads. If I print nothing, the line consts[ind]=666 gets executed in all threads. I know this because I can copy consts back to host memory and print it. Does anyone have any idea what's wrong?

protas
  • You likely have an illegal access in your code. Since most of your code activity does not affect any global state, the compiler will optimize it out, including the code that is doing the illegal access. Somehow, the `printf` is affecting the compiler optimization decision in a non-obvious way, which is affecting what code gets thrown away. Use the method described in the answer [here](http://stackoverflow.com/questions/27277365) in your failing case, to identify the exact line of code that is generating the illegal access. Voting to close as questions like this are expected to include an MCVE – Robert Crovella Sep 02 '15 at 20:33
  • That's what I get: `========= CUDA-MEMCHECK` `========= Program hit cudaErrorLaunchOutOfResources (error 7) due to "too many resources requested for launch" on CUDA API call to cudaLaunch.` I guess I'm out of memory. – protas Sep 02 '15 at 20:55
  • @protas: Probably out of registers. Try reducing the block size. – talonmies Sep 02 '15 at 20:57
  • That makes more sense. I couldn't tie the printf directly to code inclusion. But the printf can definitely affect register usage. – Robert Crovella Sep 02 '15 at 21:01
  • The error goes away if I comment out the printf line. – protas Sep 02 '15 at 21:01
  • @talonmies: Using 16 blocks instead of 8 and 300 threads instead of 600 did it. Thanks! – protas Sep 02 '15 at 21:07

1 Answer


The problem appears to be one of resource exhaustion. You are getting cudaErrorLaunchOutOfResources at launch with printf enabled because of the larger register footprint of the kernel with the ABI call included.
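A launch failure like this is reported on the host side, so checking the error state right after the launch makes it visible without a separate tool. The following is a minimal sketch, assuming the CUDA runtime API; `report_launch` and its `label` argument are hypothetical names introduced here, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: call it immediately after a kernel launch.
// Returns true only if both the launch and the kernel execution succeeded.
bool report_launch(const char *label) {
    // Launch-time failures (for example cudaErrorLaunchOutOfResources)
    // are returned by cudaGetLastError().
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // Errors raised while the kernel runs surface at the next sync point.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

Calling `report_launch("loop")` directly after `loop<<<blocks, threads>>>(...)` should report the "too many resources requested for launch" condition at the point where it happens, instead of the kernel silently not running.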

You didn't provide any details about your launch parameters, but reducing the total threads per block to a smaller multiple of 32 should cure the problem.
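To find out how small the block needs to be, the runtime can report the per-kernel limit, which already accounts for register usage. This is a sketch assuming `cudaFuncGetAttributes` from the CUDA runtime API; `my_kernel` is a stand-in for the real kernel, and `fits_in_block` and `print_kernel_limits` are hypothetical helpers. Compiling with `nvcc --ptxas-options=-v` is another way to see the register count.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): substitute the real kernel, e.g. loop.
__global__ void my_kernel(int *out, int n) {
    int ind = blockIdx.x * blockDim.x + threadIdx.x;
    if (ind < n) {
        out[ind] = ind;
    }
}

// Hypothetical helper: true if a block of this size is within the
// launch limit recorded in attr.
bool fits_in_block(const cudaFuncAttributes &attr, int threadsPerBlock) {
    return threadsPerBlock > 0 && threadsPerBlock <= attr.maxThreadsPerBlock;
}

// Prints the kernel's resource usage and returns its launch limit,
// or -1 if the query fails (for example when no device is present).
int print_kernel_limits() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, my_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return -1;
    }
    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    std::printf("max threads per block:   %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}
```

Any block size above the reported `maxThreadsPerBlock` would be expected to fail at launch for that kernel, so the value gives an upper bound when picking a multiple of 32.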

talonmies
  • What's better: having the number of threads as a multiple of 32, or the number of blocks as a multiple of the number of multiprocessors in the GPU? I'm trying to calculate the number of blocks and threads at run time, based on the number of pixels in the input image and the number of multiprocessors on the GPU. – protas Sep 02 '15 at 21:42
  • Both, ideally. But threads per block being a multiple of the warp size is most important. – talonmies Sep 02 '15 at 21:44
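The run-time calculation discussed in these comments could look roughly like the following. It is only a sketch: `round_up`, `LaunchConfig`, `make_config`, and `make_config_for_device` are hypothetical names, device 0 is assumed, and the requested thread count still has to stay within what the kernel's register usage allows.

```cuda
#include <cuda_runtime.h>

// Round n up to the next multiple of m (m > 0).
int round_up(int n, int m) {
    return ((n + m - 1) / m) * m;
}

struct LaunchConfig {
    int threads;  // threads per block, a multiple of the warp size
    int blocks;   // number of blocks, a multiple of the multiprocessor count
};

// Hypothetical helper: derive a 1D launch configuration that covers
// numPixels elements. warp and sms would normally come from the device.
LaunchConfig make_config(int numPixels, int wantedThreads, int warp, int sms) {
    LaunchConfig cfg;
    cfg.threads = round_up(wantedThreads, warp);
    // Enough blocks to cover every pixel...
    int needed = (numPixels + cfg.threads - 1) / cfg.threads;
    // ...then padded so the block count divides evenly across multiprocessors.
    cfg.blocks = round_up(needed, sms);
    return cfg;
}

// Queries device 0 (assumption) and fills cfg; returns false on failure.
bool make_config_for_device(int numPixels, int wantedThreads, LaunchConfig *cfg) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return false;
    }
    *cfg = make_config(numPixels, wantedThreads, prop.warpSize,
                       prop.multiProcessorCount);
    return true;
}
```

Padding the grid launches some threads past the end of the image, which is only safe because the kernel in the question already guards with `ind<w*h`.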