
I am puzzled by the following program (code below). It works fine and gives correct results when the two lines in the kernel that define specsin and speccos read as follows (note the second factor, sin(t)):

specsin+=sin(pi*t/my_tau)*sin(t)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
speccos+=sin(pi*t/my_tau)*sin(t)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));

Once I change this second sin(t) factor to sin(t+0.0*my_a0*my_a0), which shouldn't change the result, I get all zeros instead of the correct answer.

Can it be that I ran out of kernel memory?

#include <stdio.h>
__global__ void Calculate_Spectrum(float * d_Detector_Data, int numCols, int numRows,
                                const float omega_min, float dOmega,
                               const float a0_min, float da0,
                               const float tau_min, float dtau, float dt)
{

    int Global_x = blockIdx.x * blockDim.x + threadIdx.x;
    int Global_y = blockIdx.y * blockDim.y + threadIdx.y;
    int Position1D = Global_y * numCols + Global_x;

    float my_omega=omega_min + Global_x * dOmega;
    float my_a0=a0_min + Global_y*da0;
    float my_tau=tau_min;
    int total_time_steps=int(my_tau/dt);

    float specsin=0.0;
    float speccos=0.0;
    float t=0.0;
    float pi=3.14159265359;

    for(int n=0; n<total_time_steps; n++)
    {
        t=n*dt;
        specsin+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*sin(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
        speccos+=sin(pi*t/my_tau)*sin(t+0.0*my_a0*my_a0)*cos(my_omega*(t+my_a0*my_a0/4.0/pi*(2.0*pi*t-my_tau*sin(2*pi*t/my_tau))));
    }

    d_Detector_Data[Position1D]=(specsin*specsin+speccos*speccos)*dt*dt*my_a0*my_a0*my_omega*my_omega/4.0/pi/pi;
}


int main(int argc, char ** argv)
{
    const int omega_bins = 1024;
    const int a0_bins = 512;
    const int tau_bins = 1;

    const float omega_min = 0.5;
    const float omega_max = 1.1;
    const float a0_min = 0.05;
    const float a0_max = 1.0;
    const float tau_min = 1200;
    const float tau_max = 600;

    const int steps_per_period=20;  // for integrating
    float dt=1.0/steps_per_period;

    int TotalSize = omega_bins * a0_bins * tau_bins;

    float dOmega=(omega_max-omega_min)/(omega_bins-1);
    float da0=(a0_max-a0_min)/(a0_bins-1);
    float dtau=0.;

    float * d_Detector_Data;
    int * d_Global_x;
    int * d_Global_y;


    float h_Detector_Data[TotalSize];   // host result buffer (~2 MB on the stack)

    // allocate GPU memory
    cudaMalloc((void **) &d_Detector_Data, TotalSize*sizeof(float));

    Calculate_Spectrum<<<dim3(1,a0_bins,1), dim3(omega_bins,1,1)>>>(d_Detector_Data, omega_bins, a0_bins, omega_min, dOmega, a0_min, da0, tau_min, dtau, dt);


    // copy results back to host
    cudaMemcpy(h_Detector_Data, d_Detector_Data, TotalSize*sizeof(float), cudaMemcpyDeviceToHost);

    FILE * SaveFile;
    char TempStr[255];

    sprintf(TempStr, "result.dat");
    SaveFile = fopen(TempStr, "w");

    int counter=0;

    for(int j=0; j<a0_bins;j++)
    {
        for(int i=0; i<omega_bins; i++)
        {
    fprintf(SaveFile,"%e\t", h_Detector_Data[counter]);
    counter++; 

        }
    fprintf(SaveFile, "\n");
    }
    fclose(SaveFile);

    // free GPU memory
    cudaFree(d_Detector_Data);

    return 0;

}
Sleepyhead
  • Reduce the number of threads to check if this is a memory problem. – janisz Jun 30 '13 at 20:24
  • Thanks, janisz. That helps. So I am really running out of shared memory for all the kernels? Any general ideas on how to improve the code so that it occupies less memory? – Sleepyhead Jun 30 '13 at 20:37
  • When I ran your code, it seems to run fine for me. That is, the `result.dat` file did not contain all zeroes. The code you have posted seems to be the one with the modified sin term that you say is failing. Perhaps you should do [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) on **both** the cuda API calls **and** the kernel call. I suspect in the failing case your kernel is not running, possibly a register issue. But you haven't told us anything about your GPU. – Robert Crovella Jun 30 '13 at 20:46
  • The code you have posted doesn't use any shared memory, so I don't think you're running out of that. – Robert Crovella Jun 30 '13 at 21:34
  • I have some strange observations when running this code on a cc 2.1 device. `sin(t+ 0.0 * my_a0 * my_a0)` produces all zeros in the output, while `sin(t+0.0f * my_a0 * my_a0)` gives some values. Also, disabling GPU debug information produces correct output, while enabling it gives zeros in the output iff division operations are present in the kernel. – sgarizvi Jun 30 '13 at 21:39
  • Some of these variations affect registers per thread. When the total registers per thread times the threads requested exceeds the capability of the SM, the kernel launch will fail. Again, the reason for the "all zeroes" should be obvious with proper error-checking -- a failed kernel launch. – Robert Crovella Jun 30 '13 at 22:19
  • You can also try using the [nvcc switch](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation) to limit registers per thread, e.g. `-maxrregcount=20` I assume this device is cc2.0 or higher since 1024 threads per block are specified. – Robert Crovella Jun 30 '13 at 22:28
  • Robert Crovella, thanks for useful comments. When I changed the number of threads from 1024 to 512 the code started working, although I know for sure 1024 is allowed and works perfectly when I only have sin(t). The GPUs I believe are Fermis. I still don't have any idea why this happens, but it really looks like the kernel is not starting (doesn't produce any `printf`, for example). – Sleepyhead Jul 01 '13 at 03:35
  • Ok, I changed all the numbers from 0.0 etc to 0.0f, 2.0f etc and now the code works perfectly (thanks sgar91). Can someone explain why this needs to be done (why float must be cast on every number?) – Sleepyhead Jul 01 '13 at 04:40
  • As I've stated already, I believe you have an issue with registers per thread. All of the variations discussed may affect compiler code generation, which may affect register usage. When your register usage exceeds the maximum available, the kernel launch fails. Read the [nvcc manual](http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html) to learn how to display register usage statistics as well as how to limit the registers per thread. You should add the cuda error checking I mentioned as well. – Robert Crovella Jul 01 '13 at 14:12

1 Answer


I believe this is due to a register limitation.

In order for a kernel to launch, the registers used per thread must not exceed the per-thread maximum (the documented "maximum number of 32-bit registers per thread"), and the registers per thread multiplied by the number of threads requested must not exceed the per-multiprocessor limit (the "number of 32-bit registers per multiprocessor").
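If you want to see these numbers directly, a short sketch along the following lines (not part of your code; it queries the register count the compiler assigned to the kernel and the device's per-block register limit) can be added to the host code:

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, Calculate_Spectrum);   // registers the compiler assigned per thread

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // limits of device 0

int threadsPerBlock = 1024;                         // matches the dim3(omega_bins,1,1) block in the question
printf("registers per thread: %d\n", attr.numRegs);
printf("registers available per block: %d\n", prop.regsPerBlock);
if (attr.numRegs * threadsPerBlock > prop.regsPerBlock)
    printf("this configuration would fail to launch: too many registers requested\n");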

In the cases where you are getting incorrect results, I believe your kernel is not launching for this reason (too many registers requested in total). You're not doing any CUDA error checking, but if you did, I believe you could confirm this.
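For example, something like this minimal sketch would do (the gpuErrchk macro name is just illustrative, not part of your code; exit() needs <stdlib.h>):

#define gpuErrchk(call)                                                   \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(1);                                                      \
        }                                                                 \
    } while (0)

// after the kernel launch in main():
Calculate_Spectrum<<<dim3(1,a0_bins,1), dim3(omega_bins,1,1)>>>(d_Detector_Data,
    omega_bins, a0_bins, omega_min, dOmega, a0_min, da0, tau_min, dtau, dt);
gpuErrchk(cudaGetLastError());      // reports launch failures, e.g. "too many resources requested for launch"
gpuErrchk(cudaDeviceSynchronize()); // reports errors that occur while the kernel runs

In the failing case, I would expect the first check after the launch to report "too many resources requested for launch".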

You can work around this using any method that brings the total back under the limit. Obviously, reducing the threads per block is a direct way to do this. Other things, such as specifying the -G switch to the compiler, also affect code generation and therefore may affect registers per thread. Another way to work around this is to instruct the compiler to limit its register usage to some maximum amount per thread. This is documented in the nvcc manual; the usage is like this:

nvcc -maxrregcount=xx  ... (rest of compile command)

where xx is the maximum number of registers each thread is allowed to use. If you limit it to, let's say, 20 per thread, then even with 1024 threads per block you will still only be using roughly 20K registers, and this will fit within any device that supports 1024 threads per block (cc 2.0 and above).

You can also get register usage statistics by asking the compiler to generate those:

nvcc -Xptxas -v ... (rest of compile command) 

which will cause the compiler to generate a variety of statistics about resource usage, including the number of registers per thread used/expected.
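For the kernel in the question, the relevant part of that output looks roughly like the lines below (the register count here is made up purely for illustration; yours will differ):

ptxas info    : Compiling entry function '_Z18Calculate_SpectrumPfii...' for 'sm_20'
ptxas info    : Used 34 registers, ...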

Note that resource usage, including register usage, affects occupancy, which has implications for overall application performance. Therefore, limiting register usage may not only allow the kernel to run, but may also allow multiple threadblocks to be resident on an SM, which generally means occupancy is improved and application performance may benefit.
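If you want to inspect this directly, a sketch like the following works on toolkits that provide the occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor):

int blocksPerSM = 0;
int threadsPerBlock = 1024;
// how many blocks of this kernel can be resident per SM, given its register
// and shared memory usage (0 bytes of dynamic shared memory here)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, Calculate_Spectrum,
                                              threadsPerBlock, 0);
printf("resident blocks per SM at %d threads/block: %d\n", threadsPerBlock, blocksPerSM);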

Apparently the usage of 0.0 vs. 0.0f has some subtle effect on compiler behavior, which is showing up in code generation. I would also surmise that you may be right on the boundary of what is acceptable, so perhaps a small change in registers used per thread may be affecting what will run. You can investigate this further using the printout of resource usage statistics from the compiler that I referenced above, and/or possibly by inspecting the PTX code (an intermediate code, something like assembly code) generated:

nvcc -ptx ... (rest of compile command)

If you choose to inspect the PTX, you will want to refer to the PTX manual.
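Near the top of the kernel body in the PTX you will see virtual register declarations roughly like the following (counts here are illustrative; PTX registers are virtual, so they only loosely correlate with the final count ptxas reports):

.reg .pred  %p<4>;
.reg .f32   %f<60>;
.reg .s32   %r<15>;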

Robert Crovella