
I am puzzled by the behaviour of the following snippet:

#include <stdio.h>

// kernel
__global__ void CheckAddressing(float *d_Result, int numCols, int numRows)
{
    printf("%d\n", threadIdx.x);
    if (threadIdx.x < 16)
    {
        d_Result[threadIdx.x] = float(364.66);
    }
}

////////

int main(int argc, char **argv)
{
    int TotalSize = 16;
    float *d_Result;
    float *h_Result;

    cudaSetDevice(0);

    h_Result = (float *)malloc(TotalSize * sizeof(float));
    cudaMalloc((void **)&d_Result, TotalSize * sizeof(float));
    CheckAddressing<<<dim3(1), dim3(16)>>>(d_Result, 8, 8);

    cudaMemcpy(h_Result, d_Result, TotalSize * sizeof(float), cudaMemcpyDeviceToHost);

    for (int n = 0; n < 16; n++)
    {
        printf("%f\t", h_Result[n]);
    }
    printf("\n");

    // free GPU memory
    cudaFree(d_Result);
    free(h_Result);
    return 0;
}

It works on one machine (I compile with `nvcc -arch=sm_30`) and returns 364.66 sixteen times. However, on another machine running CUDA 5.5 it returns all zeros. Any idea what could be happening?

UPDATE:

cuda-memcheck ./test
========= CUDA-MEMCHECK
0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    
========= ERROR SUMMARY: 0 errors

nvidia-smi
Fri Apr 18 14:45:05 2014       
+------------------------------------------------------+                       
| NVIDIA-SMI 331.44     Driver Version: 331.44         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         Off  | 0000:02:00.0     Off |                    0 |
| N/A   20C    P0    50W / 235W |     11MiB /  5759MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
Sleepyhead
    The easy way to find out would be to add some error checking. Every API call (and the kernel launch) returns or reports a status. Checking those will explain what is happening. – talonmies Apr 18 '14 at 20:44
  • I am running the code under `cuda-memcheck`. Does that suffice to check for errors? If not, what is the proper way of getting error reports? – Sleepyhead Apr 18 '14 at 21:05
  • http://stackoverflow.com/q/14038589/681865 – talonmies Apr 18 '14 at 21:17
  • You run the code on the machine that is returning all zeroes, and `cuda-memcheck` is reporting zero errors? Can you show your actual invocation and actual output from that machine? (Paste it into your question.) What is the machine configuration and OS? – Robert Crovella Apr 18 '14 at 21:19
  • Robert, the machine is Dirac @ NERSC. I land on one of the nodes interactively (http://www.nersc.gov/users/computational-systems/testbeds/dirac/), compile the code, and just run it in the same interactive mode. I am also in contact with NERSC support, but I thought I might have a mistake in the code; it doesn't look like that. If I don't do the `malloc` on the host array and just do `float h_Result[16]`, the same code gives random numbers – Sleepyhead Apr 18 '14 at 21:31
  • Robert, yes, `cuda-memcheck` reports 0 errors and output is all zeros – Sleepyhead Apr 18 '14 at 21:33

1 Answer


Dirac mentions Fermi GPUs on its banner. If you are on a node with Fermi GPUs, your compile command is incorrect:

-arch=sm_30 

is used for Kepler GPUs.

Try:

-arch=sm_20

instead.

I was confused by the fact that cuda-memcheck was reporting no errors, but the type of error you are encountering is one that cuda-memcheck will not necessarily catch. Specifically, there is a category of launch-failure errors that can only be trapped by the proper CUDA error checking that @talonmies suggested. In particular, note the error-checking code that is required immediately after a kernel launch.
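As a sketch of what that looks like applied to your program (the `checkCuda` macro name is my own, not anything from your code; the pattern follows the linked answer):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Illustrative macro: wrap every CUDA runtime call and abort with a
// readable message on failure.
#define checkCuda(call)                                                   \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(err_), __FILE__, __LINE__);        \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void CheckAddressing(float *d_Result, int numCols, int numRows)
{
    if (threadIdx.x < 16)
        d_Result[threadIdx.x] = 364.66f;
}

int main(void)
{
    const int TotalSize = 16;
    float *d_Result;
    checkCuda(cudaMalloc((void **)&d_Result, TotalSize * sizeof(float)));

    CheckAddressing<<<dim3(1), dim3(16)>>>(d_Result, 8, 8);

    // A kernel launch returns no status directly, so two checks are needed:
    checkCuda(cudaGetLastError());       // launch errors, e.g. "invalid device function"
    checkCuda(cudaDeviceSynchronize());  // errors raised while the kernel runs

    checkCuda(cudaFree(d_Result));
    return 0;
}
```

With the architecture mismatch you are hitting, I would expect the `cudaGetLastError()` check right after the launch to fire, instead of silently copying back zeros.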

When you compile for -arch=sm_30 and try to run the result on a Fermi (sm_20) machine, the kernel launch will immediately fail, but all subsequent CUDA API calls will report no failure.

The detail page for Dirac does mention a couple of Kepler nodes/GPUs:

• 1 node: Tesla K20Xm

• 1 node: Tesla K40c

I believe your code compiled with -arch=sm_35 should run correctly on those nodes.

And I also note that there are even some older ("Tesla" family) GPUs/nodes:

• 4 nodes: 1 C1060 NVIDIA Tesla GPU with 4GB of memory and 240 parallel CUDA processor cores.

• 1 node: 4 C1060 NVIDIA Tesla GPUs, each with 4GB of memory and 240 parallel CUDA processor cores.

For those nodes, you would need to compile with:

-arch=sm_13

But don't forget to use proper CUDA error checking any time you are having difficulty with a CUDA code.

Alternatively, you could use nvcc extended notation to produce a single binary/executable that runs on all 3 types.

Using extended notation, for the 3 different GPU architectures on that cluster (that I can see):

nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 ...
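Spelled out as a full compile command (the filenames `test.cu`/`test` are illustrative), with cuobjdump as an optional sanity check on what actually got embedded:

```shell
# One fat binary carrying device code for all three architectures on the cluster
nvcc -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_35,code=sm_35 \
     -o test test.cu

# Optional: list the embedded GPU architectures; I believe cuobjdump
# reports a line like "arch = sm_35" for each embedded image.
cuobjdump -sass ./test | grep 'arch ='
```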
Robert Crovella
  • I tried it with both `-arch=sm_20` and without specifying arch. Still get all zeros. This is very confusing – Sleepyhead Apr 18 '14 at 21:49
  • It will be confusing since you have at least 3 different node types in that machine, and you may want a code that runs on any of the 3 types. For that I suggest using the extended notation. – Robert Crovella Apr 18 '14 at 22:01
  • for the kepler node you are on (according to your `nvidia-smi` output) you could just try `-arch=sm_35` also. – Robert Crovella Apr 18 '14 at 22:09
  • The extended notation that you mentioned in your answer (last line) finally gave me the result I need. Thank you, Robert, I really appreciate your help! – Sleepyhead Apr 18 '14 at 22:10
  • FWIW since you have `printf` in the kernel, your code can't be compiled as-is for an `sm_13` machine and so you should probably delete that `-gencode` entry. You probably already figured that out. – Robert Crovella Apr 18 '14 at 22:19
  • To be honest, I am still not sure why the code started working when I submit it via the queue using the extended notation. I have still not managed to get successful results on the node itself interactively (I tried all -arch choices) – Sleepyhead Apr 18 '14 at 22:25
  • Yes, I do think something else is going on. Adding the CUDA error checking might be instructive. There may be a machine configuration difference (modules, etc.) between the interactive and non-interactive jobs. A K20 GPU should be able to run code that has been compiled with `-arch=sm_20`, or `-arch=sm_30`, or `-arch=sm_35`. I tried it myself just now with your code to confirm. – Robert Crovella Apr 18 '14 at 22:33