Context: I am currently learning how to properly use CUDA, in particular how to generate random numbers using CURAND. I learned here that it might be wise to generate my random numbers directly when I need them, inside the kernel which performs the core calculation in my code.

Following the documentation, I decided to play a bit and try to come up with a simple, running piece of code which I can later adapt to my needs.

I excluded MTGP32 because of the limit of 256 concurrent threads in a block (and just 200 pre-generated parameter sets). Besides, I do not want to use doubles, so I decided to stick to the default generator (XORWOW).

Problem: I am having a hard time understanding why the same seed value in my code generates different sequences of numbers when the number of threads per block is bigger than 128 (when blockSize<129, everything runs as I would expect). After adding proper CUDA error checking, as suggested by Robert in his comment, it is somewhat clear that hardware limitations play a role. Moreover, compiling without the "-G -g" flags raises the "trouble threshold" from 128 to 384.

Questions: What exactly is causing this? Robert wrote in his comment that "it might be a registers per thread issue". What does this mean? Is there an easy way to look at the hardware specs and tell where this limit will be? Can I get around this issue without having to generate more random numbers per thread?

A related issue seems to have been discussed here but I do not think it applies to my case.

My code (see below) was mostly inspired by these examples.

Code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <curand_kernel.h>

    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
        if (code != cudaSuccess){ 
           fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
           if (abort) exit(code);
        }
    }

    __global__ void setup_kernel(curandState *state, int seed, int n){

        int id = threadIdx.x + blockIdx.x*blockDim.x;

        if(id<n){
            curand_init(seed, id, 0, &state[id]);
        }
    }

    __global__ void generate_uniform_kernel(curandState *state, float *result, int n){

        int id = threadIdx.x + blockIdx.x*blockDim.x;
        float x;

        if(id<n){
            curandState localState = state[id];
            x = curand_uniform(&localState);
            state[id] = localState;
            result[id] = x; 
        }
    }

    int main(int argc, char *argv[]){

        curandState *devStates;
        float *devResults, *hostResults;

        int n = atoi(argv[1]);
        int s = atoi(argv[2]);
        int blockSize = atoi(argv[3]);

        int nBlocks = n/blockSize + (n%blockSize == 0?0:1);

        printf("\nn: %d, blockSize: %d, nBlocks: %d, seed: %d\n", n, blockSize, nBlocks, s);

        hostResults = (float *)calloc(n, sizeof(float));
        cudaMalloc((void **)&devResults, n*sizeof(float));

        cudaMalloc((void **)&devStates, n*sizeof(curandState));
        setup_kernel<<<nBlocks, blockSize>>>(devStates, s, n);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize() );

        generate_uniform_kernel<<<nBlocks, blockSize>>>(devStates, devResults, n);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize() );

        cudaMemcpy(hostResults, devResults, n*sizeof(float), cudaMemcpyDeviceToHost);

        for(int i=0; i<n; i++) {
            printf("\n%10.13f", hostResults[i]);
        }

        cudaFree(devStates);
        cudaFree(devResults);
        free(hostResults);

        return 0;
    }

I compiled two binaries, one using the "-G -g" debugging flags and the other without. I named them rng_gen_d and rng_gen, respectively:

     $ nvcc -lcuda -lcurand -O3 -G -g --ptxas-options=-v rng_gen.cu -o rng_gen_d
    ptxas /tmp/tmpxft_00002257_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
    ptxas info    : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
    ptxas info    : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
    ptxas info    : Used 43 registers, 32 bytes smem, 72 bytes cmem[1], 6480 bytes lmem
    ptxas info    : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
    ptxas info    : Used 10 registers, 36 bytes smem, 40 bytes cmem[1], 48 bytes lmem

     $ nvcc -lcuda -lcurand -O3 --ptxas-options=-v rng_gen.cu -o rng_gen
    ptxas /tmp/tmpxft_00002b73_00000000-5_rng_gen.ptx, line 533; warning : Double is not supported. Demoting to float
    ptxas info    : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
    ptxas info    : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
    ptxas info    : Used 20 registers, 32 bytes smem, 48 bytes cmem[1], 6440 bytes lmem
    ptxas info    : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
    ptxas info    : Used 19 registers, 36 bytes smem, 4 bytes cmem[1]

To start with, there is a strange warning message at compile time (see above):

    ptxas /tmp/tmpxft_00002b31_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float

Some debugging showed that the line causing this warning is:

    curandState localState = state[id];

There are no doubles declared, so I do not know exactly how to solve this (or even if this needs solving).

Now, an example of the (actual) problem I am facing:

     $ ./rng_gen_d 5 314 127

    n: 5, blockSize: 127, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen_d 5 314 128

    n: 5, blockSize: 128, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen_d 5 314 129

    n: 5, blockSize: 129, nBlocks: 1, seed: 314
    GPUassert: too many resources requested for launch rng_gen.cu 54

Line 54 is gpuErrchk() right after setup_kernel().

With the other binary (no "-G -g" flags at compile time), the "threshold for trouble" is raised to 384:

     $ ./rng_gen 5 314 129

    n: 5, blockSize: 129, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen 5 314 384 

    n: 5, blockSize: 384, nBlocks: 1, seed: 314

    0.9151657223701
    0.3925153017044
    0.7007563710213
    0.8806988000870
    0.5301177501678

     $ ./rng_gen 5 314 385

    n: 5, blockSize: 385, nBlocks: 1, seed: 314
    GPUassert: too many resources requested for launch rng_gen.cu 54

Finally, in case this is somehow related to the hardware I am using for this preliminary testing (the project will later be launched on a much more powerful machine), here are the specs of the card I am using:

    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 1 CUDA Capable device(s)

    Device 0: "Quadro NVS 160M"
      CUDA Driver Version / Runtime Version          5.5 / 5.5
      CUDA Capability Major/Minor version number:    1.1
      Total amount of global memory:                 256 MBytes (268107776 bytes)
      ( 1) Multiprocessors, (  8) CUDA Cores/MP:     8 CUDA Cores
      GPU Clock rate:                                1450 MHz (1.45 GHz)
      Memory Clock rate:                             702 Mhz
      Memory Bus Width:                              64-bit
      Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
      Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 8192
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  768
      Maximum number of threads per block:           512
      Max dimension size of a thread block (x,y,z): (512, 512, 64)
      Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             256 bytes
      Concurrent copy and kernel execution:          No with 0 copy engine(s)
      Run time limit on kernels:                     Yes
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      No
      Device PCI Bus ID / PCI location ID:           1 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro NVS 160M
    Result = PASS

And this is it. Any guidance on this matter will be most welcome. Thanks!

EDIT:

1) Added proper cuda error checking, as suggested by Robert.

2) Deleted the cudaMemset line, which was useless anyway.

3) Compiled and ran the code without the "-G -g" flags.

4) Updated the output accordingly.

dd_rlwll

  • When I run your code, I get the same results regardless of the 3rd command line parameter. Please add proper [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code and retest. Also, your usage of `cudaMemset` is wrong. It initializes bytes just like `memset`, so it can't be sensibly used to initialize `float` quantities. Review the [documentation](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge07c97b96efd09abaeb3ca3b5f8da4ee). – Robert Crovella Nov 02 '13 at 04:12
  • I cannot reproduce this problem on a very similar GPU with CUDA 5. Whatever the problem is, it is not in the code you have shown. – talonmies Nov 02 '13 at 10:09
  • It might be a registers per thread issue, which is why I suggested proper cuda error checking. It's not obvious, but if it's compiled for `sm_10` or `sm_11` and with `-G`, the register usage for the setup kernel is up close to the limit of 8192 for this device. – Robert Crovella Nov 02 '13 at 12:28
  • @Robert, I edited my original post to include proper cuda error checking. The usage of cudaMemset was indeed wrong but the line was useless anyway, so I deleted it. I was surprised to see that the debugging flags at compile time seem to reduce the limit by a factor of 3 (from 384 to 128) in my hardware configuration. I expected it to fail only when the 3rd command line parameter was above 512, which is supposedly the "maximum number of threads per block". – dd_rlwll Nov 02 '13 at 14:01
  • @talonmies, I believe it is down to hardware limitations on my end. However, I would like to know how to identify these limitations beforehand, so that I can structure the rest of the code accordingly. – dd_rlwll Nov 02 '13 at 14:03
  • @RobertCrovella, Regarding efficiency, will I need to worry about the cuda error checking in the final code to run? What kind of a hit should I expect in running time? – dd_rlwll Nov 02 '13 at 19:32

1 Answer

First of all, when you're having trouble with CUDA code, it's always advisable to do proper cuda error checking. It will eliminate a certain amount of head scratching, probably save you some time, and will certainly improve the ability of folks to help you on sites like this one.

Now you've discovered you have a registers per thread issue. While generating code, the compiler uses registers for various purposes, and each thread requires this complement of registers to run its thread code. When you attempt to launch a kernel, one of the requirements that must be met is that the number of registers required per thread, multiplied by the number of threads requested in the launch, must not exceed the total number of registers available per block. Note that the number of registers required per thread may be rounded up to some granular allocation increment. Also note that the number of threads requested will normally be rounded up to the next higher multiple of 32 (if not evenly divisible by 32), since threads are launched in warps of 32. Also note that the max registers per block varies by compute capability, and this quantity can be inspected via the deviceQuery sample as you've shown. And, as you've discovered, certain command line switches like -G can affect how nvcc utilizes registers.

To get advance notice of these types of resource issues, you can compile your code with additional command line switches:

nvcc -arch=sm_11 -Xptxas=-v -o mycode mycode.cu

The -Xptxas=-v switch makes the ptxas assembler (which converts intermediate ptx code to sass assembly code, i.e. machine code) print its resource usage, including registers required per thread. Note that the output is delivered per kernel in this case, as each kernel may have its own requirements. You can get more info about the nvcc compiler in the documentation.
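
You can also query this at runtime with `cudaFuncGetAttributes` from the runtime API. A minimal sketch (reusing the setup_kernel from the question; requires a CUDA-capable machine and the usual error checking, omitted here for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void setup_kernel(curandState *state, int seed, int n){
    int id = threadIdx.x + blockIdx.x*blockDim.x;
    if(id<n) curand_init(seed, id, 0, &state[id]);
}

int main(void){
    cudaFuncAttributes attr;
    // Query the compiled per-kernel resource usage at runtime.
    cudaFuncGetAttributes(&attr, setup_kernel);
    // maxThreadsPerBlock already folds the register limit for this
    // kernel on this device into a usable block-size bound.
    printf("regs/thread: %d, lmem/thread: %zu, max threads/block: %d\n",
           attr.numRegs, attr.localSizeBytes, attr.maxThreadsPerBlock);
    return 0;
}
```

This gives you an answer to the "how do I identify the limit beforehand" question programmatically, per kernel and per device.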

As a crude workaround, you can specify a switch at compile time to limit all kernel compilation to a max register usage number:

nvcc -arch=sm_11 -Xptxas=-v -maxrregcount=16 -o mycode mycode.cu

This would limit each kernel to using no more than 16 registers per thread. When multiplied by 512 (the hardware limit of threads per block for a cc1.x device) this yields a value of 8192, which is the hardware limit on total registers per threadblock for your device.

However, the above method is crude in that it applies the same limit to all kernels in your program. If you wanted to tailor this to each kernel (for example, if different kernels in your program launch different numbers of threads) you could use the launch bounds methodology, which is described here.
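
For illustration, a launch bounds annotation on the setup kernel from the question might look like this (the values are illustrative for a cc1.x device, not prescriptive):

```cuda
#include <curand_kernel.h>

// The first __launch_bounds__ argument promises the compiler this
// kernel will never be launched with more than 512 threads per block,
// so it can budget registers accordingly; the optional second argument
// requests at least 1 resident block per multiprocessor.
__global__ void
__launch_bounds__(512, 1)
setup_kernel(curandState *state, int seed, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n)
        curand_init(seed, id, 0, &state[id]);
}
```

Unlike -maxrregcount, this constrains only the annotated kernel, leaving the compiler free to use more registers elsewhere.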

Robert Crovella
  • I actually want different sequences for different threads but I want the sequences to be the same if the seed value is the same when I run the program. Surprisingly, this seems to be the case only when the size of the blocks is 128 threads or fewer. I included a couple of examples in my original post which should make the problem clear. Maybe you can give it another read? Thanks for the input. – dd_rlwll Nov 02 '13 at 01:24
  • Although I can't test this right now, I have some quick questions: 1) Is there a difference between the flags "-Xptxas=-v" and "--ptxas-options=-v"? 2) Regarding the debugging flags, the output of nvcc using "--ptxas-options=-v" (see my post) shows that with "-G -g" the compiler uses 43 registers for setup_kernel() and without them only 20 registers. However, 128*43=5504 and 384*20=7680, both less than 8192. Isn't this odd? 3) What are/can be the consequences of manually setting the number of registers per thread? Should I expect efficiency loss, for example? – dd_rlwll Nov 02 '13 at 19:30
  • 1.) No. (read the doc link I provided) 2.) You didn't read my answer about granularity. The correct multiplication is *at least* 128*48 or 160*48, because registers get allocated in chunks and so do threads. And there is probably some register overhead. Beyond that, not sure what you mean by "odd". 3.) The compiler will generate different code. This code may or may not be slower. If you want to focus carefully on tuning, you could trade off number of threads for registers (and perhaps also occupancy), and see what kind of results you get. – Robert Crovella Nov 02 '13 at 22:17