
I am developing an application with CUDA on Ubuntu 16.04 using CUDA Toolkit 8.0.

My problem concerns the number of threads per block that a GTX 960M (compute capability 5.0) can contain. I'm trying to use the maximum number of threads in each block, so I'm using the cudaGetDeviceProperties() function to get this information (1024 threads, as you can see in the documentation), but when I use more than 512 threads per block in my kernel, the API returns error code 0x7 ("warning: Cuda API error detected: cudaLaunch returned (0x7)"), which means "launch out of resources".

A little example code with my problem:

#include <cstdio>
#include <random>
#include <curand.h>
#include <curand_kernel.h>

#define min(a,b) ((a)<(b)?(a):(b))

__global__ void bootstrap_V1(int nSamples, int sampleFraction, int seed, unsigned int* sampleIDs, unsigned int* inbagCounts){

  int tid = threadIdx.x + blockIdx.x * blockDim.x;
  int offset =  gridDim.x * blockDim.x;

  /*Generating a random number in a specific range:
    1- Use CURAND to generate a uniform distribution between 0.0 and 1.0
    2- Then multiply this by the desired range (largest value - smallest value + 0.999999).
    3- Then add the offset (+ smallest value).
    4- Then truncate to an integer.
  */
  curandState state;
  curand_init(seed, tid, 0, &state);
  while(tid < nSamples*sampleFraction){
    float randf = curand_uniform(&state);
    randf *= ((nSamples - 1) - 0) + 0.999999;
    randf += 0;
    int rand = (int)truncf(randf);

    sampleIDs[tid] = rand;
    atomicAdd(&(inbagCounts[rand]), 1);
    tid += offset;
  }
}

int main(void) {

  int nSamples = 100;
  int sampleFraction = 1;

  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::uniform_int_distribution<size_t>dist;
  cudaError_t error;
  cudaDeviceProp prop;

  cudaGetDeviceProperties(&prop, 0);
  int blocks = prop.multiProcessorCount;
  int maxThreadsPerBlock = prop.maxThreadsPerBlock;
  int seed = dist(gen);

  unsigned int *sampleIDs = (unsigned int *)malloc(nSamples * sampleFraction * sizeof(int));
  unsigned int *inbagCounts = (unsigned int *)malloc(nSamples * sizeof(int));

  unsigned int *dev_sampleIDs, *dev_inbagCounts;
  error = cudaMalloc((void **)&dev_sampleIDs, nSamples*sampleFraction*sizeof(int));
  error = cudaMalloc((void **)&dev_inbagCounts, nSamples*sizeof(int));
  error = cudaMemset(dev_sampleIDs, 0, nSamples*sampleFraction*sizeof(int));
  error = cudaMemset(dev_inbagCounts, 0, nSamples*sizeof(int));
  if (error != cudaSuccess)
    printf("%s\n", cudaGetErrorString(error));

  int threadsPerBlock = min(maxThreadsPerBlock, nSamples);
  bootstrap_V1<<<blocks,threadsPerBlock>>>(nSamples, sampleFraction, seed, dev_sampleIDs, dev_inbagCounts);

  cudaMemcpy(sampleIDs, dev_sampleIDs, nSamples*sampleFraction*sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(inbagCounts, dev_inbagCounts, nSamples*sizeof(int), cudaMemcpyDeviceToHost);

  free(sampleIDs);
  free(inbagCounts);
  cudaFree(dev_inbagCounts);
  cudaFree(dev_sampleIDs);
}

and here are the compilation lines I am using:

/usr/local/cuda-8.0/bin/nvcc -G -g -O0 -std=c++11 -gencode arch=compute_50,code=sm_50  -odir "." -M -o "main.d" "../main.cu"
/usr/local/cuda-8.0/bin/nvcc -G -g -O0 -std=c++11 --compile --relocatable-device-code=false -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50  -x cu -o  "main.o" "../main.cu"
/usr/local/cuda-8.0/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_50,code=compute_50 -gencode arch=compute_50,code=sm_50 -link -o  "prueba"  ./main.o

Can someone explain why this is happening to me? Thank you very much.

By request, the ptxas verbose output:

ptxas info    : 77696 bytes gmem, 72 bytes cmem[3]
ptxas info    : Function properties for cudaDeviceGetAttribute
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z12bootstrap_V1iiiPjS_' for 'sm_50'
ptxas info    : Function properties for _Z12bootstrap_V1iiiPjS_
    6560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 128 registers, 6560 bytes cumulative stack size, 352 bytes cmem[0], 8 bytes cmem[2]
ptxas info    : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessor
    32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN4dim3C2Ejjj
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Function properties for cudaMalloc
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaGetDevice
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN4dim3C1Ejjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN61_INTERNAL_39_tmpxft_000013a3_00000000_7_main_cpp1_ii_055b743a9atomicAddEPjj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for cudaFuncGetAttributes
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    error 7 is "launch out of resources". Although it can be triggered if you increase thread count, it is not arising out of a fundamental limit on the threads per block. It is arising out of the structure of your code as it is handled by the compiler. For example, you may be requesting too many registers for the launch, which is triggered as you increase thread count. But it's impossible to say for sure, since you've not provided a [mcve] – Robert Crovella Oct 17 '16 at 14:06
  • I just uploaded a basic example where, if the number of samples exceeds 512, the problem appears. Thanks. – SolidusAbi Oct 17 '16 at 14:58
  • and what is the compile command line you are using for this code? what files are you including? – Robert Crovella Oct 17 '16 at 15:08
  • Apologies for the missing includes, my fault. About the compile line, I'm using Nsight, but I left the command line that Nsight uses. – SolidusAbi Oct 17 '16 at 15:37
  • In Nsight, enable the option to generate verbose output from `ptxas`. Then recompile, and paste the full Nsight command output from compilation into your question. – Robert Crovella Oct 17 '16 at 16:01
  • also, you appear to be building a debug project in Nsight. Nothing wrong with that, but you may observe different behavior building a release project. This is due to the fact that the compiler may use differing numbers of registers per thread in debug (`-G`) vs. release modes. – Robert Crovella Oct 17 '16 at 16:40
  • I have already posted the ptxas info. It is the first time I've activated ptxas --verbose, so I'm not sure this is the information you want. About the debug project, I did not know that; I'll keep it in mind from now on. Thanks for your patience. – SolidusAbi Oct 17 '16 at 17:29

1 Answer


Can someone explain why this is happening to me?

The proximal reason for your observation is contained in this output:

ptxas info    : Compiling entry function '_Z12bootstrap_V1iiiPjS_' for 'sm_50'
ptxas info    : Function properties for _Z12bootstrap_V1iiiPjS_
    6560 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 128 registers, 6560 bytes cumulative stack size, 352 bytes cmem[0], 8 bytes cmem[2]

The above output for the function bootstrap_V1 (i.e. your kernel) indicates that the compiler has chosen to use 128 registers per thread.

In order for a threadblock to launch, and therefore this kernel to work, the product of the number of registers per thread (128) and the total number of threads (e.g. 512, or 1024) gives the total number of registers required.

This number must be less than or equal to the registers available on the SM. The maximum registers per SM is given in Table 13 of the CUDA programming guide (and also in the deviceQuery output for your GPU). For a cc 5.0 GPU, it is 64K. For your code, 128 registers per thread * 1024 threads yields 128K registers, which won't work, and is what leads to the error you are seeing ("too many resources requested for launch"). If you did proper CUDA error checking in your code, you would receive this readable text error message, rather than the terse "0x7" error.
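By way of illustration (this helper is not part of the original code, just a common error-checking pattern), a launch check that surfaces the readable message could look like:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Common error-checking pattern: query the error state right after
// the launch to turn a raw code like 0x7 into readable text.
#define CHECK_LAUNCH()                                              \
  do {                                                              \
    cudaError_t e = cudaGetLastError();                             \
    if (e != cudaSuccess)                                           \
      printf("launch failed: %s\n", cudaGetErrorString(e));         \
  } while (0)

// Usage, right after the launch in main():
//   bootstrap_V1<<<blocks, threadsPerBlock>>>(...);
//   CHECK_LAUNCH();
```

In the failing case this would print "too many resources requested for launch" instead of the bare 0x7.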

When you reduce the threads per block to 512, then the product is 64K, which works.

If you do a little searching on this topic ("CUDA too many registers per thread") you will find a great many treatments which are basically identical to what I have described above, by way of explanation.

Typical methods to address this involve limiting the compiler's register usage per thread. The best method is the use of launch bounds. A cruder method is the -maxrregcount switch to the nvcc compiler. Specification of -maxrregcount is a selectable option within Nsight Eclipse Edition.
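A sketch of what launch bounds would look like on the kernel from the question (the bound of 1024 is a choice matching this GPU's maxThreadsPerBlock, not a fixed value):

```cpp
// __launch_bounds__(1024) tells ptxas to cap register usage so that a
// 1024-thread block can always launch; it may spill to local memory
// instead of failing the launch.
__global__ void __launch_bounds__(1024)
bootstrap_V1(int nSamples, int sampleFraction, int seed,
             unsigned int* sampleIDs, unsigned int* inbagCounts)
{
    // ... kernel body unchanged from the question ...
}
```

The whole-file equivalent is compiling with `-maxrregcount=63` (64K registers / 1024 threads, rounded to an allocatable value).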

As a simple proof point that this is not actually due to a hard limit of 512 threads per block, you can set -maxrregcount to 63, and this code should then launch correctly.

Also note that you are building a debug project, and the compiler differences between debug and release projects may affect the registers-per-thread usage, as can a great many other factors.

It's not related to your question, but you may want to be aware of the fact that building debug projects is not recommended if you want maximum performance.
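For reference, a release-style compile of the same file might look like this (paths taken from the question; -O3 is just an illustrative optimization level):

```shell
# Release build: no -G/-O0, so the compiler is free to optimize
# and typically uses fewer registers per thread.
/usr/local/cuda-8.0/bin/nvcc -O3 -std=c++11 \
  -gencode arch=compute_50,code=sm_50 \
  -o prueba ../main.cu
```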

Robert Crovella