
I have read in various places that __device__ functions are almost always inlined by the CUDA compiler. Is it correct to say, then, that there is (generally) no increase in the number of registers used when I move code from a kernel into a __device__ function that is called by the kernel?

As an example, do the following snippets use the same number of registers? Are they equally efficient?

SNIPPET 1

__global__ void manuallyInlined(float *A,float *B,float *C,float *D,float *E) {
    // code that manipulates A,B,C,D and E 
}

SNIPPET 2

__device__ void fn(float *A,float *B,float *C,float *D,float *E) {
    // code that manipulates A,B,C,D and E 
}


__global__ void manuallyInlined(float *A,float *B,float *C,float *D,float *E) {
    fn(A,B,C,D,E);
}
weemattisnot

1 Answer


The final answer can only be determined by using the tools (compiling with -Xptxas -v, or using one of the profilers), but the general answer is that calling a __device__ function can affect the number of registers used (as well as performance and efficiency).
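
For example (the file name and GPU architecture here are just placeholders), compiling with

nvcc -arch=sm_70 -Xptxas -v snippet.cu -o snippet

causes ptxas to print a per-kernel resource summary, including the number of registers used, so you can compare the two snippets directly.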

Depending on your file organization, and how you compile your code, a __device__ function may be inlined. If it is inlined, this generally gives the optimizing compiler (ptxas, mainly) the best chance to adapt register usage as it sees fit. (Note that, at least in theory, this "adaptation" could result in either more or fewer registers being used. In practice, however, inlining generally lets the compiler use fewer registers and possibly achieve higher performance. Keep in mind that the compiler primarily optimizes for performance, not for lower register usage.)

On the other hand, if it is not inlined, then it must be handled as an ordinary function call. As on many other computer architectures, a function call involves setting up a stack frame to pass variables and then transferring control to the function. In this scenario, the compiler is more restricted because:

  1. It must move variables used by the function to/from the stack frame
  2. It cannot perform other optimizations based on "surrounding" code, because it does not know what the surrounding code is. The __device__ function must be handled in a standalone fashion by the compiler.
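
As a minimal sketch of the second restriction (the function and kernel names here are hypothetical, not taken from the question): when the call below is inlined, the compiler can see that scale is always 1.0f at this call site and can fold the multiply away; if the same code is compiled as a standalone, non-inlined function, it must keep the multiply and the register(s) backing the scale parameter.

__device__ float scaleAdd(float x, float y, float scale) {
    return x * scale + y;                       // cannot be simplified without knowing 'scale'
}

__global__ void kernel(float *out, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = scaleAdd(a[i], b[i], 1.0f);    // constant argument; foldable only if inlined
}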

So if the function can be inlined, there should not be much difference between your two approaches. If the function cannot be inlined, then there will usually be a noticeable difference in register usage between the two approaches.

Some obvious factors that may impact whether the compiler will attempt to inline a __device__ function are:

  1. If the __device__ function is in a separate compilation unit from the __global__ or other __device__ function that calls it. In this case, the only way this can work is via CUDA separate compilation and linking, also called device-linking. In such a scenario, the compiler will not (cannot) inline the function.

  2. If the __noinline__ function qualifier is specified. Note that this is only a hint to the compiler; it may be ignored.
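
For the second case, the qualifier goes on the __device__ function definition; __forceinline__ is the corresponding hint in the opposite direction (sketch only, reusing the names from the question):

__device__ __noinline__ void fn(float *A,float *B,float *C,float *D,float *E) {
    // code that manipulates A,B,C,D and E -- compiled as a genuine call if the hint is honored
}

__device__ __forceinline__ void fn2(float *A,float *B,float *C,float *D,float *E) {
    // code that manipulates A,B,C,D and E -- inlined into each caller where possible
}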

Robert Crovella