I have read in various places that __device__ functions are almost always inlined by the CUDA compiler. Is it correct to say, then, that there is (generally) no increase in the number of registers used when I move code from a kernel into a __device__ function that is called by the kernel?
As an example, do the following snippets use the same number of registers? Are they equally efficient?
SNIPPET 1
__global__ void manuallyInlined(float *A, float *B, float *C, float *D, float *E) {
// code that manipulates A, B, C, D and E
}
SNIPPET 2
__device__ void fn(float *A, float *B, float *C, float *D, float *E) {
// code that manipulates A, B, C, D and E
}
__global__ void usesDeviceFunction(float *A, float *B, float *C, float *D, float *E) {
fn(A, B, C, D, E);
}
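One way to test this empirically is to force each behavior with the `__noinline__` and `__forceinline__` qualifiers and compare the per-kernel register counts that `nvcc -Xptxas -v` reports. A minimal sketch, where the simple arithmetic body is a placeholder standing in for the real "code that manipulates A, B, C, D and E":

```cuda
// inline_test.cu
// Compile with:  nvcc -Xptxas -v -c inline_test.cu
// ptxas prints the register usage of each __global__ function,
// so the two kernels below can be compared directly.

// Variant the compiler is told NOT to inline.
__device__ __noinline__ void fnNoInline(float *A, float *B, float *C,
                                        float *D, float *E) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    E[i] = A[i] * B[i] + C[i] * D[i];  // placeholder body
}

// Variant the compiler is told to inline.
__device__ __forceinline__ void fnInline(float *A, float *B, float *C,
                                         float *D, float *E) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    E[i] = A[i] * B[i] + C[i] * D[i];  // same placeholder body
}

__global__ void viaNoInline(float *A, float *B, float *C, float *D, float *E) {
    fnNoInline(A, B, C, D, E);
}

__global__ void viaInline(float *A, float *B, float *C, float *D, float *E) {
    fnInline(A, B, C, D, E);
}
```

If the `-Xptxas -v` output shows the same register count for both kernels, the call overhead (if any) is not costing registers for this particular body; differences would show up directly in that output.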