
I compiled a kernel with NVRTC:

```cuda
__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;   // integer division
    unsigned char lane_id = idx % 32;   // integer modulo
    /* ... */
}
```

I know integer division and modulo are very costly on CUDA GPUs. However, I thought this kind of division by a power of 2 would be optimized into bit operations, until I found out it isn't:

```cuda
__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;   // shift instead of division
    unsigned char lane_id = idx & 31;   // mask instead of modulo
    /* ... */
}
```

It seems kernel_B simply runs faster. With all other code in the kernels omitted, launching 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2 µs on average, while kernel_B runs for 7.4 µs. I suspect NVRTC did not optimize away the integer division and modulo.

The results were obtained on a GeForce GTX 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler option given to nvrtcCompileProgram() is -arch compute_50.
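
For reference, the compilation path looks roughly like this (a simplified sketch with error checking omitted; `--gpu-architecture=compute_50` is the long form of the `-arch compute_50` option above):

```cpp
#include <nvrtc.h>
#include <cstdio>

// Simplified stand-in for the real kernel source; kernel_B is compiled
// the same way with the shift/mask body instead.
const char* src = R"(
extern "C" __global__ void kernel_A() {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
}
)";

int main() {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kernel.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_50" };
    nvrtcResult res = nvrtcCompileProgram(prog, 1, opts);
    std::printf("compile: %s\n", nvrtcGetErrorString(res));

    nvrtcDestroyProgram(&prog);
    return 0;
}
```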

Is this expected?

Kh40tiK
  • You don't need to speculate. Run the executables through `cuobjdump -sass` to find out. – tera Jun 01 '17 at 07:44
  • @tera I'm doing JIT compilation via NVRTC so there's no executable file. Any way of getting assembly dumped? – Kh40tiK Jun 01 '17 at 07:48
  • NVRTC emits PTX and passes it to the driver for JIT compilation. You can extract the PTX with `nvrtcGetPTX()` (see the sketch after these comments). Then you don't need to speculate. – talonmies Jun 01 '17 at 08:31
  • @talonmies Thanks for the tip, will try to investigate more. – Kh40tiK Jun 01 '17 at 08:32
  • Not an answer to your question, but you should actually not bother with either of these; the warp and lane id are available [for free](https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-and-lane-id-in-a-1-dimens). – einpoklum Jun 02 '17 at 20:56
  • I think it's unlikely there is any difference between the two cases shown. When compiling all the way to SASS with nvcc, the produced code (both PTX **and** SASS) in each case is identical. I think it's unlikely that nvcc source->PTX followed by JIT of PTX->SASS would be any different. Since a full repro case is not given in this question, my guess is that the difference lies somewhere else. – Robert Crovella Jun 03 '17 at 19:17
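
Following talonmies' suggestion, here is a minimal sketch of dumping the generated PTX for inspection; `prog` is assumed to be an already-compiled `nvrtcProgram`:

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <vector>

// Print the PTX of an already-compiled NVRTC program so the generated
// code can be inspected (look for div.u32/rem.u32 versus shr/and).
void dumpPTX(nvrtcProgram prog) {
    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);   // size includes the trailing NUL
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    std::printf("%s\n", ptx.data());
}
```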

1 Answer


I did a thorough bug sweep of the codebase. It turns out my app was built in DEBUG mode, which caused the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram().

From the nvcc man page:

> --device-debug (-G)
>
> Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.
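
A minimal sketch of the fix: build the option list at runtime so -G is only passed in actual debug builds (`DEBUG_BUILD` stands in for whatever your build system defines):

```cpp
#include <nvrtc.h>
#include <vector>

// Assemble NVRTC options at runtime. -G turns off all device
// optimizations, so it is restricted to debug builds; -lineinfo alone
// keeps optimizations on and is safe for profiling.
std::vector<const char*> buildOptions() {
    std::vector<const char*> opts = { "--gpu-architecture=compute_50" };
#ifdef DEBUG_BUILD
    opts.push_back("-G");
#else
    opts.push_back("-lineinfo");
#endif
    return opts;
}
```

The options then get passed along as `nvrtcCompileProgram(prog, (int)opts.size(), opts.data())`.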

Kh40tiK