CUDA Funnel Shift code generation

Question

Does CUDA generate funnel shift instructions on SM35 internally when using 64-bit integers? My kernel compiles with a SM35 target to PTX which shows a regular shl.b64 instruction. I am aware this PTX output isn't fully optimized yet and native arch code will be generated once the module is loaded (cuModuleLoad).

 shl.b64     %rd5, %rd4, 16;

PS: I've seen some of the bitcoin mining code call intrinsics instead but that doesn't fully clarify my question... https://github.com/MiniblockchainProject/CudaMiner/blob/master/cuda_helper.h#L53 — FHoenig, Feb 09 '15 at 21:28
I'm not entirely sure what you're asking. Does http://docs.nvidia.com/cuda/parallel-thread-execution/#logic-and-shift-instructions-shl not answer your question? What about http://stackoverflow.com/questions/12767113/funnel-shift-what-is-it ? — Christian Sarofeen, Feb 09 '15 at 21:40
funnel shift (PTX) is documented [here](http://docs.nvidia.com/cuda/parallel-thread-execution/#logic-and-shift-instructions-shf). If you study that carefully, you'll note it only works on b32 operands. Yes, it can handle a 64-bit quantity, but it does that by concatenating two 32-bit operands, not by direct treatment of a 64-bit operand. Your question is unclear to me. You could certainly use this via inline PTX. Perhaps your question is "what sequence of CUDA C/C++ source code (if any) would nvcc compile down to a PTX shf instruction?" I'm not sure there is any that can be expressed in C/C++. — Robert Crovella, Feb 09 '15 at 22:12
"what sequence of CUDA C/C++ source code (if any) would nvcc compile down to a PTX shf instruction?" -- YES that would be one possible answer. But really I'm wondering about whether the final device code generated out of "shl.b64" would be a shf of the lower and upper 32-bit words. Not sure how to inspect that low level output after ptx... — FHoenig, Feb 09 '15 at 22:25
You can simply run `cuobjdump -sass yourfile.exe` to see the actual machine code. — void_ptr, Feb 09 '15 at 22:29
I'm using the Driver API compiling .cu to .ptx. The native code is generated on cuModuleLoad. — FHoenig, Feb 09 '15 at 22:39

Robert Crovella · Accepted Answer · 2015-02-09T23:40:13.203

The driver API should not be an obstacle to code analysis. Try writing a test case using the runtime API. Responding to this question in the comments:

But really I'm wondering about whether the final device code generated out of "shl.b64" would be a shf of the lower and upper 32-bit words.

I think it's possible in some situations that ptxas (including the driver JIT engine) can convert PTX logical shift instructions into the SASS equivalent (SHF) of the PTX funnel-shift instruction (shf).

Here's one fully-worked example:

$ cat t625.cu
#include <stdio.h>

__global__ void my_kernel(unsigned long long data)
{

  unsigned long long my_data = data >> 15;
  printf("data = %ld\n",my_data);

}

int main(){

  my_kernel<<<1,1>>>(2ULL<<40);
  cudaDeviceSynchronize();

}
[bob@cluster1 misc]$ nvcc -arch=sm_35 -ptx t625.cu
[bob@cluster1 misc]$ cat t625.ptx
*************EXCERPT**************

        // .globl       _Z9my_kernely
.visible .entry _Z9my_kernely(
        .param .u64 _Z9my_kernely_param_0
)
{
        .local .align 8 .b8     __local_depot6[8];
        .reg .b64       %SP;
        .reg .b64       %SPL;
        .reg .s32       %r<2>;
        .reg .s64       %rd<8>;


        mov.u64         %rd7, __local_depot6;
        cvta.local.u64  %SP, %rd7;
        ld.param.u64    %rd1, [_Z9my_kernely_param_0];
        add.u64         %rd2, %SP, 0;
        cvta.to.local.u64       %rd3, %rd2;
        shr.u64         %rd4, %rd1, 15;    ***** NOTE *****
        st.local.u64    [%rd3], %rd4;
        mov.u64         %rd5, $str;
        cvta.global.u64         %rd6, %rd5;
        // Callseq Start 0
        {
        .reg .b32 temp_param_reg;
        // <end>}
        .param .b64 param0;
        st.param.b64    [param0+0], %rd6;
        .param .b64 param1;
        st.param.b64    [param1+0], %rd2;
        .param .b32 retval0;
        call.uni (retval0),
        vprintf,
        (
        param0,
        param1
        );
        ld.param.b32    %r1, [retval0+0];

        //{
        }// Callseq End 0
        ret;
}
*************EXCERPT**************

$ nvcc -arch=sm_35 t625.cu -o t625
$ cuobjdump -sass t625

*************EXCERPT**************



Fatbin elf code:
================
arch = sm_35
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit

        code for sm_35
                Function : _Z9my_kernely
        .headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                  /* 0x0880109c10801000 */
        /*0008*/                   MOV R1, c[0x0][0x44];          /* 0x64c03c00089c0006 */
        /*0010*/                   ISUB R1, R1, 0x8;              /* 0xc0880000041c0405 */
        /*0018*/                   MOV R0, c[0x0][0x144];         /* 0x64c03c00289c0002 */
        /*0020*/                   MOV R3, c[0x0][0x140];         /* 0x64c03c00281c000e */
        /*0028*/                   MOV32I R4, 0x0;                /* 0x74000000001fc012 */
        /*0030*/                   LOP.OR R6, R1, c[0x0][0x24];   /* 0x62001000049c041a */
        /*0038*/                   MOV32I R5, 0x0;                /* 0x74000000001fc016 */
                                                                  /* 0x0880b80010a0109c */
        /*0048*/                   SHF.R.U64 R2, R3, 0xf, R0;        ***** NOTE *****
        /*0050*/                   SHF.R.U64.HI R3, RZ, 0xf, R0;     ***** NOTE *****
        /*0058*/                   LOP32I.AND R0, R6, 0xffffff;   /* 0x20007fffff9c1800 */
        /*0060*/                   MOV R7, RZ;                    /* 0xe4c03c007f9c001e */
        /*0068*/                   STL.64 [R0], R2;               /* 0x7aa80000001c000a */
        /*0070*/                   JCAL 0x0;                      /* 0x1100000000000100 */
        /*0078*/                   MOV RZ, RZ;                    /* 0xe4c03c007f9c03fe */
                                                                  /* 0x08000000000000b8 */
        /*0088*/                   EXIT;                          /* 0x18000000001c003c */
        /*0090*/                   BRA 0x90;                      /* 0x12007ffffc1c003c */
        /*0098*/                   NOP;                           /* 0x85800000001c3c02 */
        /*00a0*/                   NOP;                           /* 0x85800000001c3c02 */
        /*00a8*/                   NOP;                           /* 0x85800000001c3c02 */
        /*00b0*/                   NOP;                           /* 0x85800000001c3c02 */
        /*00b8*/                   NOP;                           /* 0x85800000001c3c02 */
                ..............................



Fatbin ptx code:
================
arch = sm_35
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
[bob@cluster1 misc]$


*************EXCERPT**************

In the lines marked by ***** NOTE *****, the PTX shr instruction (non-funnel-shift) is being converted to SASS SHF (funnel-shift) instructions.

Thanks so much Robert for taking the time to write this. – FHoenig Feb 09 '15 at 23:33 — FHoenig, Feb 09 '15 at 23:33

CUDA Funnel Shift code generation

1 Answers1