
I want to use assembly code in CUDA C code in order to reduce expensive operations, as we do using asm in C programming.

Is it possible?

superscalar

2 Answers


Since CUDA 4.0, inline PTX is supported by the CUDA toolchain. There is a document in the toolkit that describes it: Using_Inline_PTX_Assembly_In_CUDA.pdf

Below is some code demonstrating the use of inline PTX in CUDA 4.0. Note that this code should not be used as a replacement for CUDA's built-in __clz() function; I merely wrote it to explore aspects of the new inline PTX capability.

__device__ __forceinline__ int my_clz (unsigned int x)
{
    int res;

    asm ("{\n"
         "        .reg .pred iszero, gezero;\n"
         "        .reg .u32 t1, t2;\n"
         "        mov.b32         t1, %1;\n"
         "        shr.u32         %0, t1, 16;\n"
         "        setp.eq.b32     iszero, %0, 0;\n"
         "        mov.b32         %0, 0;\n"
         "@iszero shl.b32         t1, t1, 16;\n"
         "@iszero or.b32          %0, %0, 16;\n"
         "        and.b32         t2, t1, 0xff000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 8;\n"
         "@iszero or.b32          %0, %0, 8;\n"
         "        and.b32         t2, t1, 0xf0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 4;\n"
         "@iszero or.b32          %0, %0, 4;\n"
         "        and.b32         t2, t1, 0xc0000000;\n"
         "        setp.eq.b32     iszero, t2, 0;\n"
         "@iszero shl.b32         t1, t1, 2;\n"
         "@iszero or.b32          %0, %0, 2;\n"
         "        setp.ge.s32     gezero, t1, 0;\n"
         "        setp.eq.b32     iszero, t1, 0;\n"
         "@gezero or.b32          %0, %0, 1;\n"
         "@iszero add.u32         %0, %0, 1;\n"
         "}"
         : "=r"(res)
         : "r"(x));
    return res;
}
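For readers who want to check what the PTX above actually computes, the same binary-search-for-the-leading-one algorithm can be sketched in plain, portable C (the function name is mine; this is an illustration of the logic, not a replacement for `__clz()`):

```c
/* Plain-C sketch of the binary search the PTX performs: test the upper
   half of the remaining bits; if it is all zero, shift it away and add
   the half-width to the count. Returns 32 for x == 0. */
static int clz_sketch(unsigned int x)
{
    int res = 0;
    if ((x >> 16) == 0)         { x <<= 16; res |= 16; }
    if ((x & 0xff000000u) == 0) { x <<= 8;  res |= 8;  }
    if ((x & 0xf0000000u) == 0) { x <<= 4;  res |= 4;  }
    if ((x & 0xc0000000u) == 0) { x <<= 2;  res |= 2;  }
    if ((x & 0x80000000u) == 0) { res |= 1; }  /* top bit still clear */
    if (x == 0)                 { res += 1; }  /* all-zero input: 31 -> 32 */
    return res;
}
```

The two final steps mirror the `gezero` and `iszero` predicates in the PTX: a non-negative (top-bit-clear) value adds 1, and an all-zero value adds 1 more.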
njuffa
  • 23,970
  • 4
  • 78
  • 130

No, you can't; there is nothing like the asm construct from C/C++. What you can do is tweak the generated PTX assembly and then use it with CUDA.

See this for an example.

But for GPUs, assembly optimizations are NOT necessary; you should do other optimizations first, such as memory coalescing and occupancy. See the CUDA Best Practices guide for more information.
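The coalescing advice is mostly about data layout: a structure-of-arrays (SoA) layout lets neighboring threads read neighboring addresses, which the hardware can coalesce into one transaction, while an array-of-structures (AoS) forces each thread to stride through memory. A plain-C sketch of the two layouts (type and helper names are mine, for illustration only):

```c
#include <stddef.h>

#define N 4

/* Array-of-structures: "thread" i reading aos[i].x touches addresses
   sizeof(struct Particle) bytes apart -- a strided, uncoalesced pattern. */
struct Particle { float x, y, z; };
static struct Particle aos[N];

/* Structure-of-arrays: "thread" i reading soa_x[i] touches consecutive
   floats -- the pattern a GPU memory controller can coalesce. */
static float soa_x[N], soa_y[N], soa_z[N];

/* Byte distance between the x values read by two neighboring threads. */
static size_t stride_aos(void) { return (size_t)((char *)&aos[1].x - (char *)&aos[0].x); }
static size_t stride_soa(void) { return (size_t)((char *)&soa_x[1] - (char *)&soa_x[0]); }
```

On a GPU, converting a hot AoS access into SoA form typically buys far more than any hand-written assembly.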

Dr. Snoopy
  • Second that! In my experience, CUDA programs are almost always memory bound, not compute bound. – mch Sep 09 '10 at 14:16
  • thanks above both. I just wanted to reduce the number of division and modulo operations, but now I will focus on the memory issue. – superscalar Sep 10 '10 at 02:02
  • Note, if you're compiling against the newest architecture (using the flag -arch sm_20), the newest API is now fully compliant with IEEE floating point specifications for division and square root. If you've got a bunch of divisions and you're also using -arch sm_20, then you might consider switching back to the "less" compliant version for a performance gain using the flag -prec-div=false. http://forums.nvidia.com/lofiversion/index.php?t170749.html – M. Tibbits Sep 23 '10 at 20:11
  • 3
    Suggest unaccepting this answer and accepting njuffa's, since time has made this answer less useful thanks to new features. – harrism Sep 26 '12 at 00:37
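On the division/modulo point raised in the comments: when the divisor is a power of two and the operand is unsigned, the division and modulo can be replaced by a shift and a mask, a strength reduction compilers perform automatically when the divisor is a compile-time constant. A sketch in plain C (the helper names are mine):

```c
/* For unsigned x and a power-of-two divisor d = 1u << k:
       x / d == x >> k
       x % d == x & (d - 1)
   Valid only for unsigned operands; signed division rounds toward zero
   and needs an extra adjustment. */
static unsigned div_pow2(unsigned x, unsigned k) { return x >> k; }
static unsigned mod_pow2(unsigned x, unsigned k) { return x & ((1u << k) - 1u); }
```

If the divisor is a runtime value that happens to be a power of two, applying this by hand (after checking `(d & (d - 1)) == 0`) can still pay off on hardware where integer division is expensive.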