Is inline PTX assembly code powerful?

Question

I have seen code samples where people use inline PTX assembly in C code. The documentation in the CUDA toolkit mentions that PTX is powerful. Why is that? What advantage do we get from using such code in our C code?

This question is a bit like asking "is a piece of string long?". There isn't a definitive answer. There *might* be some circumstances where having explicit control over what PTX instructions are emitted by the compiler is advantageous or necessary, and other circumstances where it isn't. PTX is still only an intermediate representation of the code the GPU will run. — talonmies, Sep 16 '12 at 16:47
I was expecting an example that shows the power of PTX; that would have convinced me. I accept that it is a generic question, but I need an example to convince myself that using PTX gives you some extra power that CUDA C cannot give. — username_4567, Sep 16 '12 at 16:54
Inline PTX gives you access to instructions not exposed via CUDA intrinsics, and lets you apply optimizations that are either lacking in the compiler or prohibited by language specifications. For a worked example where use of inline PTX is advantageous, see: http://stackoverflow.com/questions/6162140/128-bit-integer-on-cuda/6220499#6220499 — njuffa, Sep 16 '12 at 21:37
@ArchaeaSoftware: Thanks, done. I missed the clarification from the original poster that he is just looking for a specific example, and my pointer to the 128-bit arithmetic via inline PTX should fit the bill nicely. — njuffa, Sep 17 '12 at 05:28

score 11 · Answer 1 · edited May 23 '17 at 10:24

Inline PTX gives you access to instructions not exposed via CUDA intrinsics, and lets you apply optimizations that are either lacking in the compiler or prohibited by language specifications. For a worked example where use of inline PTX is advantageous, see: 128 bit integer on cuda? (http://stackoverflow.com/questions/6162140/128-bit-integer-on-cuda/6220499#6220499)
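As a small illustration of the first point, here is a minimal sketch that reads the PTX special register %laneid, which CUDA C does not expose as a built-in variable. The names lane_id, lane_kernel and lane_of_thread are made up for this example, and error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Return this thread's lane index within its warp by reading the
// PTX special register %laneid.
__device__ unsigned int lane_id()
{
    unsigned int id;
    // "%%" emits a literal '%' in the asm template;
    // "=r" binds a 32-bit register as the output operand %0.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

// Each thread stores the lane index it observed.
__global__ void lane_kernel(unsigned int *out)
{
    out[threadIdx.x] = lane_id();
}

// Hypothetical host helper: launch one 1-D block of n threads and
// return the lane index reported by thread tid (tid must be < n).
unsigned int lane_of_thread(unsigned int tid, unsigned int n)
{
    unsigned int *d_out = nullptr;
    unsigned int result = 0xFFFFFFFFu;
    cudaMalloc(&d_out, n * sizeof(unsigned int));
    lane_kernel<<<1, n>>>(d_out);
    cudaMemcpy(&result, d_out + tid, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

In a one-dimensional block the value should match threadIdx.x % 32, so this particular register is a convenience rather than a necessity; the same asm pattern applies to instructions that have no C-level equivalent.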

The 128-bit addition using inline PTX requires just four instructions, since it has direct access to the carry flag. High-level languages such as C and C++ have no representation for a carry flag, because a given hardware platform may have no carry flag (e.g. MIPS), a single carry flag (e.g. x86, sm_2x), or even multiple carry flags. In contrast to the four-instruction PTX versions of 128-bit addition and subtraction, these operations might be coded in C as follows:

/* Portable emulation of carry/borrow propagation on 32-bit unsigned words.
   cy holds the carry (or borrow); t0, t1, t2 are scratch variables.
   A trailing "cc" means the macro writes cy; a "C" after ADD/SUB means it reads cy. */
#define SUBCcc(a,b,cy,t0,t1,t2) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t2=t1<t0, cy=cy+t2, t1-t0)
#define SUBcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), cy=t1<t0, t1-t0)
#define SUBC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t1-t0)
#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)
#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)
#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

unsigned int cy, t0, t1, t2;

/* 128-bit addition, least significant word (x) first */
res.x = ADDcc  (augend.x, addend.x, cy, t0, t1);
res.y = ADDCcc (augend.y, addend.y, cy, t0, t1);
res.z = ADDCcc (augend.z, addend.z, cy, t0, t1);
res.w = ADDC   (augend.w, addend.w, cy, t0, t1);

/* 128-bit subtraction, least significant word (x) first */
res.x = SUBcc  (minuend.x, subtrahend.x, cy, t0, t1);
res.y = SUBCcc (minuend.y, subtrahend.y, cy, t0, t1, t2);
res.z = SUBCcc (minuend.z, subtrahend.z, cy, t0, t1, t2);
res.w = SUBC   (minuend.w, subtrahend.w, cy, t0, t1);

The addition and subtraction above probably compile to about three to four times the number of instructions used by the corresponding inline PTX version.
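For comparison, the four-instruction inline PTX versions can be sketched roughly as follows. This follows the approach of the linked 128-bit answer, using the PTX extended-precision instructions add.cc/addc and sub.cc/subc. The kernel arith128_kernel and the host helper arith128_on_gpu are names invented here just to exercise the device functions, and CUDA error checking is omitted.

```cuda
#include <cuda_runtime.h>

// A 128-bit value is held in a uint4: x is the least significant
// 32-bit word, w the most significant.

// 128-bit addition: add.cc sets the carry flag, addc.cc consumes and
// sets it, and the final addc only consumes it.
__device__ uint4 add_uint128(uint4 augend, uint4 addend)
{
    uint4 res;
    asm("add.cc.u32   %0, %4, %8;\n\t"
        "addc.cc.u32  %1, %5, %9;\n\t"
        "addc.cc.u32  %2, %6, %10;\n\t"
        "addc.u32     %3, %7, %11;\n\t"
        : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
        : "r"(augend.x), "r"(augend.y), "r"(augend.z), "r"(augend.w),
          "r"(addend.x), "r"(addend.y), "r"(addend.z), "r"(addend.w));
    return res;
}

// 128-bit subtraction: same structure, with the borrow carried by
// sub.cc / subc.cc / subc.
__device__ uint4 sub_uint128(uint4 minuend, uint4 subtrahend)
{
    uint4 res;
    asm("sub.cc.u32   %0, %4, %8;\n\t"
        "subc.cc.u32  %1, %5, %9;\n\t"
        "subc.cc.u32  %2, %6, %10;\n\t"
        "subc.u32     %3, %7, %11;\n\t"
        : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
        : "r"(minuend.x), "r"(minuend.y), "r"(minuend.z), "r"(minuend.w),
          "r"(subtrahend.x), "r"(subtrahend.y), "r"(subtrahend.z),
          "r"(subtrahend.w));
    return res;
}

// Hypothetical test kernel: a single thread computes a + b or a - b.
__global__ void arith128_kernel(uint4 a, uint4 b, bool subtract, uint4 *out)
{
    *out = subtract ? sub_uint128(a, b) : add_uint128(a, b);
}

// Hypothetical host helper: run one operation on the GPU and return
// the result to the host.
uint4 arith128_on_gpu(uint4 a, uint4 b, bool subtract)
{
    uint4 *d_out = nullptr;
    uint4 result = make_uint4(0, 0, 0, 0);
    cudaMalloc(&d_out, sizeof(uint4));
    arith128_kernel<<<1, 1>>>(a, b, subtract, d_out);
    cudaMemcpy(&result, d_out, sizeof(uint4), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

All four instructions of each operation sit in a single asm statement so that the carry flag is produced and consumed without the compiler scheduling anything in between. Adding 1 to 0x0000000000000000FFFFFFFFFFFFFFFF, for instance, ripples the carry through x and y and leaves 1 in z.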

What about performance? Does inlining PTX like this improve performance? — username_4567, Sep 17 '12 at 07:13
In the example I pointed to, use of inline PTX minimizes the number of instructions required. This helps app performance if the app is instruction throughput limited. Obviously one would want to use such a low-level interface only where it provides a noticeable benefit, but that applies to the use of inline assembly on any platform. Standard caveats about premature optimization apply. — njuffa, Sep 17 '12 at 08:53
