
I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).

Is it more efficient to use them explicitly, rather than writing C/C++ code with the same semantics?

For example: "population count", or popc, means counting the set bits in a word. So should I write:

__device__ int popc(int a) {
  int d = 0;
  while (a != 0) {
    if (a & 0x1) d++;
    a = a >> 1;
  }
  return d;
}

for that functionality, or should I, rather, use:

__device__ int popc(int a) {
    int d;
    asm("popc.u32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}

? Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?

Also, does the GPU actually have special hardware instructions corresponding to these PTX instructions?

einpoklum
luxuia
  • possible duplicate of [Should I look into PTX to optimize my kernel?](http://stackoverflow.com/questions/8080956/should-i-look-into-ptx-to-optimize-my-kernel) – Vitality Sep 02 '13 at 08:59
  • What is "raw code in kernel"? – talonmies Sep 02 '13 at 09:02
  • @JackOLantern There may be some difference between these questions. I'd like to know if the compiler is smart enough to see that what I want to do is just count the set bits. And I have updated the question. – luxuia Sep 02 '13 at 09:28
  • Please note that CUDA provides device functions `__popc()` and `__popcll()`, so there is no need for hand-coded inline PTX in this case. These intrinsics are hardware-accelerated on sm_20 and later platforms. In general, look for a device function first (e.g. `__ffs()` and `__clz()` for finding the least / most significant 1-bit) before diving down to PTX level. Some PTX operations are not exposed through intrinsics, e.g. integer multiply-add with carry, which is a building block for multi-precision integer multiplies. – njuffa Sep 02 '13 at 10:55

2 Answers


The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the PTX assembly the compiler emits, by passing the -ptx flag to nvcc. If the compiler generates the instruction for you, there is no need to hand-code the inline assembly yourself (or use an intrinsic).

Also, whether or not it makes a performance difference in the general case depends on whether the code path is used in a significant way, and on other factors such as your kernel's current performance limiters (e.g. whether it is compute-bound or memory-bound).
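Concretely, the inspection described above might look like this (file names are hypothetical; requires the CUDA toolkit):

```shell
# Emit PTX and look for a popc instruction the compiler may have produced.
nvcc -ptx popc.cu -o popc.ptx
grep popc popc.ptx

# PTX is compiled further by ptxas; to see the final machine code (SASS):
nvcc -cubin -arch=sm_20 popc.cu -o popc.cubin
cuobjdump -sass popc.cubin
```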

Robert Crovella
  • Exactly. Analysis of compiler emitted PTX and disassembly of SASS are the only way to know what idiom analysis there is in the compiler and whether it actually makes any difference to performance. – talonmies Sep 02 '13 at 14:09

A few more points in addition to @RobertCrovella's answer:

  • Even if you do use inline PTX for something, that should happen rarely. Limit it to small functions of no more than a few PTX lines, which you can then reuse for multiple purposes as you see fit, with most of your coding done in C/C++.
  • An example of this principle are the intrinsics @njuffa mentioned, declared in CUDA's device-function headers. Read through them to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
  • For your specific example: you do want the popc instruction over the bit-by-bit loop; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding `__popc()` intrinsic (again, as @njuffa has noted).
  • You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
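As a sketch of that recommendation (an illustrative kernel, not code from the question), the intrinsic route keeps everything in C/C++:

```cuda
// Uses the __popc() intrinsic instead of hand-written inline PTX;
// on sm_20 and later this compiles down to the popc.u32 instruction.
__global__ void count_bits(const unsigned int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __popc(in[i]);
}
```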
einpoklum