I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
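__device__ int popc(int a) {
    // Work on an unsigned copy: right-shifting a negative int is an
    // arithmetic shift, so the loop would never terminate for a < 0.
    unsigned int bits = (unsigned int)a;
    int d = 0;
    while (bits != 0) {
        if (bits & 0x1) d++;
        bits = bits >> 1;
    }
    return d;
}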
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
    int d;
    // popc takes a .b32 operand; %0 is the output d, %1 is the input a.
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}
Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?
Also, does the GPU have some extra "magic" hardware instructions corresponding to these PTX instructions?
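
For comparison, there is also a third option I am aware of: the CUDA built-in intrinsic __popc(). Below is a minimal sketch of how the three variants could be checked against each other. The names popc_loop, popc_ptx and compare_popc, and the test inputs, are my own placeholders for illustration; this only checks that the results agree and says nothing about performance.

```cuda
#include <cstdio>

// Plain C++ loop; unsigned so the right shift always terminates.
__device__ int popc_loop(unsigned int a) {
    int d = 0;
    while (a != 0) {
        d += a & 1u;
        a >>= 1;
    }
    return d;
}

// Inline PTX: %0 is the output register, %1 the input register.
__device__ int popc_ptx(unsigned int a) {
    int d;
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}

// Hypothetical kernel: counts inputs where either variant disagrees
// with the built-in __popc() intrinsic.
__global__ void compare_popc(const unsigned int* in, int* mismatches, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int ref = __popc(in[i]);
    if (popc_loop(in[i]) != ref || popc_ptx(in[i]) != ref) {
        atomicAdd(mismatches, 1);
    }
}

int main() {
    const int n = 4;
    unsigned int h_in[n] = {0u, 1u, 0xF0F0F0F0u, 0xFFFFFFFFu};
    unsigned int* d_in = nullptr;
    int* d_mis = nullptr;
    int h_mis = 0;

    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_mis, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemset(d_mis, 0, sizeof(int));

    compare_popc<<<1, n>>>(d_in, d_mis, n);
    cudaMemcpy(&h_mis, d_mis, sizeof(int), cudaMemcpyDeviceToHost);

    printf("mismatches: %d\n", h_mis);

    cudaFree(d_in);
    cudaFree(d_mis);
    return 0;
}
```

I assume the generated machine code for each variant could then be compared by dumping the SASS (for example with cuobjdump -sass), which might also show whether the hardware has a native population count instruction.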