PTX is an intermediate representation: C/C++ GPU code is compiled into it and, eventually, translated into a specific micro-architecture's SASS assembly language. It is thus not supposed to be encumbered by the holes, gaffes, flukes, and idiosyncrasies in the actual instruction sets of specific nVIDIA GPU micro-architectures.
Now, PTX has an instruction for counting the number of leading zeros in a register: `clz`. Yet it lacks a corresponding `ctz` instruction, which counts the number of trailing zeros. These operations are 'symmetric', and one would certainly expect to see either both or neither in an instruction set, especially if it's abstract and not bound to what's available on a specific piece of hardware. Popular CPU architectures have had both for many years.
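To make the 'symmetry' concrete: counting trailing zeros is the same as counting leading zeros of the bit-reversed word. The following is a minimal host-side sketch using loop-based reference models, not how any compiler actually lowers these operations. The names `ref_clz`, `ref_ctz`, `ref_brev` and `check_symmetry` are hypothetical, and the result 32 for a zero input is the convention `clz` uses for 32-bit operands, assumed here for `ctz` as well.

```cpp
#include <cassert>
#include <cstdint>

// Loop-based reference models of the 32-bit operations discussed above.
// All names are hypothetical and chosen for illustration only.

// Count leading zeros; yields 32 when x == 0.
inline int ref_clz(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// Count trailing zeros; yields 32 when x == 0 (assumed convention).
inline int ref_ctz(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 1u; mask != 0 && (x & mask) == 0; mask <<= 1) {
        ++n;
    }
    return n;
}

// Reverse the order of the 32 bits (bit 0 becomes bit 31, and so on).
inline std::uint32_t ref_brev(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// The symmetry: each count equals the other count applied to the reversed word.
inline void check_symmetry(std::uint32_t x) {
    assert(ref_ctz(x) == ref_clz(ref_brev(x)));
    assert(ref_clz(x) == ref_ctz(ref_brev(x)));
}
```

In device code the analogous composition would presumably be `__clz(__brev(x))`, which is still two operations rather than one.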
Strangely enough, the CUDA header `device_functions.h` declares the function:
/**
* \brief Find the position of the least significant bit set to 1 in a 32 bit integer.
*
* [etc.]
*
* \return Returns a value between 0 and 32 inclusive representing the position of the first bit set.
* - __ffs(0) returns 0.
*/
__DEVICE_FUNCTIONS_DECL__ __device_builtin__ int __ffs(int x);
This function:
- has almost the same semantics as count-trailing-zeros: for a nonzero input it returns the trailing-zero count plus one (positions are 1-based), and it differs otherwise only on an all-zero input, where it returns 0.
- does not translate into a single PTX instruction, but rather two: bitwise negation, then a `clz`.
- is also missing its potential counterpart, `__fls` (find last set).
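For reference, here is a rough sketch of how the two missing operations could be derived from what is available. It assumes the documented `__ffs` contract quoted above (1-based position, 0 for a zero input) and that `__clz` returns 32 for a zero input. The wrapper names (`find_first_set`, `count_leading_zeros`, `count_trailing_zeros`, `find_last_set`, `check_bit_helpers`) and the `HOST_DEVICE` macro are hypothetical, and returning 32 from `count_trailing_zeros(0)` is a chosen convention, not something the headers specify. The host branches are portable fallbacks so the file also compiles as plain C++.

```cpp
#include <cassert>
#include <cstdint>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// 1-based index of the lowest set bit; 0 for a zero input (the __ffs contract).
HOST_DEVICE inline int find_first_set(std::uint32_t x) {
#ifdef __CUDA_ARCH__
    return __ffs(static_cast<int>(x));
#else
    if (x == 0) return 0;
    int pos = 1;
    while ((x & 1u) == 0) { x >>= 1; ++pos; }
    return pos;
#endif
}

// Number of leading zero bits; 32 for a zero input.
HOST_DEVICE inline int count_leading_zeros(std::uint32_t x) {
#ifdef __CUDA_ARCH__
    return __clz(static_cast<int>(x));
#else
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
#endif
}

// ctz derived from ffs: subtract one for nonzero inputs.
// Returning 32 for zero is an assumed convention.
HOST_DEVICE inline int count_trailing_zeros(std::uint32_t x) {
    return x != 0 ? find_first_set(x) - 1 : 32;
}

// The hypothetical __fls counterpart: 1-based index of the highest set bit,
// 0 for a zero input, derived from clz.
HOST_DEVICE inline int find_last_set(std::uint32_t x) {
    return 32 - count_leading_zeros(x);
}

// Host-side sanity check of the derived helpers.
inline void check_bit_helpers() {
    assert(count_trailing_zeros(8u) == 3);
    assert(find_last_set(6u) == 3);
}
```

Both derived helpers cost an extra arithmetic operation or branch on top of the underlying intrinsic, which is part of what makes the absence of dedicated instructions noticeable.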
So, why is that? Why is an apparently obvious-to-have instruction missing from PTX, and a "fake builtin" that's almost identical to it present in the headers?