Why does CUDA PTX have clz but no ctz, and CUDA headers have "fake ffs" but no fls?

Question

PTX is an intermediate representation into which C/C++ GPU code is compiled before it is eventually translated into the SASS assembly language of an individual micro-architecture. Thus it is not supposed to be encumbered by specific holes/gaffes/flukes/idiosyncrasies in the actual instruction sets of specific NVIDIA GPU micro-architectures.

Now, PTX has an instruction for counting the number of leading zeros in a register: clz. Yet it lacks a corresponding ctz instruction, which would count the number of trailing zeros. These operations are 'symmetric' and one would certainly expect to see either both or neither in an instruction set - again, especially if it's abstract and not bound to what's available on a specific piece of hardware. Popular CPU architectures have had both for many years.

Strangely enough, the CUDA header device_functions.h declares the function:

/**
 * \brief Find the position of the least significant bit set to 1 in a 32 bit integer.
 *
 * [etc.]
 *
 * \return Returns a value between 0 and 32 inclusive representing the position of the first bit set.
 * - __ffs(0) returns 0.
 */
__DEVICE_FUNCTIONS_DECL__ __device_builtin__ int __ffs(int x);

This function:

- has almost the same semantics as count-trailing-zeros - only differing on an all-zero input.
- does not translate into a single PTX instruction, but rather two: bitwise negation, then a clz.
- is also missing its potential counterpart, __fls - find last set.
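
To make the relationship between ffs and count-trailing-zeros concrete, here is a minimal host-side sketch. The names `clz32`, `ffs32` and `ctz32` are hypothetical helpers, not CUDA or PTX identifiers, and the sequence shown is one well-known way to build ffs on top of clz, not necessarily what the CUDA compiler emits. For non-zero input, ffs(x) equals ctz(x) + 1; the two differ on zero, where POSIX ffs returns 0 and ctz is assumed here to return 32.

```cpp
#include <cstdint>

// Portable stand-in for a count-leading-zeros primitive.
// Returns 32 for x == 0, the same convention as CUDA's __clz().
int clz32(std::uint32_t x) {
    int n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1) {
        ++n;
    }
    return n;
}

// POSIX-style ffs: 1-based position of the lowest set bit, 0 when no bit is set.
int ffs32(std::uint32_t x) {
    std::uint32_t lowest = x & (0u - x);  // isolate the lowest set bit (0 if x == 0)
    return 32 - clz32(lowest);            // clz32(0) == 32, so ffs32(0) == 0
}

// Count trailing zeros, assuming the convention ctz(0) == 32.
int ctz32(std::uint32_t x) {
    return x == 0 ? 32 : ffs32(x) - 1;
}
```

In device code the stand-in `clz32` would presumably be replaced by the `__clz()` intrinsic; the special case in `ctz32` is exactly the all-zero difference mentioned in the list above.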

So, why is that? Why is an apparently obvious-to-have instruction missing from PTX, and a "fake builtin" that's almost identical to it present in the headers?

Try and keep the gratuitous nonsense tag creation to a minimum. Tags are intended for search and question classification, not haiku. — talonmies, Apr 23 '17 at 10:22
@talonmies: (1) A tag regarding count-trailing-zeros (not just in CUDA) seems reasonable to have. (2) I hope that's not why you've downvoted. — einpoklum, Apr 23 '17 at 10:25
At least ARM doesn't have `ctz` but has `clz`. To count the number of trailing zeroes, you first use `rbit` to reverse bits and then `clz`. There is really no point in wasting opcode space for both rarely used functions when one can easily be expressed through the other. — fuz, Apr 23 '17 at 10:43
And note that `__ffs()` is probably present because it is part of POSIX. See [ffs() in POSIX](http://pubs.opengroup.org/onlinepubs/9699919799/functions/ffs.html). — fuz, Apr 23 '17 at 10:44
@einpoklum: Let's put this into some perspective. There are 13.7 million questions on [SO] today. Somehow, we have gotten by without a CTZ tag for almost 10 years. Somehow, I suspect we will get by without something which is obviously a meta tag (see http://stackoverflow.com/help/tagging) for another 10 years. If you feel strongly about this -- meta.stackoverflow.com is the place to go and have a meta discussion about tagging. You'll love it there, I'm sure. — talonmies, Apr 23 '17 at 11:14
@talonmies: I don't feel strongly about it. I will say that there's an element of inertia here. The first several questions on some issue typically don't add a new tag for it; and people get used to just using the general tag for it; and new users don't add tags anyways. And yet sometimes someone gets the idea to create the tag; at first it's just one question among millions, but it may be retroactively applied to other questions; and then it may or may not get adopted. I think that was the case for at least one CUDA-related tag I created last year (maybe gpu-shared-memory? I forget). — einpoklum, Apr 23 '17 at 11:22
@fuz: That's an interesting point about why `ffs()` is available; but if that were the case - it should still not be presented as an intrinsic. And it's not like you have all other POSIX functions available on the device. — einpoklum, Apr 23 '17 at 11:24
There's no precise heuristic at NVIDIA to determine whether or not a particular (somewhat obscure - i.e. not obvious to everyone that it is needed) intrinsic will be implemented. If it comes to the attention of the CUDA developers, and they get a strong sense that it would be valuable, then it may happen. A correct answer (not speculation) to this question probably could only come from the CUDA designers, or possibly someone like @njuffa who used to be a CUDA designer. — Robert Crovella, Apr 23 '17 at 14:31
One of your premises appears to be "hey - it's a virtual architecture, why not have symmetry?" or something akin to that. I can say that instruction set (virtual or otherwise) bloat has definite development, maintenance, and QA costs, so simply adding instructions for "symmetry" might not be a very strong motivator. — Robert Crovella, Apr 23 '17 at 14:33
@einpoklum Indeed. However, `ffs()` is the only standard function that provides this kind of functionality, so choosing its interface over other possible designs isn't a bad idea. — fuz, Apr 23 '17 at 15:10
@RobertCrovella: Well, that's an answer to my question; perhaps make it one? — einpoklum, Apr 23 '17 at 15:20
@einpoklum: That isn't an answer, it is well informed speculation. And that is the problem with the question. **All** answers you are likely to get are either going to be opinion or speculation. — talonmies, Apr 23 '17 at 16:53
@talonmies: I've had this argument on other "why is XYZ the case" questions many times. No, you're wrong. Speculative answers should not be answers. The question's implicit interpretation is the one that's in line with what is on-topic here, so only answers based on _knowledge_ are valid. That knowledge can be of the history of some decision (which Robert might be privy to), or of some technical issue which rules out other options or makes the one you ask about more obvious. And when asking such questions, OPs do not need to spell out "only non-opinion-based answers please"; it comes with the territory. — einpoklum, Apr 23 '17 at 17:30
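
The reverse-then-clz approach that fuz describes for ARM in the comments above can be sketched in portable C++ as follows. The helper names are hypothetical; `reverse_bits32` stands in for a single-instruction bit reversal such as ARM's `rbit` or CUDA's `__brev()`, and `count_leading_zeros32` for `clz`. The convention that the result is 32 for a zero input is an assumption of this sketch.

```cpp
#include <cstdint>

// Reverse the bit order of a 32-bit word by swapping progressively larger groups.
std::uint32_t reverse_bits32(std::uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  // adjacent bits
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  // bit pairs
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  // nibbles
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  // bytes
    return (x >> 16) | (x << 16);                              // half-words
}

// Count leading zeros; returns 32 for x == 0.
int count_leading_zeros32(std::uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        ++n;
    }
    return n;
}

// Trailing zeros of x are the leading zeros of the bit-reversed x.
int ctz_via_reverse(std::uint32_t x) {
    return count_leading_zeros32(reverse_bits32(x));
}
```

With hardware support for both steps this comes to two instructions, which is the point of the comment: one primitive plus a bit reversal covers both directions.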

score 3 · Answer 1 · answered Apr 23 '17 at 19:53

Generally speaking, as with the x86 architecture, many features of CUDA and GPU architecture have accumulated organically, based on customer feedback and demands, rather than originating in some grand unified orthogonal design.

I personally added the __ffs() and __ffsll() device function intrinsics to CUDA. They were included because they represent useful bit-manipulation primitives and exactly match the ffs() functionality defined by POSIX.

For bit manipulations, in particular for the implementation of fixed-point operations and floating-point emulation, CLZ is a much more important operation than CTZ. Initially, I implemented __clz() and __clzll() in CUDA as short emulation sequences. Hardware support for CLZ was added later. I was not part of the hardware architecture team, but I am reasonably sure the instruction was added based on customer feedback.
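
As an illustration only, a short software emulation of CLZ and the kind of normalization step that makes it useful in floating-point emulation might look like the sketch below. This is a generic textbook binary search, not the sequence that was actually used in CUDA; `clz_emulated`, `Normalized` and `normalize` are hypothetical names.

```cpp
#include <cstdint>

// Software clz via binary search over the word; returns 32 for x == 0.
int clz_emulated(std::uint32_t x) {
    if (x == 0) return 32;
    int n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

// Result of normalizing a mantissa: the shifted value and the shift amount,
// which an emulation routine would subtract from the exponent.
struct Normalized {
    std::uint32_t mantissa;
    int shift;
};

// Shift the mantissa left until its most significant bit is set.
Normalized normalize(std::uint32_t mantissa) {
    int shift = clz_emulated(mantissa);
    if (shift == 32) return {0u, 0};  // zero has no leading one to align
    return {mantissa << shift, shift};
}
```

Normalization needs the position of the highest set bit, which is what CLZ provides directly; no comparable step in this kind of code depends on the lowest set bit.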

One of the major goals of PTX is to expose underlying hardware functionality in an abstracted form, as each GPU generation makes significant changes to the actual microarchitecture of the hardware. This virtual ISA is intended as a thin wrapper around native instruction sets. Native GPU instruction sets are quite minimal. For example, there are no instructions for division. Simple hardware makes for smaller die area and higher core counts.

To provide a practical target for existing compilers (CUDA has used the Open64 and LLVM toolchains), some higher level operations were added to PTX, even though underlying hardware support is lacking. As this represents a software support burden, there is probably little incentive to add more such ops. Not all of the existing emulations are of the highest possible performance. During my tenure at NVIDIA I worked on optimizing the emulation sequences for the most important PTX operations.

CUDA users can submit requests for enhancements (such as inclusion of CTZ as a PTX operation) via the bug reporting mechanism.

Can you elaborate, if only briefly, on why adding an instruction, which is not radically different from existing ones - such as `ctz` - would be a software support burden? Anyway, +1 for the explanation. Also, you mentioned the goal of exposing underlying hardware functionality; why, then, do we have `__ffs()` but not `__fls()`, which is actually available in hardware? — einpoklum, Apr 23 '17 at 20:00
@einpoklum: Adding `CTZ` as a PTX-level instruction for which there is no hardware support would require maintaining emulation code across the roughly four different GPU architectures that must be supported at any given time (currently: 2.x, 3.x, 5.x, 6.x). I do not recall *any* user requests for `__fls()` and in fact I am not familiar with that function [where is that specified?]. My suggestion: File RFEs for what you want and see what happens. — njuffa, Apr 23 '17 at 20:04
I will file RFEs, but I wanted to understand whether such an RFE makes sense or whether there's something I'm ignoring. Anyway, there's an FLO - find last one - SASS operation. See the answers to [this question](http://stackoverflow.com/q/43564727/1593077). — einpoklum, Apr 23 '17 at 20:12
No idea when `FLO` was added to the native GPU instruction set, and which GPU generations support it (sometimes instructions that used to be available disappear in future architectures; thus the need for PTX). `FLO` could be a low-throughput instruction, you may want to benchmark. — njuffa, Apr 23 '17 at 20:21
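
For readers unfamiliar with the name: the `__fls()` discussed in these comments is hypothetical as a CUDA intrinsic. The sketch below assumes the convention of `fls()` as found in BSD libc and the Linux kernel (1-based position of the highest set bit, 0 for a zero input); `fls32` is an illustrative name only.

```cpp
#include <cstdint>

// Find last set: 1-based position of the most significant set bit, 0 if x == 0.
int fls32(std::uint32_t x) {
    int pos = 0;
    while (x != 0) {
        ++pos;
        x >>= 1;
    }
    return pos;  // same value as 32 - clz(x) when clz(0) is defined as 32
}
```

Under that convention the function is a one-subtraction wrapper around CLZ, so in device code it could presumably be written as `32 - __clz(x)`.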
