5

How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently?

I know GCC's built-in functions:

  • __builtin_clz, __builtin_clzl, __builtin_clzll
  • __builtin_ffs, __builtin_ffsl, __builtin_ffsll

However, these functions only work with 32- and 64-bit integers.

I also found some SSE instructions:

  • __lzcnt16, __lzcnt, __lzcnt64

As you may guess, these only work with 16-, 32- and 64-bit integers.

Is there any similar, efficient built-in functionality for 128-bit integers?

user1494080
  • I assume solving it for two 64 bit integers, then combining, is too expensive for ya? – Yakk - Adam Nevraumont Feb 10 '15 at 03:28
  • Well, I have to do it that way, provided no one knows a better solution. However, a single instruction would be probably more efficient and definitely more beautiful than the whole shifting, casting, branching etc stuff. – user1494080 Feb 10 '15 at 03:47
  • You can hide ugly by wrapping it up in a function. – Mark Ransom Feb 10 '15 at 04:33
  • Does this help? https://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/ – John Zwinck Feb 10 '15 at 04:40
  • What makes you think that your uint128_t is using an SSE register? It's likely using two 64-bit registers anyway. [SSE registers don't have a 128-bit FLAGS register so they are less useful for big integer arithmetic](https://stackoverflow.com/questions/27923192/practical-bignum-avx-sse-possible/27978043#27978043). The `bsr` and `lzcnt` instructions set the zero and carry flags so you should be able to use that to your advantage. – Z boson Feb 10 '15 at 07:35
  • @JohnZwinck, that's an interesting link, but the OP wants the leading zeros in `uint128_t` not the leading zeros in a SSE register. – Z boson Feb 10 '15 at 08:20
  • I don't think that an `uint128_t` uses an SSE register. However, if there was an SSE instruction for my purpose, it might make sense to load the `uint128_t` into an SSE register and then execute the corresponding instruction. – user1494080 Feb 10 '15 at 13:41

3 Answers

6
inline int clz_u128 (uint128_t u) {
  uint64_t hi = u >> 64;
  uint64_t lo = u;
  int retval[3] = {
    __builtin_clzll(hi),        // valid when hi != 0
    __builtin_clzll(lo) + 64,   // valid when hi == 0, lo != 0
    128                         // hi == 0 and lo == 0
  };
  int idx = !hi + ((!lo) & (!hi));  // selects 0, 1 or 2
  return retval[idx];
}

This is a branch-free variant. Note that it does more work than the branchy solution, and in practice the branching there will probably be well predicted.

It also relies on __builtin_clzll not crashing when fed 0: the docs say the result is undefined, but is it just unspecified or undefined?
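As a quick sanity check, the function can be exercised on a few known bit patterns. This is a self-contained sketch that assumes `uint128_t` is a typedef for gcc's `unsigned __int128` (the typedef name is taken from the question, not from any standard header):

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;  /* assumed typedef; gcc/clang extension */

/* Branch-free variant from above, with the +64 fix applied. Note that both
 * __builtin_clzll calls are evaluated even when their input is 0; the idx
 * lookup merely discards the (undefined) result in those cases. */
static inline int clz_u128 (uint128_t u) {
  uint64_t hi = u >> 64;
  uint64_t lo = u;
  int retval[3] = {
    __builtin_clzll(hi),
    __builtin_clzll(lo) + 64,
    128
  };
  int idx = !hi + ((!lo) & (!hi));
  return retval[idx];
}
```

For example, `clz_u128(1)` should return 127, and a value with only the top bit set should return 0.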

Yakk - Adam Nevraumont
  • One has to append `+ 64` to `__builtin_clzll(lo)`, right? An equivalent alternative to `__builtin_clzll()` which works on all inputs is `64 - __builtin_ffsll()`. – user1494080 Feb 10 '15 at 14:40
  • @user1494080 yes, oops. If `__builtin_clzll` undefinedness is just "I output garbage" we are ok. If it is "I crash", that is not ok. I don't know what they mean by "undefined"; maybe they defined what they mean by it somewhere. – Yakk - Adam Nevraumont Feb 10 '15 at 14:47
  • I accepted this approach since it is a tiny bit faster in my application than the other. – user1494080 Feb 10 '15 at 23:47
  • I can't imagine a scenario where it would 'crash' - I think it's like trying to do a shift (shlq/shrq) by more than 63 bit positions. Technically, Intel may have left itself room to make it fault in the future, but ... it's never going to happen - and it would break older code that similarly ignores the results of a zero value. It would be interesting to know what code is generated with the `-mlzcnt` flag. – Brett Hale Feb 13 '15 at 09:57
  • Yakk and @BrettHale: The gcc docs don't use the term "undefined behaviour", so I think they just mean the *value* of the result is undefined in that case. We can be much more specific when compiling for x86, where it will compile to [LZCNT](http://www.felixcloutier.com/x86/LZCNT.html) (fully defined results) or [`BSR`](http://www.felixcloutier.com/x86/BSR.html), which leaves its destination register with undefined contents when the input is 0 (and sets ZF based on the *input*, so you can detect it without a separate compare.) Intel's manual does not leave room for BSF/BSR to fault. :) – Peter Cordes Nov 06 '16 at 03:56
  • BTW, in Intel's implementations, the actual behaviour is that the destination register is unmodified, because they choose to go above and beyond the ISA specs (not breaking some specific legacy code is the usual reason). See also my discussion of the similar behaviour for TZCNT in [my Collatz-conjecture asm answer](http://stackoverflow.com/a/40355466/224132), since the situation is the same except that you can't usefully use REP BSR for good measure when you're not sure if the target CPU will decode it as BSR or LZCNT: they return opposite results. (Unlike TZCNT / BSF). – Peter Cordes Nov 06 '16 at 03:59
4

Assuming a 'random' distribution, the first non-zero bit will be in the high 64 bits with overwhelming probability, so it makes sense to test that half first.

Have a look at the code generated for:

/* inline */ int clz_u128 (uint128_t u)
{
    unsigned long long hi, lo; /* (or uint64_t) */
    int b = 128;

    if ((hi = u >> 64) != 0) {
        b = __builtin_clzll(hi);
    }
    else if ((lo = u & ~0ULL) != 0) {
        b = __builtin_clzll(lo) + 64;
    }

    return b;
}

I would expect gcc to implement each __builtin_clzll using the bsrq instruction - bit scan reverse, i.e., most-significant bit position - in conjunction with an xor, (msb ^ 63), or sub, (63 - msb), to turn it into a leading zero count. gcc might generate lzcnt instructions with the right -march= (architecture) options.


Edit: others have pointed out that the 'distribution' is not relevant in this case, since the HI uint64_t needs to be tested regardless.

Brett Hale
  • Depends on what random distribution. For uniform distribution over the entire space, definitely. For others, perhaps not. – Ben Voigt Feb 10 '15 at 05:00
  • How branchy! I would be tempted by a 3-size array lookup solution. But branch prediction will probably make the above faster. And yes, unless the 128-bit value is a fractional part of a random irrational, the assumption of uniform random distribution is questionable. However you need to test the non-zero high qword even without that assumption: try it. The assumption is a red herring. – Yakk - Adam Nevraumont Feb 10 '15 at 12:45
  • @BrettHale It's not obvious what the distribution looks like, but it's not a uniform distribution. However, I always have to test the high bits first, don't I? Another question: Is the `& ~0ULL` part necessary? Doesn't a narrowing conversion always truncate the high bits automatically? – user1494080 Feb 10 '15 at 13:24
  • @Yakk Could you please write a few more words about your lookup solution? I haven't got the idea yet. – user1494080 Feb 10 '15 at 13:25
  • Just want to share some research: According to § 4.7 [conv.integral] of the C++ standard, the `& ~0ULL` part is not necessary. – user1494080 Feb 10 '15 at 17:18
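That last point can be checked directly: C's unsigned integer conversions are defined to reduce the value modulo 2^64 when narrowing to a 64-bit type, so the explicit mask changes nothing. A small sketch (the helper name `mask_is_redundant` is illustrative, and `uint128_t` is assumed to be gcc's `unsigned __int128`):

```c
#include <stdint.h>

typedef unsigned __int128 uint128_t;  /* gcc/clang extension */

/* Returns 1 if masking with ~0ULL before the narrowing conversion
 * produces the same value as the plain narrowing conversion. */
static inline int mask_is_redundant (uint128_t u) {
  unsigned long long masked    = (unsigned long long)(u & ~0ULL);
  unsigned long long truncated = (unsigned long long)u;
  return masked == truncated;
}
```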
3

Yakk's answer works well for all kinds of targets, as long as gcc supports 128-bit integers for the target. However, note that on the x86-64 platform, with an Intel Haswell processor or newer, there is a more efficient solution:

#include <immintrin.h>
#include <stdint.h>
// tested with compiler options: gcc -O3 -Wall -m64  -mlzcnt

inline int lzcnt_u128 (unsigned __int128 u) {
  uint64_t hi = u>>64;
  uint64_t lo = u;
  lo = (hi == 0) ? lo : -1ULL;  // if hi != 0, force _lzcnt_u64(lo) to 0
  return _lzcnt_u64(hi) + _lzcnt_u64(lo);
}

The _lzcnt_u64 intrinsic compiles (gcc 5.4) to the lzcnt instruction, which is well defined for a zero input (it returns 64), unlike gcc's __builtin_clzll(). The ternary operator compiles to a cmove instruction.
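For targets without the LZCNT instruction, the same structure can be kept by emulating lzcnt's defined-at-zero behaviour with a guarded `__builtin_clzll`. This is a sketch; `lzcnt64_safe` and `lzcnt_u128_portable` are illustrative names, not part of any library:

```c
#include <stdint.h>

/* lzcnt semantics: returns 64 for a zero input,
 * unlike raw __builtin_clzll, whose result is then undefined. */
static inline int lzcnt64_safe (uint64_t x) {
  return x ? __builtin_clzll(x) : 64;
}

static inline int lzcnt_u128_portable (unsigned __int128 u) {
  uint64_t hi = u >> 64;
  uint64_t lo = u;
  lo = (hi == 0) ? lo : ~0ULL;  /* same trick: force the low count to 0 when hi != 0 */
  return lzcnt64_safe(hi) + lzcnt64_safe(lo);
}
```

The branch in `lzcnt64_safe` typically compiles to a cmov, so the overall shape stays close to the intrinsic version.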

wim