How to efficiently count leading zeros in a 24 bit unsigned integer?

Question

Most of the clz() (SW impl.) are optimized for 32 bit unsigned integer.

UPD. Target's characteristics:

CHAR_BIT                 24
sizeof(int)              1
sizeof(long int)         2
sizeof(long long int)    3

Are you talking about a real data type holding only 24 bits or are you talking about using 24 bits within a native 32bit type? — Gerhardh, Dec 10 '21 at 17:22
How is your 24-bit unsigned integer stored? In a 32-bit int, or as something like `uint8_t[3]`? — Steve Summit, Dec 10 '21 at 17:42
@Ben endian doesn't matter in this context. And leading means the highest to lowest bit. — Fredrik, Dec 10 '21 at 18:33
@dbush Similar [question](https://stackoverflow.com/q/39517650/1778275). — pmor, Dec 10 '21 at 18:48
@Fredrik Of course it matters. This is bit-twiddling exercise, so you need to know how the bits are laid out in memory. — Ben, Dec 10 '21 at 19:52
@Ben no you don't. Endian only matters if you break down a multi part byte type into a lesser one, for example casting an int pointer to a char pointer. Using bitwise operations on a type doesn't care about endian. — Fredrik, Dec 10 '21 at 20:10
@Ben for example' "n & 0xFF" will always give you the least significant byte of n, regardless of endian! — Fredrik, Dec 10 '21 at 20:11
@Fredrik consider the answer given below by O____ which relies on bit layout. — Ben, Dec 10 '21 at 20:39
@dbush OP is clearly programming a DSP or some such embedded CPU, where 24bit types are common. — Ben, Dec 10 '21 at 20:42
@Ben yes if you are gonna memcpy it matters. But OP has an int which is 24 bits :) — Fredrik, Dec 10 '21 at 20:44
@Fredrik Looks like user has a 48-bit uint he can use. Eyeballing it looks like the uint32_t clz would work just fine substituting uint42_t??? — Ben, Dec 13 '21 at 09:20

rici · Accepted Answer · 2021-12-11T20:56:44.200

TL;DR: See point 4 below for the C program.

Assuming your hypothetical target machine is capable of correctly implementing unsigned 24-bit multiplication (which must return the low-order 24 bits of the product), you can use the same trick as is shown in the answer you link. (But you might not want to. See [Note 1].) It's worth trying to understand what's going on in the linked answer.

The input is reduced to a small set of values, where all integers with the same number of leading zeros map to the same value. The simple way of doing that is to flood every bit to cover all the bit positions to the right of it:

    x |= x>>1;
    x |= x>>2;
    x |= x>>4;
    x |= x>>8;
    x |= x>>16;

That will work for 17 up to 32 bits; if your target datatype has 9 to 16 bits, you could leave off the last shift-and-or because there is no bit position 16 bits to the right of any bit. And so on. But with 24 bits, you'll want all five shift-and-or.

With that, you've turned x into one of 25 values (for 24-bit ints):

       x clz         x clz         x clz         x clz         x clz
-------- ---  -------- ---  -------- ---  -------- ---  -------- ---
0x000000  24  0x00001f  19  0x0003ff  14  0x007fff   9  0x0fffff   4
0x000001  23  0x00003f  18  0x0007ff  13  0x00ffff   8  0x1fffff   3
0x000003  22  0x00007f  17  0x000fff  12  0x01ffff   7  0x3fffff   2
0x000007  21  0x0000ff  16  0x001fff  11  0x03ffff   6  0x7fffff   1
0x00000f  20  0x0001ff  15  0x003fff  10  0x07ffff   5  0xffffff   0

Now, to turn x into clz, we need a good hash function. We don't necessarily expect that hash(x)==clz, but we want the 25 possible x values to hash to different numbers, ideally in a small range. As with the link you provide, the hash function we'll choose is to multiply by a carefully-chosen multiplicand and then mask off a few bits. Using a mask means that we need to choose five bits; in theory, we could use a 5-bit mask anywhere in the 24-bit word, but in order to not have to think too much, I just chose the five high-order bits, the same as the 32-bit solution. Unlike the 32-bit solution, I didn't bother adding 1, and I expect to distinct values for all 25 possible inputs. The equivalent isn't possible with a five-bit mask and 33 possible clz values (as in the 32-bit case), so they have to jump through an additional hoop if the original input was 0.

Since the hash function doesn't directly produce the clz value, but rather a number between 0 and 31, we need to translate the result to a clz value, which uses a 32-byte lookup table, called debruijn in the 32-bit algorithm for reasons I'm not going to get into.
An interesting question is how to select a multiplier with the desired characteristics. One possibility would be to do a bunch of number theory to elegantly discover a solution. That's how it was done decades ago, but these days I can just write a quick-and-dirty Python program to do a brute force search over all the possible multipliers. After all, in the 24-bit case there are only about 16 million possibilities and lots of them work. The actual Python code I used is:
```
# Compute the 25 target values
targ=[2**i - 1 for i in range(25)]
# For each possible multiplier, compute all 25 hashes, and see if they
# are all different (that is, the set of results has size 25):
next(i for i in range(2**19, 2**24)
       if len(targ)==len(set(((i * t) >> 19) & 0x1f
                              for t in targ)))
```
Calling next on a generator expression returns the first generated value, which in this case is 0x8CB4F, or 576335. Since the search starts at 0x80000 (which is the smallest multiplier for which hash(1) is not 0), the result printed instantly. I then spent a few more milliseconds to generate all the possible multipliers between 2¹⁹ and 2²⁰, of which there are 90, and selected 0xCAE8F (831119) for purely personal aesthetic reasons. The last step is to create the lookup table from the computed hash function. (Not saying this is good Python. I just took it from my command history; I might come back and clean it up later. But I included it for completeness.):
```
lut = dict((i,-1) for i in range(32))
lut.update((((v * 0xcae8f) >> 19) & 0x1f, 24 - i)
           for i, v in enumerate(targ))
print("  static const char lut[] = {\n    " +
      ",\n    ".join(', '.join(f"{lut[i]:2}" for i in range(j, j+8))
                     for j in range(0, 32, 8)) +
      "\n  };\n")
# The result is pasted into the C code below.
```

So then it's just a question of assembling the C code:

// Assumes that `unsigned int` has 24 value bits.
int clz(unsigned x) {
  static const char lut[] = {
    24, 23,  7, 18, 22,  6, -1,  9,
    -1, 17, 15, 21, 13,  5,  1, -1,
     8, 19, 10, -1, 16, 14,  2, 20,
    11, -1,  3, 12,  4, -1,  0, -1
  };
  x |= x>>1;
  x |= x>>2;
  x |= x>>4;
  x |= x>>8;
  x |= x>>16;
  return lut[((x * 0xcae8f) >> 19) & 0x1f];
}

The test code calls clz on every 24-bit integer in turn. Since I don't have a 24-bit machine handy, I just assume that the arithmetic will work the same on the hypothetical 24-bit machine in the OP.

#include <stdio.h>

# For each 24-bit integer in turn (from 0 to 2**24-1), if
# clz(i) is different from clz(i-1), print clz(i) and i.
#
# Expected output is 0 and the powers of 2 up to 2**23, with
# descending clz values from 24 to 0.
int main(void) {
  int prev = -1;
  for (unsigned i = 0; i < 1<<24; ++i) {
    int pfxlen = clz(i);
    if (pfxlen != prev) {
      printf("%2d 0x%06X\n", pfxlen, i);
      prev = pfxlen;
    }
  }
  return 0;
}

Notes:

If the target machine does not implement 24-bit unsigned multiply in hardware --i.e., it depends on a software emulation-- then it's almost certainly faster to do the clz by just looping over initial bits, particularly if you fold the loop by scanning several bits at a time with a lookup table. That might be faster even if the machine does do efficient hardware multiplies. For example, you can scan six bits at a time with a 32-entry table:
```
// Assumes that `unsigned int` has 24 value bits.
int clz(unsigned int x) {
  static const char lut[] = {
    5, 4, 3, 3, 2, 2, 2, 2,
    1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0
  };
  /* Six bits at a time makes octal easier */
  if (x & 077000000u) return lut[x >> 19];
  if (x &   0770000u) return lut[x >> 13] + 6;
  if (x &     07700u) return lut[x >>  7] + 12;
  if (x             ) return lut[x >>  1] + 18;
  return 24;
}
```
That table could be reduced to 48 bits but the extra code would likely eat up the savings.

A couple of clarifications seem to be in order here. First, although we're scanning six bits at a time, we only use five of them to index the table. That's because we've previously verified that the six bits in question are not all zero; in that case, the low-order bit is either irrelevant (if some other bit is set) or it's 1. Also, we get the table index by shifting without masking; the masking is unnecessary because we know from the masked tests that all the higher order bits are 0. (This will, however, fail miserably if x has more than 24 bits.)

Confirm: on a 24-bit machine's simulator the test code produces expected output. Since `1<<24` overflows the `for` was changed to `for (unsigned long long i = 0; i <= 0xffffff; ++i)`. — pmor, Dec 27 '21 at 17:42
Legal aspect: do you allow to use your `clz` in commercial software? Asking because _the content contributed on or after 2018-05-02 (UTC) is distributed under the terms of CC BY-SA 4.0_ ([link](https://stackoverflow.com/help/licensing)). And CC BY-SA 4.0 _may have_ (compatibility) issues with licenses for commercial / proprietary software. — pmor, Dec 28 '21 at 20:39
@pmor I don't assert intellectual property rights to that answer. Use it as you please. — rici, Dec 28 '21 at 20:58
About the `clz` in the item 4: is there a need of `if ( x == 0 ) return 24;`? — pmor, Dec 29 '21 at 13:55
Why would you need that? It demonstrably works without, and it's easy to see why clz(0) must be 24. — rici, Dec 29 '21 at 14:25
Higher performance? Or will the non-taken `if` branch affect the performance? Guess it depends on a target. Also it may be known in advance that (for example) the prob. of input 0 is very low / high. — pmor, Dec 29 '21 at 14:58
To justify a special case check for a single value, you'd need to believe that that single value dominates the calls. If the special value is uncommon, so are the savings for eliminating the computation. If the special value is common but not dominant, branch prediction is unlikely to work and the test will add to the cost of every call. The code as provided is branch free, making pipelining simpler. You're free to benchmark a specific use case, but my guess is that for any normal call mix, a special case check increases average execution time. In any case, it is not "needed". — rici, Dec 29 '21 at 15:09
If you're comparing my code with the 32-bit version, I do explain (briefly, at the end of point 2) why the 32-bit version does need the special case check. You can't put 33 different values into 32 pigeonholes. — rici, Dec 29 '21 at 15:12

score 1 · Answer 2 · answered Dec 10 '21 at 17:20

1

Convert the 24 bit integer into a 32 bit one (either by type punning or explicitly shuffling around the bits), then to the 32 bit clz, and subtract 8.

Why do it that way? Because in this day and age you'll be hard pressed to find a machine that deals with 24 bit types, natively, in the first place.

answered Dec 10 '21 at 17:20

datenwolf

159,371
13
185
298

The question indicates it is concerned with software implementations, so hardware clz is irrelevant. There are software solutions, but OP is seeking something that would “prune away” parts of the code that deal with the high eight bits or otherwise optimize for 24 bits rather than 32. So this answer provides nothing of relevance. Also, there is no need for punning or “shuffling” bits; a simple cast to `uint32_t` would produce the value padded to 32 bits, if that were desired. – Eric Postpischil Dec 10 '21 at 17:50
@EricPostpischil if 24 bits value is stored in 3 bytes byte array (for example data from sensor) it will be hard to cast. – 0___________ Dec 10 '21 at 18:03
@0___________: Then you still cannot type-pun it into a 32-bit integer (type-punning accesses with the new type, so it will load 32 bits), and “bit shuffling” is irrelevant. – Eric Postpischil Dec 10 '21 at 18:05
@EricPostpischil you can memcpy it (assuming same endianess) – 0___________ Dec 10 '21 at 18:06
And if there is no "32 bit one"? See UPD. – pmor Dec 10 '21 at 18:49
@pmar Uniparental disomy? – 0___________ Dec 10 '21 at 18:57
24bit types are common on DSPs and various microcontroller type things. OP explicitly told us his `long long int` is 24 bits. He needs a `clz` for a native 24bit type. – Ben Dec 10 '21 at 20:45
1

@Ben Just to be clear, `sizeof(long long int)` returning 3 with `CHAR_BIT = 24`, means that a `long long int` is 72 bits. – nnn Dec 10 '21 at 21:41
So on this platform, the smallest unit of addressable memory is 24 bit, it could be named a "byte". @pmor (OP) could you confirm? – nnn Dec 10 '21 at 21:56

0___________ · Answer 3 · 2021-12-10T18:55:58.127

I would look for the builtin function or intrinsic available for your platform and compiler. Those functions usually implement the most efficient way of finding the most significant bit number. For example, gcc has __builtin_clz function.

If the 24 bit integer is stored in a byte array (for example received from sensor)

#define BITS(x)  (CHAR_BIT * sizeof(x) - 24)
int unaligned24clz(const void * restrict val)
{
    unsigned u = 0;
    memcpy(&u, val, 3);

    #if defined(__GNUC__)
    return __builtin_clz(u) - BITS(u);
    #elif defined(__ICCARM__)
    return __CLZ(u) - BITS(u);
    #elif defined(__arm__)
    return __clz(u) - BITS(u);
    #else 
    return clz(u) - BITS(u); //portable version using standard C features
    #endif
}

If it is stored in valid integer

int clz24(const unsigned u)
{
    #if defined(__GNUC__)
    return __builtin_clz(u) - BITS(u);
    #elif defined(__ICCARM__)
    return __CLZ(u) - BITS(u);
    #elif defined(__arm__)
    return __clz(u) - BITS(u);
    #else 
    return clz(u) - BITS(u); //portable version using standard C features
    #endif
}

https://godbolt.org/z/z6n1rKjba

You can add more compilers support if you need.

Remember if the value is 0 the value of the __builtin_clz is undefined so you will need to add another check.

How to efficiently count leading zeros in a 24 bit unsigned integer?

3 Answers3

Notes: