1

In my code, I found that the processor spends most of its time in the function shown below. The loop is meant to find the value of Val1 that satisfies the condition inside it. The variables Val1 and a are of type long long int (64-bit), and both are local, non-static variables declared inside the function.

long long int findval(long long int x)
{

  long long int Val1,a=x;

  for (Val1 = 63; Val1 > 22; Val1--) 
  {
        if (((a >> Val1) & 1) == 1) 
            break;
  }

  return Val1;
}

Is there a simpler or more optimized way to find the Val1 value?

rkc
  • 4
    There may well be an intrinsic that does the job for you (finds the highest set bit). If it is < 22 then there isn't a bit set in that range. For example `Val1 = __builtin_ctzll(a)` in GCC finds the *lowest* set bit of a 64-bit value. – Weather Vane Feb 21 '20 at 09:26
  • Does the architecture the code will run on have a barrel shifter? If not, shifting a temp variable by 1 each time round the loop could be faster. – Andrew Morton Feb 21 '20 at 09:27
  • https://stackoverflow.com/questions/671815/what-is-the-fastest-most-efficient-way-to-find-the-highest-set-bit-msb-in-an-i – M.M Feb 21 '20 at 09:28
  • `int i = 0; while( a = a>>1 ) {i++;}` – Banzay Feb 21 '20 at 09:30
  • 1
    Are you trying to find the location of the most significant bit that is set, or do you just wish to see if any bit in the data is set? Those are two different algorithms. – Lundin Feb 21 '20 at 09:37
  • With two conditional checks you can split the value so that you then only have to make 8 loops/shifts. For example `if(a > 0x7FFFFFFF)` reduces it to one of two 32-bit ranges, one more check in each of those two cases then reduces it to checking an 8-bit range. – Weather Vane Feb 21 '20 at 09:49
  • 1
    Ok this code finds the left-most bit set, so it's wrong then. Also, it doesn't make much sense to use a signed type, because if you get negative numbers, the MSB will always be the sign bit. – Lundin Feb 21 '20 at 09:58
  • So is the problem to fix the code so that it finds the right-most bit set, or is the problem that the code is slow? Or is the problem that you don't know what the code does? – Lundin Feb 21 '20 at 10:03
  • Ok the function must be rewritten from scratch then. You mean the least significant bit set from byte 3 (bit 23) and upwards, yeah? – Lundin Feb 21 '20 at 10:10
  • (I'm not gonna answer this since it's apparently a "changing the goal posts over time" question, but someone else might make the attempt) – Lundin Feb 21 '20 at 10:14
  • @Lundin Sorry, I was confused. What you said is right. I'm trying to find the most significant set bit position. – rkc Feb 21 '20 at 10:16
  • 1
    We normally count bits from the least significant = first, and function names like [POSIX `ffs`](https://linux.die.net/man/3/ffs) (find first set = count least-significant zeros) reflect this. Your title is backwards for what you say you want in that last comment. You want GNU C `__builtin_clzll`, or equivalent for non-GNU compilers. Unfortunately C still hasn't bothered to give a portable way to take advantage of the HW support for this in various ISAs (Rust has leading/trailing zeros, and popcnt, as basic operations on integer types...) Or you get lucky with compilers recognizing a loop. – Peter Cordes Feb 21 '20 at 10:35
  • 1
    A more interesting question might be why you have to call that function so often. Perhaps the optimization is not in the function, but in exploiting knowledge about `x` from your (global) algorithm. – M Oehm Feb 21 '20 at 11:50

4 Answers

3

For some reason I thought this was tagged x86 and/or x86-64 at some point. My GNU C answer works with any ISA, but I focused on x86-specific intrinsics for MSVC, and how it compiles for x86 with GCC/clang. Unfortunately there isn't a fully portable way to do this efficiently, so it's definitely worth doing some #ifdef to take advantage of HW support for this operation on targets you care about.


It seems you want max(22, 63 - clz(x)), where clz is some count-leading-zeros function, e.g. in GNU C, __builtin_clzll(). 63 - clz(x) is the position of the MSB when long long is int64_t, as it is on x86.

Your Val1 > 22 loop condition becomes false at Val1 = 22, so that's the non-break way out of the loop if no set bit is found by then.

__builtin_clzll has undefined behaviour when its input is zero (so it can compile to 63 - a bsr instruction on x86). We can handle this and the lower bound of 22 by setting that bit in the input before running a bit-scan.

#include <limits.h>
inline
int MSB_position_clamped (long long x)
{
    int maxpos = CHAR_BIT * sizeof(x) - 1;
    x |= 1LL << 22;              // avoid x==0 UB and make clz at least 22
    return maxpos - __builtin_clzll(x);
}

For MSVC, you'd want _BitScanReverse64 (slower on AMD) or 63 - _mm_lzcnt_u64 (requires BMI1). The _mm intrinsic version is available on all x86-64 compilers.
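
A minimal sketch of the MSVC version (my illustration, untested; the function name is mine, and it assumes x86-64 where _BitScanReverse64 is available, reusing the OR-clamp trick from the GNU C version above):

#include <intrin.h>

int MSB_position_clamped_msvc(long long x)
{
    unsigned long pos;   // _BitScanReverse64 writes the bit index here
    // OR in bit 22 so the input is never 0 and the result is clamped to >= 22
    _BitScanReverse64(&pos, (unsigned long long)x | (1ULL << 22));
    return (int)pos;
}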

(As Mike points out, shift counts only need to be int. Wider shift counts are not helpful, especially when compiling for 32-bit machines where long long takes 2 registers).

This compiles efficiently for x86-64, especially with clang (Godbolt). We'd also expect it to inline efficiently to these 2 instructions.

# clang 9.0 -O3 for x86-64 System V
MSB_position_clamped:
        or      rdi, 4194304
        bsr     rax, rdi
        ret

(x86 legacy bit-scan instructions find the bit-position directly, like you want. BMI1 lzcnt is faster on AMD, but actually counts leading zeros so you do need to subtract it from the type width. Even when GCC uses BSR, it fails to optimize 63 - clz back into just BSR; it flips it twice.)


Note that negative 2's complement integers have their MSB set even though the only significant bits are lower. Are you sure you want a signed type for this?

If so, are you sure you don't want GNU C __builtin_clrsbll? (It returns the number of Leading Redundant Sign Bits in x, i.e. the number of bits following the most significant bit that are identical to it.) There's no single x86 instruction for it, but I assume it can be done efficiently with a bit-scan on ~x, combining the results somehow.
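
To make that definition concrete, a quick sanity check (my addition, GNU C only; the values follow directly from the definition above for 64-bit long long):

#include <assert.h>

int main(void)
{
    assert(__builtin_clrsbll(0)  == 63);  // all 63 bits after the sign bit match it
    assert(__builtin_clrsbll(-1) == 63);
    assert(__builtin_clrsbll(1)  == 62);  // 62 redundant zeros, then the set bit
    assert(__builtin_clrsbll(-2) == 62);
    return 0;
}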

Also, if your original code was intended to be fully portable to all ISO C implementations, I'm not sure it's guaranteed that the sign bit shifts to lower bit positions. I wouldn't expect it for signed right shifts on a sign/magnitude C implementation. (ISO C leaves it up to the implementation whether right shifts on signed integer types are logical or arithmetic; sane / good quality implementations pick arithmetic. With 2's complement integers your code would work either way; you don't care whether it shifts in zeros or copies of the sign bit.)


Many CPUs (not just x86) have bit-scan instructions that do this in one hardware instruction, but it's AFAIK not possible to write portable C that will compile to such an instruction. ISO C hasn't bothered to add standard functions that can use such instructions when they exist, so the only good option is compiler-specific extensions. (Some compilers do recognize popcount loops, but with your loop stopping at 22 instead of 0, it's unlikely that it would fit the pattern for CLZ recognition, if any compilers even look for that.) Some languages are better at this than C; notably Rust has very well designed integer primitives that include bit-scans.

GNU C __builtin_clzll() compiles to a hardware instruction on ISAs that have one, or falls back to calling a library function if not. (IDK how efficient the fallback is; it might use a byte-at-a-time or nibble-at-a-time LUT instead of naive shifting.)

On 32-bit x86, __builtin_clzll uses bsr on the low and high halves and combines the results with cmov or a branch. The pure intrinsics like _BitScanReverse64 and _mm_lzcnt_u64 aren't available in 32-bit mode so you'd have to do that yourself if you use intrinsics instead of GNU C "portable" builtin functions.
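
Doing that yourself might look like this (a sketch of mine, untested; 32-bit MSVC-style, keeping the bit-22 clamp on the low half only, since any set bit in the high half is already above position 22):

#include <intrin.h>

int msb64_clamped_32bit(unsigned long long x)
{
    unsigned long pos;
    if (_BitScanReverse(&pos, (unsigned long)(x >> 32)))
        return (int)pos + 32;                              // set bit in the high half
    _BitScanReverse(&pos, (unsigned long)x | (1UL << 22)); // clamp the low half
    return (int)pos;
}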

32-bit code is not as nice as 64-bit code, but it's still non-looping. (And your loop gets very inefficient: GCC doesn't "think of" trying the high 32 bits in a separate loop before the low 32 bits, so it has to shrd / sar and then cmov based on a bit-test for the shift count being above 32 (Godbolt). Clang still fully unrolls, and does take advantage of only testing the relevant half of the number.)


Since you tagged this SIMD, x86 AVX512CD actually has an instruction for lzcnt on 2, 4, or 8x int64_t elements in one vector register: vplzcntq. The intrinsic is __m512i _mm512_lzcnt_epi64(__m512i a);.

All real CPUs with any AVX512 support have AVX512CD.

On Skylake-X and Ice Lake, it decodes to a single uop with 4 cycle latency, 0.5 clock throughput. (https://uops.info/). (It looks like it runs on the same ports as FMA/mul/add FP instructions, probably using the same hardware that normalizes floating-point mantissas, an operation that also requires finding the MSB.)
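
A sketch of using the intrinsic directly (my illustration; assumes AVX512F + AVX512CD, and reuses the OR-clamp trick from the scalar version):

#include <immintrin.h>

__m512i MSB_position_clamped_8x(__m512i v)                // 8x int64_t per vector
{
    v = _mm512_or_epi64(v, _mm512_set1_epi64(1LL << 22)); // clamp result to >= 22
    __m512i lz = _mm512_lzcnt_epi64(v);                   // vplzcntq
    return _mm512_sub_epi64(_mm512_set1_epi64(63), lz);   // 63 - lzcnt = MSB position
}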

So hopefully GCC and clang can auto-vectorize code that uses __builtin_clzll when you compile with -march=skylake-avx512, or with -march=native on such machines.
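
For example, a simple array loop like this (my sketch) has the shape that could be auto-vectorized that way:

void msb_positions(const long long *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 63 - __builtin_clzll(in[i] | (1LL << 22));  // same clamp trick
}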

Peter Cordes
  • @EricPostpischil: Thanks, fixed. I had a 21 embedded in a surprising number of places; including foolishly the function name; took a while to fix >. – Peter Cordes Feb 21 '20 at 13:14
2

First of all, keep in mind that just because you have found that the processor is spending most of the time on that function snippet, it does not mean that there is a problem with the snippet. Maybe you should try and find out why your code is invoking that snippet so often.

Secondly, since you have come here asking for help, you might as well show us everything you have, instead of a subset of what you have which you believe should be enough for us to figure out what is wrong. Most importantly, you really should show us exactly how your variables are declared, and also exactly where they are declared. Are they function-local? Are they static? Could it be that you have declared something as volatile? Nothing is irrelevant, everything counts.

In any case, if we are to assume that the snippet can be optimized, then I'd say the following:

Your Val1 should not be a long long int, because its values are only in the range between 22 and 63. So, it should be an int instead.

(If for some reason Val1 must end up as a long long int, then copy its starting value into a separate variable of type int before the loop, and use that variable in the loop.)

If you try that, then the compiler might be able to figure out that what you are trying to do is find the first non-zero bit within a range of bits, and replace your entire loop with a single machine instruction.
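
As a sketch, the suggested change is simply this (assuming the same semantics as the original loop):

long long int findval(long long int x)
{
    long long int a = x;
    int Val1;                        // only ever holds 22..63, so int is plenty

    for (Val1 = 63; Val1 > 22; Val1--)
        if (((a >> Val1) & 1) == 1)
            break;

    return Val1;
}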

Mike Nakis
  • The type of `Val1` doesn't matter much since the right operand of a shift doesn't take part in the usual arithmetic conversions. The compiler should be able to optimize that one to an 8 bit type regardless of its declared type. – Lundin Feb 21 '20 at 10:01
  • @Lundin true, but compilers have been known to have quirks, (or bugs if you wish,) where simply using a wrong data type might prevent an optimization from kicking in and replacing a piece of code with an intrinsic. – Mike Nakis Feb 21 '20 at 10:17
  • 1
    True, though on gcc x86 the OP's code results in a `mov eax, 63` so it is working on 32 bits despite the long long. clang does the same, except it unrolls the whole loop. – Lundin Feb 21 '20 at 10:25
  • @Lundin: That doesn't imply much. `mov eax,63` is how you put a small constant into a 64-bit register. Compilers knowing about [x86-64 implicit zero-extension when writing a 32-bit register](https://stackoverflow.com/questions/11177137/why-do-x86-64-instructions-on-32-bit-registers-zero-the-upper-part-of-the-full-6) is basically separate from knowing they can actually optimize to a narrower type in a case where it *wasn't* a compile-time constant. I wouldn't be worried about it on a 64-bit ISA, it's at worst some extra REX prefixes for operand-size on x86-64, but could be worse on 32-bit. – Peter Cordes Feb 21 '20 at 10:49
  • @PeterCordes the OP had initially included an `x64` tag in the post, but it is no longer there. If the OP is in fact using a 32-bit architecture, then yeah, it is pretty obvious why unnecessary use of 64-bit quantities kills performance. – Mike Nakis Feb 21 '20 at 10:53
  • Mike: yeah, I checked and 32-bit builds make nasty code with a 64-bit shift Val1. (`long long` https://godbolt.org/z/UfTLS4 vs. `int` https://godbolt.org/z/vRGg5m, except clang fully unrolls so it just needs a single 32-bit bitmask for each shift count). @Lundin: from the first Godbolt link, we see GCC being braindead and actually doing `add eax, -1` ; `adc edx, -1` to decrement a 64-bit integer inside the loop. – Peter Cordes Feb 21 '20 at 11:09
  • But IMO the rest of this answer is too generic for the question. Yes in theory loop pattern recognition is possible, but it doesn't happen in practice, even with `int`. And is pretty unlikely to with a loop that stops at bit 22, instead of having the exact form that would match a `__builtin_clzll` or `_mm_lzcnt_u64`. It sucks that C doesn't portably expose this widely-supported hardware feature, but if you want performance it means you have to write non-portable code. (Or use Rust which has bitscan, popcnt and rotate operations as primitives on all integer types.) – Peter Cordes Feb 21 '20 at 11:12
1

Warning: I wrote my answer the wrong way (first bit on the right), sorry. Anyway, the approaches can be easily adapted to the MSb.


You can shortcut the process by means of a lookup table. You precompute the index of the rightmost bit for all numbers from 0 to 2^k-1. You will process your number in slices of k bits at a time, and try the slices from right to left, until the slice is nonzero.

An interesting option is to map your long long to an array of eight bytes; each byte then indexes a lookup table of 256 entries. This way, you benefit from direct by-byte addressing.

Processing by shorts is also possible, at the expense of a LUT of 65536 (64K) entries. The optimum might lie in between. There are cache effects.
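
A minimal sketch of the byte-wise variant, adapted here to the MSB as per the warning above (the names and the top-down scan order are illustrative):

#include <stdint.h>

static uint8_t msb_lut[256];            // msb_lut[v] = position of highest set bit in v

static void init_msb_lut(void)
{
    for (int v = 1; v < 256; v++)
        for (int b = 7; b >= 0; b--)
            if (v & (1 << b)) { msb_lut[v] = (uint8_t)b; break; }
}

static int msb_position(uint64_t x)     // returns -1 if x == 0
{
    for (int byte = 7; byte >= 0; byte--) {
        unsigned slice = (unsigned)(x >> (8 * byte)) & 0xFF;
        if (slice)
            return 8 * byte + msb_lut[slice];
    }
    return -1;
}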


Another useful approach is by dichotomy: mask out the 32 high order bits (or load the low int) and test for zero. Then with the nonzero part, mask out the 16 high order bits, and so on. In the end, use the LUT trick. With just 3 steps, you reduce from 64 to 8.

This is appropriate if the distribution of the bit index is uniform. If it is biased towards small values, sequential search can be better anyway.
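
Adapted to the MSB, the dichotomy might look like this (a sketch assuming x != 0, reusing the 256-entry msb_lut from the previous snippet):

static int msb_position_binsearch(uint64_t x)
{
    int pos = 0;
    if (x >> 32) { pos += 32; x >>= 32; }   // nonzero high half? narrow to it
    if (x >> 16) { pos += 16; x >>= 16; }
    if (x >> 8)  { pos += 8;  x >>= 8;  }
    return pos + msb_lut[x];                // x < 256 now; finish with the LUT
}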

  • Yet another option is to iterate byte per byte, then do a nibble-wise look-up. Ever so slightly slower than the 256/64kib alternatives, but only requires a 16 byte look-up. – Lundin Feb 21 '20 at 10:21
  • @Lundin: it is certainly worth comparing versions with different chunk sizes. I don't fear 8-bit LUTs. –  Feb 21 '20 at 10:23
  • I'm working with low-level embedded so the nibble LUT is the most common one I use, as a good compromise between speed and memory use. Though nowadays, 256 byte flash isn't a big deal either on most MCUs I suppose. – Lundin Feb 21 '20 at 10:28
  • 1
    @Lundin: The OP tagged this `[simd]` so I think we can assume that it's not a *tiny* embedded machine. You could in theory use x86 SSSE3 `pshufb` as 16x nibble-LUT lookups in parallel, but you're probably much better off using one scalar `bsr` instruction for 16 nibbles = 8 bytes! – Peter Cordes Feb 21 '20 at 11:15
1

If you can use a GCC intrinsic then you could try something like this:

Please note that this assumes that x is not 0, because the result of __builtin_clzll() is not defined when x is 0.

#include <limits.h>

long long int findval(long long int x)
{
    // Get size of long long in bits
    size_t llsize = sizeof(long long) * CHAR_BIT;

    // Subtract count of leading zeros from size of long long
    return llsize - __builtin_clzll(x);
}
txk2048
  • 1
    Note that if `x == 0` the result of `__builtin_clz` *et al* is undefined. – Paul R Feb 21 '20 at 10:57
  • 3
    @PaulR: useful hack: `__builtin_clzll(x | 1)` so the input always has at least the last bit (in clz order) set. Or better, set bit 22 so you get `max(clz(x), 22)` for free, which the OP wants. Probably worth doing it that way even if you're using x86 `_mm_lzcnt_u64` (which returns `64` for input = 0) – Peter Cordes Feb 21 '20 at 11:20
  • Yes, very nice - probably more efficient than an explicit test for x == 0 prior to the `__builtin_clz` I imagine. – Paul R Feb 21 '20 at 11:31
  • When a bit is found in the range, this returns one more than the original code in the question. When no bit is found in the range, the original code returns 22, but this code returns a lower value (except when the bit is exactly at position 21). – Eric Postpischil Feb 21 '20 at 12:59
  • Just noticed this has a bug: bsr = 63 - clz, not 64 - clz. (The trick to remembering this is to notice that you need to subtract from the max position, not from the type width.) See my answer for a working version that can compile (for x86-64) to just OR / BSR; Yours made a convenient copy/paste start point :) (@PaulR: yes, in cases where merging that special case with the `x=1` case is fine, it's very good and can't mispredict. Especially here where it actually *saves* instructions later by doing the clamping. This way is only 2 single-uop instructions (on Intel), 4 cycles of latency.) – Peter Cordes Feb 21 '20 at 13:02