For some reason I thought this was tagged x86 and/or x86-64 at some point. My GNU C answer works with any ISA, but I focused on x86-specific intrinsics for MSVC, and how it compiles for x86 with GCC/clang. Unfortunately there isn't a fully portable way to do this efficiently, so it's definitely worth doing some #ifdef
to take advantage of HW support for this operation on targets you care about.
It seems you want max(22, 63 - clz(x)), where clz is some count-leading-zeros function, e.g. __builtin_clzll() in GNU C. 63 - clz(x) is the position of the MSB, when long long = int64_t like it does on x86.

Your Val1 > 22 loop condition becomes false at Val1 = 22, so that's the non-break way out of the loop if no set bit is found by then.

__builtin_clzll has undefined behaviour when its input is zero (so it can compile to 63 - bsr on x86). We can handle this, and the lower bound of 22, by setting that bit in the input before running a bit-scan.
#include <limits.h>

inline int MSB_position_clamped (long long x)
{
    int maxpos = CHAR_BIT * sizeof(x) - 1;
    x |= 1LL << 22;                       // avoid x==0 UB and make the result at least 22
    return maxpos - __builtin_clzll(x);
}
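As a quick sanity check of the clamping behaviour, something like this (untested) works, assuming the function above is in the same file; with plain C99 inline you'll want optimization enabled, or static inline, so the call doesn't need an out-of-line definition:

#include <assert.h>

int main(void)
{
    assert(MSB_position_clamped(0) == 22);           // no set bit at all: clamped to 22
    assert(MSB_position_clamped(1) == 22);           // MSB below bit 22: still clamped
    assert(MSB_position_clamped(1LL << 40) == 40);   // MSB above bit 22: real position
    return 0;
}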
For MSVC, you'd want _BitScanReverse64 (slower on AMD) or 63 - _mm_lzcnt_u64 (requires BMI1). The _mm intrinsic version is available on all x86-64 compilers.
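An untested MSVC-flavoured sketch of the same idea, assuming 64-bit mode where _BitScanReverse64 is available (the function name here is mine, just for illustration):

#include <intrin.h>

static inline int MSB_position_clamped_msvc (unsigned long long x)
{
    unsigned long idx;
    x |= 1ULL << 22;                // avoid the x==0 case and make the result at least 22
    _BitScanReverse64(&idx, x);     // always finds a bit, because bit 22 is set
    return (int)idx;
    // With lzcnt support you could instead return 63 - (int)__lzcnt64(x).
}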
(As Mike points out, shift counts only need to be int. Wider shift counts are not helpful, especially when compiling for 32-bit machines where long long takes 2 registers.)
The GNU C version compiles efficiently for x86-64, especially with clang (Godbolt). We'd also expect it to inline efficiently to these 2 instructions.
# clang 9.0 -O3 for x86-64 System V
MSB_position_clamped:
or rdi, 4194304
bsr rax, rdi
ret
(x86 legacy bit-scan instructions find the bit-position directly, like you want. BMI1 lzcnt is faster on AMD, but actually counts leading zeros, so you need 63 - lzcnt to get the bit position. Even when GCC uses BSR, it fails to optimize 63 - clz back into just BSR; it flips it twice.)
Note that negative 2's complement integers have their MSB set even though the only significant bits are lower. Are you sure you want a signed type for this? If so, are you sure you don't want GNU C __builtin_clrsbll? (It returns the number of leading redundant sign bits in x, i.e. the number of bits following the most significant bit that are identical to it.) There's no single x86 instruction for it, but I assume it does it efficiently with a bit-scan on ~x, combined somehow.
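If that is what you want, here's an untested sketch of how it maps onto a bit position (the helper name is mine; the 62 comes straight from the definition: the first bit below the sign bit that can differ from it is bit 62):

// Position of the highest bit that differs from the sign bit (GNU C only).
// Gives -1 for x == 0 and x == -1, where no bit differs from the sign.
static inline int MSB_significant_position (long long x)
{
    return 62 - __builtin_clrsbll(x);   // 63 - 1 - clrsb: first bit below the sign that differs
    // A clamp to 22 would still need a separate max() here.
}

For example, both 1 and -2 give 0, and both 0x400000 and ~0x400000 give 22.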
Also, if your original code was intended to be fully portable to all ISO C implementations, I'm not sure it's guaranteed that the sign bit shifts to lower bit positions. I wouldn't expect it for signed right shifts on a sign/magnitude C implementation. (ISO C leaves it up to the implementation whether right shifts on signed integer types are logical or arithmetic; sane / good quality implementations pick arithmetic. With 2's complement integers your code would work either way; you don't care whether it shifts in zeros or copies of the sign bit.)
Many CPUs (not just x86) have bit-scan instructions that do this in one hardware instruction, but AFAIK it's not possible to write portable C that will reliably compile to such an instruction. ISO C hasn't bothered to add standard functions that can use such instructions when they exist, so the only good option is compiler-specific extensions. (Some compilers do recognize popcount loops, but with your loop stopping at 22 instead of 0, it's unlikely to fit the pattern for CLZ recognition, if any compilers even look for that.) Some languages are better at this than C; notably Rust has very well designed integer primitives that include bit-scans.
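As an example of the #ifdef approach mentioned at the top, here's an untested sketch combining the GNU C builtin and the MSVC intrinsic from above, with a plain-C shift loop as the last resort (the name and the exact feature tests are mine, not a canonical recipe):

#include <limits.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline int MSB_position_clamped_any (unsigned long long x)
{
    x |= 1ULL << 22;                     // clamp the result to >= 22 and avoid the zero case
#if defined(__GNUC__) || defined(__clang__)
    return (CHAR_BIT * (int)sizeof(x) - 1) - __builtin_clzll(x);
#elif defined(_MSC_VER) && defined(_M_X64)
    unsigned long idx;
    _BitScanReverse64(&idx, x);
    return (int)idx;
#else
    int pos = 0;                         // portable fallback: simple shift loop
    while (x >>= 1)
        pos++;
    return pos;
#endif
}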
GNU C __builtin_clzll() compiles to a hardware instruction on ISAs that have one, or falls back to calling a library function if not. (IDK how efficient the fallback is; it might use a byte- or nibble-at-a-time LUT instead of naive shifting.)
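For reference, a nibble-at-a-time LUT version might look roughly like this (just a sketch of the idea, not what libgcc actually does):

// Count leading zeros of a 64-bit value, 4 bits at a time.
static inline int clz64_fallback (unsigned long long x)
{
    static const unsigned char nibble_clz[16] =   // leading zeros within a 4-bit value
        { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
    int count = 0;
    for (int shift = 60; shift >= 0; shift -= 4) {
        unsigned nib = (unsigned)(x >> shift) & 0xF;
        count += nibble_clz[nib];
        if (nib)               // stop at the first non-zero nibble
            return count;
    }
    return count;              // x == 0: all 64 bits are leading zeros
}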
On 32-bit x86, __builtin_clzll uses bsr on the low and high halves and combines the results with cmov or a branch. The pure intrinsics like _BitScanReverse64 and _mm_lzcnt_u64 aren't available in 32-bit mode, so you'd have to do that split yourself if you use intrinsics instead of GNU C "portable" builtin functions.
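An untested sketch of that manual split for 32-bit MSVC, scanning the high half first with the 32-bit _BitScanReverse (the function name is mine):

#include <intrin.h>

static inline int MSB_position_clamped_x86_32 (unsigned long long x)
{
    unsigned long idx;
    x |= 1ULL << 22;                                   // clamp to >= 22; the low half is never zero
    if (_BitScanReverse(&idx, (unsigned long)(x >> 32)))
        return 32 + (int)idx;                          // a bit was set in the high half
    _BitScanReverse(&idx, (unsigned long)x);           // always succeeds: bit 22 is set
    return (int)idx;
}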
32-bit code is not as nice as 64-bit code, but it's still non-looping. (And your loop gets very inefficient: GCC doesn't "think of" trying the high 32 bits in a separate loop before the low 32 bits, so it has to shrd / sar and then cmov based on a bit-test for the shift count being above 32 (Godbolt). Clang still fully unrolls, and does take advantage of only testing the relevant half of the number.)
Since you tagged this SIMD: x86 AVX512CD actually has an instruction for lzcnt on 2, 4, or 8x int64_t elements in one vector register: vplzcntq. The intrinsic is __m512i _mm512_lzcnt_epi64(__m512i a).
All real CPUs with any AVX512 support have AVX512CD.
On Skylake-X and Ice Lake, it decodes to a single uop with 4 cycle latency, 0.5 clock throughput. (https://uops.info/). (It looks like it runs on the same ports as FMA/mul/add FP instructions, probably using the same hardware that normalizes floating-point mantissas, an operation that also requires finding the MSB.)
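An untested sketch of using it directly with intrinsics, assuming AVX-512 is available and n is a multiple of 8 (the loop structure and names are mine):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Clamped MSB position for 8 int64_t elements per iteration (AVX512F + AVX512CD).
static void msb_positions_avx512 (const int64_t *in, int64_t *out, size_t n)
{
    const __m512i clamp_bit = _mm512_set1_epi64(1LL << 22);
    const __m512i all_63    = _mm512_set1_epi64(63);
    for (size_t i = 0; i < n; i += 8) {
        __m512i v  = _mm512_loadu_si512(&in[i]);
        v = _mm512_or_si512(v, clamp_bit);              // same clamp trick as the scalar version
        __m512i lz = _mm512_lzcnt_epi64(v);             // vplzcntq
        _mm512_storeu_si512(&out[i], _mm512_sub_epi64(all_63, lz));   // 63 - lzcnt = MSB position
    }
}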
So hopefully GCC and clang can auto-vectorize code that uses __builtin_clzll when you compile with -march=skylake-avx512, or with -march=native on such machines.
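For example, a scalar loop like the following (using the clamped-MSB function from the top of this answer, storing 64-bit results) is the kind of thing you'd hope gets vectorized with vplzcntq at -O3 -march=skylake-avx512; whether it actually does is worth checking on Godbolt for your compiler version:

#include <stddef.h>

// Candidate for auto-vectorization with vplzcntq when AVX512CD is enabled.
void msb_positions (const long long *in, long long *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = MSB_position_clamped(in[i]);
}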