
I have a program that makes heavy use of the intrinsic _BitScanForward / _BitScanForward64 (aka count trailing zeros, TZCNT, CTZ). Rather than relying on the intrinsic alone, I would like the compiled code to use the corresponding CPU instruction (available on Haswell and later).

When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags.

The documentation of _BitScanForward only specifies that the intrinsic is available on all architectures ("x86, ARM, x64, ARM64" or "x64, ARM64"), but I don't just want it to be available, I want to ensure that it compiles down to the CPU instruction rather than a function call. I also checked /Oi, but that doesn't answer the question either.

I also searched the web, but there are curiously few matches for my question; most just explain how to use intrinsics, e.g. this question and this question.

Am I overthinking this, and will MSVC create code that magically uses the CPU instruction if the CPU supports it? Are any flags required? How can I ensure that the CPU instruction is used when available?

UPDATE

Here is what it looks like with Godbolt. Please be nice, my assembly reading skills are pretty basic.

GCC uses tzcnt with haswell/bmi2, otherwise resorts to rep bsf. MSVC uses bsf without rep.
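
For reference, this is the kind of minimal test case I compared (the function name is mine, not from my actual code):

```cpp
// With gcc -O2 -march=haswell (or -mbmi, since tzcnt is part of BMI1),
// this compiles to a tzcnt instruction; with plain -O2, gcc emits rep bsf,
// whose encoding executes as tzcnt on BMI1-capable CPUs and as bsf on older ones.
unsigned ctz(unsigned x) { return __builtin_ctz(x); }
```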

I also found this useful answer, which states that:

  • "Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
  • It explains (as I knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0

This adds the question: why does bsf work for MSVC?

UPDATE

Okay, this was easy, I actually call _BitScanForward for MSVC. Doh!

UPDATE

So I added a bit of unnecessary confusion here. Ideally I would like to use an intrinsic __tzcnt, but that doesn't exist in MSVC, so I resorted to _BitScanForward plus an extra check to account for an input of 0 (sketched below).
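
Roughly, the workaround looks like this (a minimal sketch; the wrapper name ctz32 is mine):

```cpp
#include <intrin.h>

// _BitScanForward leaves the index undefined when the input is 0, so an
// explicit check is needed; returning 32 in that case matches what a
// 32-bit tzcnt instruction would produce.
inline unsigned ctz32(unsigned long x) {
    unsigned long index;
    if (_BitScanForward(&index, x))
        return static_cast<unsigned>(index);
    return 32; // input was 0
}
```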

However, MSVC supports LZCNT, where I have a similar issue (but it is used less in my code).

A slightly updated question would be: how does MSVC deal with LZCNT (instead of TZCNT)?

Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."

The article suggests resorting to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they suggest manually identifying the CPU via __cpuid and then calling either bsr or lzcnt.

In short, MSVC appears to have no support for targeting specific CPU generations (beyond choosing x86/x64/ARM).

TilmannZ
  • What does the assembly (or disassembly) look like of the routine that is using the intrinsic? – Eljay Dec 16 '22 at 21:30
  • You probably mean the [__lzcnt](https://learn.microsoft.com/en-us/cpp/intrinsics/lzcnt16-lzcnt-lzcnt64?view=msvc-170) intrinsic - it always generates the `lzcnt` instruction and will just be decoded incorrectly on older CPUs; the article also explains how to correctly check whether the CPU supports it. _BitScanForward just generates a `bsf` instruction, which has been available since the Intel 386 – dewaffled Dec 16 '22 at 22:06
  • @dewaffled Actually I mean something like `__tzcnt`, but that doesn't appear to exist. – TilmannZ Dec 16 '22 at 22:46
  • `bsf` and `bsr` are hardware instructions since 386. `tzcnt` and `lzcnt` are faster on AMD, and give well-defined results for input=0. But not enough faster to be worth branching every time, only like once outside a loop. GCC and clang typically use `tzcnt` for `__builtin_ctz` even if it's not guaranteed to run as `tzcnt` instead of `bsf`, since it gives the same results for non-zero inputs, and they don't support taking advantage of the AMD-defined semantics of BSF leaving the destination reg unmodified on input = 0. (Intel CPUs do it too, but don't document it.) – Peter Cordes Dec 16 '22 at 23:46

3 Answers


[not exactly answering the question in title, rather trying to solve OP's problem]

If you can afford a reasonably recent compiler and can specify /std:c++20, you can use the standard C++ functions std::countl_zero / std::countl_one / std::countr_zero / std::countr_one; for details, refer to https://en.cppreference.com/w/cpp/header/bit.
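
For example (a minimal sketch; needs /std:c++20 with MSVC, or -std=c++20 with gcc/clang):

```cpp
#include <bit>
#include <cstdint>

// std::countr_zero counts trailing zeros (tzcnt-like); std::countl_zero
// counts leading zeros (lzcnt-like). Unlike bsf/bsr, both are fully
// defined for 0: they return the bit width of the type.
int trailing(std::uint32_t x) { return std::countr_zero(x); } // 32 for x == 0
int leading(std::uint32_t x)  { return std::countl_zero(x); } // 32 for x == 0
```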

From personal experience: these happen to work like a charm across 3 different platforms and 3 different compilers.

No-Bugs Hare
  • But do you know a way to tell MSVC to compile `std::countl_zero` to `lzcnt` instead of checking for non-zero and then `bsr` / `xor reg, 31` (or subtract from 31 if it doesn't know that trick)? According to https://godbolt.org/z/YTaxdKofz, it seems MSVC will do that with `-arch:AVX2` (in practice there are no CPUs with AVX2 but not BMI1, I'm pretty sure). MSVC doesn't understand ` -arch:x86-64-v3` and makes code that checks an ISA-version global at run-time, which might be not bad since the `bsr` code path also has to branch on the input being non-zero. – Peter Cordes Aug 15 '23 at 05:58
  • usually in these situations, I edit the title to represent the question better, and then look for an actual canonical Q&A for the question I set out to answer, or if it doesn't exist, create it. – starball Aug 15 '23 at 07:29
  • @PeterCordes are you 100% sure it will work faster on those platforms where compiler doesn't do it? If so - there is always an option to open a ticket against MSVC team. In general, personally I am trying to trust the compiler writers (as a big fat rule of thumb, they tend to know much better about the CPUs than mere mortals do - and BTW, having different sequences being optimal on different architectures is extremely common). – No-Bugs Hare Aug 21 '23 at 07:35
  • Where compiler doesn't do what? I'm sure `lzcnt` is faster than `bsr`/`xor` on all existing x86 CPUs, and AFAIK the only way to get MSVC to emit it for `countl_zero` is to compile with `-arch:AVX2`. There might be other ways; I don't know MSVC as well as others. A lot faster on AMD (see https://uops.info/ and https://agner.org/optimize/). And if the compiler can't prove the input is non-zero, it needs even more extra instructions than that. On Intel CPUs, `lzcnt` performs about the same as `bsr`, but on Skylake and later without a false dependency. (For `bsr` it's a true dependency). – Peter Cordes Aug 21 '23 at 15:46
  • I've filed several missed-optimization bugs over the years with GCC and clang, and have been asked for advice on what's fast a couple times by clang developers. In terms of knowing what's fast in asm, I and a few other CPU-architecture experts that post here on Stack Overflow (e.g. BeeOnRope, Andreas Abel, harold, and some others) are at or beyond the knowledge level of the average compiler developer, for the ISAs we know about. – Peter Cordes Aug 21 '23 at 15:50
  • And even though some compiler devs know architectures pretty well, compilers themselves often don't make optimal code. e.g. see [Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?](https://stackoverflow.com/q/40354978) and [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325) for analysis of why compiler-generated code is sub-optimal in those cases, and what would be faster. I might be just some random guy on the Internet, but there's my evidence I know what I'm talking about, since you asked. – Peter Cordes Aug 21 '23 at 15:51

There's no way to target specifically Haswell, but there is a way to assume AVX2 availability (/arch:AVX2), which also implies availability of tzcnt and the other BMI1 and BMI2 instructions, along with FMA3. And there's a way to tune for a wide set of architectures (the /favor option; unfortunately only three broad groups are available: common Intel, common AMD, Intel Atom).

There's no reason why _BitScanForward / _BitScanReverse wouldn't generate tzcnt / lzcnt under /arch:AVX2, other than a missed optimization opportunity. For example, the _mm_set1_* intrinsics do generate different code under different /arch options. Even explicitly named intrinsics are not always used to generate the instruction they imply. For instance, _mm_load_si128 may be a no-op (folded into a subsequent operation as a memory operand instead of a register load), and _mm_extract_pi16 with a zero argument may be just a movd.
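
To illustrate the folding point, a small sketch (the function name is mine; compile with optimizations enabled, e.g. cl /O2 /arch:AVX2):

```cpp
#include <immintrin.h>

// The "load" intrinsic typically does not survive as a separate vmovdqa:
// the compiler folds it into the memory operand of the addition, e.g.
// vpaddd xmm0, xmm0, XMMWORD PTR [rdx] under /arch:AVX2.
__m128i add_from_memory(__m128i a, const __m128i* p) {
    __m128i b = _mm_load_si128(p); // may disappear into the vpaddd encoding
    return _mm_add_epi32(a, b);
}
```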

I've created a Developer Community issue about this missed optimization. For now, you can use the <bit> functions, which do runtime detection by default, but use lzcnt / tzcnt unconditionally with /arch:AVX2.

(There's virtually no way to avoid using AVX2 and BMI instructions with /arch:AVX2 in an optimized build, as there's always some room for auto-vectorization, so _BitScanReverse cannot be "portable to legacy processors" with /arch:AVX2 and above anyway.)

Alex Guteniev
  • In a recent question you used `-mavx2` with GCC/Clang. The equivalent of MSVC's `/arch:AVX2` in modern GCC/Clang is `-march=x86-64-v3` which enables AVX2+FMA+BMI1+BMI2, i.e. a Haswell baseline. https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels – Peter Cordes Aug 18 '23 at 06:09

As I posted above, MSVC doesn't appear to have support for targeting specific CPU generations (beyond choosing x86/x64/ARM).

This article says: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."

The article suggests resorting to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead, they suggest manually identifying the CPU via __cpuid and then calling either bsr or lzcnt depending on the result, as sketched below.
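
A sketch of what that manual detection could look like (the wrapper names are mine; bit positions per the CPUID documentation, and a full implementation would first verify that leaf 0x80000001 is supported):

```cpp
#include <intrin.h>

// LZCNT support is reported in CPUID leaf 0x80000001, ECX bit 5 (the ABM flag).
inline bool cpu_has_lzcnt() {
    int regs[4];
    __cpuid(regs, 0x80000001);
    return (regs[2] & (1 << 5)) != 0;
}

// Dispatch once: use __lzcnt where supported, otherwise fall back to
// _BitScanReverse (bsr) and convert its bit index into a leading-zero count.
inline unsigned leading_zeros32(unsigned long x) {
    static const bool has_lzcnt = cpu_has_lzcnt();
    if (has_lzcnt)
        return __lzcnt(static_cast<unsigned int>(x)); // defined for 0 (returns 32)
    unsigned long idx;
    if (_BitScanReverse(&idx, x))
        return 31u - idx;
    return 32; // x == 0
}
```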

UPDATE

As @dewaffled pointed out, there are indeed _tzcnt_u32 / _tzcnt_u64 in the x64 intrinsics list.
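
A minimal usage sketch (this intrinsic always emits the tzcnt encoding, which, per the discussion above, silently executes as bsf on pre-BMI1 CPUs):

```cpp
#include <immintrin.h>

unsigned tz(unsigned x) {
    // Compiles to a single tzcnt instruction; well-defined for 0 (returns 32).
    return _tzcnt_u32(x);
}
```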

I got misled by looking at the Alphabetical listing of intrinsic functions on the left side of the pane. I wonder whether there is a distinction between "intrinsics" and "intrinsic functions", i.e. _tzcnt_u64 is an intrinsic but not an intrinsic function.

TilmannZ
  • There are `_tzcnt_u32` / `_tzcnt_u64` if you need those. – dewaffled Dec 16 '22 at 23:24
  • `_tzcnt_u64` will compile to a single instruction, never a function call. IDK what Intel means by "intrinsic function"; perhaps some SVML library functions like `_mm_sin_ps` where there's no hardware instruction, it's only provided in Intel's math library. – Peter Cordes Dec 16 '22 at 23:48
  • But the same is true for __lzcnt: I understand it will never compile to a function call, yet it is listed in the alphabetical list of intrinsic _functions_. I am also not sure whether "function" really means complex code or whether a single CPU instruction also counts as a minimal "function"... – TilmannZ Dec 16 '22 at 23:57