I have a program that makes heavy use of the intrinsic _BitScanForward / _BitScanForward64 (aka count trailing zeros, TZCNT, CTZ).
I would like to not use the intrinsic but instead use the corresponding CPU instruction (available on Haswell and later).
When using gcc or clang (where the intrinsic is called __builtin_ctz), I can achieve this by specifying either -march=haswell or -mbmi2 as compiler flags (strictly speaking, TZCNT belongs to the BMI1 extension, which -mbmi enables).
The documentation of _BitScanForward only specifies that the intrinsic is available on all architectures "x86, ARM, x64, ARM64" or "x64, ARM64", but I don't just want it to be available, I want to ensure it is compiled to use the CPU instruction instead of the intrinsic function. I also checked /Oi but that doesn't explain it either.
I also searched the web but there are curiously few matches for my question, most just explain how to use intrinsics, e.g. this question and this question.
Am I overthinking this, and will MSVC generate code that magically uses the CPU instruction if the CPU supports it? Are any flags required? How can I ensure that the CPU instructions are used when available?
UPDATE
Here is what it looks like with Godbolt. Please be nice, my assembly reading skills are pretty basic.
GCC uses tzcnt with haswell/bmi2, otherwise resorts to rep bsf.
MSVC uses bsf without rep.
I also found this useful answer, which states that:
- "Using a redundant rep prefix for bsr was generally defined to be ignored [...]". I wonder whether the same is true for bsf?
- It explains (as I knew) that bsf is not the same as tzcnt; however, MSVC doesn't appear to check for input == 0.
This raises a further question: why does bsf work for MSVC?
UPDATE
Okay, this was easy: I actually call _BitScanForward for MSVC. Doh!
UPDATE
So I added a bit of unnecessary confusion here. Ideally I would like to use an intrinsic __tzcnt, but that doesn't exist in MSVC, so I resorted to _BitScanForward plus an extra check to account for 0 input.
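For reference, the "_BitScanForward plus an extra check" approach looks roughly like the sketch below. The name ctz32 is made up for illustration, and returning 32 for a zero input is my choice to mirror what TZCNT does; it is not something the intrinsic gives you.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Hypothetical helper: count trailing zeros of a 32-bit value.
   Returns 32 for x == 0, which matches TZCNT's result for a zero input. */
static inline unsigned ctz32(uint32_t x)
{
    if (x == 0)
        return 32u;                     /* BSF leaves its destination undefined here */
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanForward(&index, x);         /* index of the lowest set bit */
    return (unsigned)index;
#else
    return (unsigned)__builtin_ctz(x);  /* undefined for 0, hence the guard above */
#endif
}
```

With this, ctz32(8) is 3 and ctz32(0) is 32 on both compilers, regardless of whether bsf or tzcnt ends up in the generated code.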
However, MSVC supports LZCNT, where I have a similar issue (but it is used less in my code).
Slightly updated question would be: How does MSVC deal with LZCNT (instead of TZCNT)?
Answer: see here. Specifically: "On Intel processors that don't support the lzcnt instruction, the instruction byte encoding is executed as bsr (bit scan reverse). If code portability is a concern, consider use of the _BitScanReverse intrinsic instead."
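Why this matters: lzcnt and bsr do not return the same number. For a nonzero 32-bit input, bsr gives the index of the highest set bit, which is 31 minus the leading-zero count, so an lzcnt that silently runs as bsr yields wrong results rather than a crash. A minimal sketch of a leading-zero count built only on bit scan reverse (bsr32 and clz32_portable are names I made up, and returning 32 for zero is my choice to match LZCNT):

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

/* Hypothetical helper: index of the highest set bit, i.e. what BSR computes.
   Precondition: x != 0. */
static inline unsigned bsr32(uint32_t x)
{
#if defined(_MSC_VER)
    unsigned long index;
    _BitScanReverse(&index, x);
    return (unsigned)index;
#else
    return 31u - (unsigned)__builtin_clz(x);
#endif
}

/* Leading-zero count with LZCNT semantics (32 for a zero input),
   expressed through the BSR result so it does not need the lzcnt instruction. */
static inline unsigned clz32_portable(uint32_t x)
{
    return x == 0 ? 32u : 31u - bsr32(x);
}
```

For x = 1, bsr32 returns 0 while the leading-zero count is 31, which is exactly the mismatch the quoted warning is about.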
The article suggests falling back to bsr if older CPUs are a concern. To me, this implies that there is no compiler flag to control this; instead they suggest manually identifying the __cpu and then calling either bsr or lzcnt.
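A rough sketch of what such a manual check could look like, assuming LZCNT support is reported in CPUID leaf 0x80000001, ECX bit 5. The helper names cpu_has_lzcnt and clz32 are hypothetical, and the non-MSVC branch simply leaves the instruction choice to the compiler via __builtin_clz.

```c
#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical helper: 1 if the CPU reports LZCNT support
   (CPUID leaf 0x80000001, ECX bit 5), 0 otherwise or if unknown. */
static int cpu_has_lzcnt(void)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, (int)0x80000000);          /* highest extended leaf */
    if ((unsigned)regs[0] < 0x80000001u)
        return 0;
    __cpuid(regs, (int)0x80000001);
    return (regs[2] >> 5) & 1;
#elif defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;                            /* leaf not available */
    return (int)((ecx >> 5) & 1u);
#else
    return 0;                                /* not an x86 target */
#endif
}

/* Hypothetical dispatcher: leading-zero count, 32 for a zero input. */
static unsigned clz32(uint32_t x)
{
    if (x == 0)
        return 32u;
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    static int has_lzcnt = -1;               /* cached; not synchronized */
    if (has_lzcnt < 0)
        has_lzcnt = cpu_has_lzcnt();
    if (has_lzcnt)
        return __lzcnt(x);
    unsigned long index;
    _BitScanReverse(&index, x);              /* fallback for older CPUs */
    return 31u - (unsigned)index;
#else
    return (unsigned)__builtin_clz(x);       /* compiler picks the instruction */
#endif
}
```

A per-call branch like this probably eats much of the benefit in a hot loop, so in practice it may make more sense to do the check once and pick between two versions of the whole routine.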
In short, as far as I can tell, MSVC offers no way to target a specific CPU generation for these instructions comparable to -march in GCC/clang (beyond choosing x86/x64/ARM).