4

The following code works fine in debug mode, since _BitScanReverse64 is defined to return 0 if no Bit is set. Citing MSDN: (The return value is) "Nonzero if Index was set, or 0 if no set bits were found."

If I compile this code in release mode it still works, but if I enable compiler optimizations, such as \O1 or \O2 the index is not zero and the assert() fails.

#include <iostream>
#include <cassert>

using namespace std;

int main()
{
  unsigned long index = 0;
  _BitScanReverse64(&index, 0x0ull);

  cout << index << endl;

  assert(index == 0);

  return 0;
}

Is this the intended behaviour ? I am using Visual Studio Community 2015, Version 14.0.25431.01 Update 3. (I left cout in, so that the variable index is not deleted during optimization). Also is there an efficient workaround or should I just not use this compiler intrinsic directly?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Camleon
  • 113
  • 9
  • 1
    Initializing index right away doesnt make a difference here, since _BitScanReverse does the initialization. – Camleon Dec 27 '16 at 20:11
  • By saying "the program crashes" you mean the `assert` fails, right? It's been a while since I've used VS, but shouldn't the `assert` be disabled on release builds? – Eran Dec 27 '16 at 20:16
  • Yes, the assert fails. I think it should, but it somehow isn't :D I tried it without the assertion too and the index is still not zero – Camleon Dec 27 '16 at 20:19
  • 1
    Microsoft's page on _BitScanReverse64 does not say that `index` will be set if no bit is set; it only says `index` will be set with a value if a bit is set. So, if no bit is set, _BitScanReverse64 could just be leaving `index` alone and it's left with whatever it originally had. – Altainia Dec 27 '16 at 20:21
  • I have cited MSDN. https://technet.microsoft.com/en-us/library/fbxyd7zd which page are you referring to? – Camleon Dec 27 '16 at 20:27
  • 1
    @Altainia has the right answer imo. You should check the return value of `_BitScanReverse64`. If it's zero, the value of `index` is undefined (at least in the docs). Assuming so because only a non-zero return indicates `index` was set. – Eran Dec 27 '16 at 20:29
  • Ok, that is strange. Can you give me a link, so I can see it for myself? – Camleon Dec 27 '16 at 20:31
  • And why does it never fail when I dont optimize? – Camleon Dec 27 '16 at 20:32
  • @Altainia But `index` is set to `0` now. – juanchopanza Dec 27 '16 at 20:32
  • 1
    `index` is initialized to 0, but `_BitScanReverse64` might be changing it internally. The unoptimized version is possibly setting `index` to zero if no 1 is found, but the optimized version omits that part to save time. Both are in line with the docs (_"Nonzero if Index was set, or 0 if no set bits were found."_) and the optimized version does less hence is faster. – Eran Dec 27 '16 at 20:36
  • If the optimized version would omit the initialization of index, why is it not zero afterwards? And why is the faster version still in line with the documentation if index is 3193563 afterwards? And why is it always a quasi random number? Sorry, I still dont get it :( – Camleon Dec 27 '16 at 20:44
  • Oh, I think I got it!, Thank you, eran and Altainia! – Camleon Dec 27 '16 at 20:55
  • Good. @Altainia, care to make your comment a proper answer? – Eran Dec 27 '16 at 20:57

1 Answers1

11

AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.


unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)
Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting the ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD's BSF entry in AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...

On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.


IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (Why does breaking the "output dependency" of LZCNT matter?)


Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.

Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).


You need to check it to make sure index is valid, unless you know the input was non-zero. e.g.

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}

If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    The reason people say it is semi-documented is because AMD documents it. See [here](https://support.amd.com/TechDocs/24594.pdf), p. 112 (p 148 in the PDF). (I can't find an x86-32 manual from AMD, but I'd swear it said something similar.) – Cody Gray - on strike Dec 28 '16 at 13:52
  • Aside from that, your speculations about how the intrinsic works in MSVC are precisely correct. I've spent quite a bit of time playing with it, [discovered a few bugs (only for 32-bit builds)](https://connect.microsoft.com/VisualStudio/feedback/details/3101349/bitscanforward-intrinsic-always-results-in-generation-of-pointless-store-to-memory), and tried to figure out ways to take advantage of the "semi-documented" behavior. Unfortunately, there is no way other than inline asm. Also, you should beware that sub-optimal code generation is possible if you do not check the intrinsic's return value! – Cody Gray - on strike Dec 28 '16 at 14:01
  • 1
    Unfortunately, even the latest version of MSVC (VS 2015 Update 3) doesn't work around the false dependency in popcnt/lzcnt/tzcnt when you use the corresponding intrinsics. Haven't filed a bug/feature-request for that yet. They have ignored all of the bugs I've submitted so far. – Cody Gray - on strike Dec 28 '16 at 14:04
  • @CodyGray: gcc works around the false dep for popcnt/lzcnt/tzcnt, but not for bsf/bsr. /facepalm. :( I'm in the middle of typing up a bug report for that. Thanks for the links, especially to AMD's docs. – Peter Cordes Dec 29 '16 at 00:08