6

I'm running a Core i7 3930k, which is of the Sandy Bridge microarchitecture. When executing the following code (compiled under MSVC19, VS2015), the results surprised me (see in comments):

int wmain(int argc, wchar_t* argv[])
{
    uint64_t r = 0b1110'0000'0000'0000ULL;
    uint64_t tzcnt = _tzcnt_u64(r);
    cout << tzcnt << endl; // prints 13

    int info[4]{};
    __cpuidex(info, 7, 0);
    int ebx = info[1];
    cout << bitset<32>(ebx) << endl; // prints 32 zeros (including the bmi1 bit)

    return 0;
}

Disassembly shows that the tzcnt instruction is indeed emitted from the intrinsic:

    uint64_t r = 0b1110'0000'0000'0000ULL;
00007FF64B44877F 48 C7 45 08 00 E0 00 00 mov         qword ptr [r],0E000h  
    uint64_t tzcnt = _tzcnt_u64(r);
00007FF64B448787 F3 48 0F BC 45 08    tzcnt       rax,qword ptr [r]  
00007FF64B44878D 48 89 45 28          mov         qword ptr [tzcnt],rax  

How come I'm not getting an #UD invalid opcode exception, the instruction functions correctly, and the CPU reports that it does not support the aforementioned instruction?

Could this be some weird microcode revision that contains an implementation for the instruction but doesn't report support for it (and others included in bmi1)?

I haven't checked the rest of the bmi1 instructions, but I'm wondering how common a phenomenon this is.

Johan
  • 74,508
  • 24
  • 191
  • 319
DoomMuffins
  • 1,174
  • 1
  • 9
  • 19
  • 1
    From the [Instruction Set Reference](http://www.felixcloutier.com/x86/LZCNT.html): _LZCNT differs from BSR. For example, LZCNT will produce the operand size when the input operand is zero. **It should be noted that on processors that do not support LZCNT, the instruction byte encoding is executed as BSR.**_ – Michael Petch May 10 '17 at 02:26
  • @Michael Petch you wrote about the wrong instruction but it seems that what you wrote applies to `TZCNT` and `BSF` as well. – DoomMuffins May 10 '17 at 06:17
  • Yes, sorry I quickly glanced at the question. The same thing applies to TZCNT and BSF as you have discovered. – Michael Petch May 10 '17 at 06:30
  • 1
    The "good" news is that `tzcnt` at least agrees with `bsf` for all values where `bsf` is defined. They differ in behavior only for a zero input, where `bsf` is undefined, and `tzcnt` returns 32 or 64 (for 32-bit or 64-bit inputs, respectively). `lzcnt` on the hand hand returns totally different results (essentially `31 - bsr`). – BeeOnRope May 10 '17 at 23:37

1 Answers1

6

The reason that Sandy Bridge (and earlier) processors seem to support lzcnt and tzcnt is that both instructions have a backward compatible encoding.

lzcnt eax,eax  = rep bsr eax,eax
tzcnt eax,eax  = rep bsf eax,eax

On older processors the rep prefix is simply ignored.

So much for the good news.
The bad news is that the semantics of both versions are different.

lzcnt eax,zero => eax = 32, CF=1, ZF=0  
bsr eax,zero   => eax = undefined, ZF=1
lzcnt eax,0xFFFFFFFF => eax=0, CF=0, ZF=1   //dest=number of msb leading zeros
bsr eax,0xFFFFFFFF => eax=31, ZF=0        //dest = bit index of highest set bit


tzcnt eax,zero => eax = 32, CF=1, ZF=0
bsf eax,zero   => eax = undefined, ZF=1
tzcnt eax,0xFFFFFFFF => eax=0, CF=0, ZF=1   //dest=number of lsb trailing zeros
bsf eax,0xFFFFFFFF => eax=0, ZF=0        //dest = bit index of lowest set bit

At least bsf and tzcnt generate the same output when source <> 0. bsr and lzcnt do not agree on that.
Also lzcnt and tzcnt execute much faster than bsr/bsf.
It totally sucks that bsf and tzcnt cannot agree on the flag usage. This needless inconsistancy means that I cannot use tzcnt as a drop-in replacement for bsf unless I can be sure its source is non-zero.

Johan
  • 74,508
  • 24
  • 191
  • 319
  • Only AMD CPUs execute lzcnt/tzcnt faster than `bsr`/`bsf`. (Also, Intel Skylake makes lz/tzcnt have no output dependency, which they did on previous uarches. `popcnt` still has an output dependency on Skylake.) BSR/BSF always have a dep on the output, because they leave the output unmodified for input=0. (AMD documents this behaviour, Intel implements it but says "undefined" in their manual). – Peter Cordes Aug 13 '17 at 04:33
  • 1
    In any case where you know the input is non-zero, you can still use `tzcnt` as a drop-in replacement without knowing whether it will decode as `tzcnt` or `bsf`. Compilers actually do this. – Peter Cordes Aug 13 '17 at 04:33
  • 1
    My guess is that AMD can't set flags based on the *input* of a single uop. Maybe the way it handles ZF is hard-wired to set it based on the results of ALU uops, so BSF/BSR needing to set it based on the inputs is just not compatible. (And that may be why they made the flag-result incompatible when defining `lzcnt` / `tzcnt`, since I think it was AMD that introduced them first.) I agree it would be nice if `tzcnt` set flags the same as `bsf`, so you could use it in more cases when it might run as `bsf`, but that might have stopped AMD from making it (as) fast. – Peter Cordes Aug 17 '17 at 06:50
  • @PeterCordes/Johan: Does this mean using `tzcnt` intrinsics on a CPU that doesn't advertise it is actually dangerous? i.e. Do you know if compilers account for the fact that older CPUs would run it without modifying the flag? Meaning it could result in a subtle bug in instructions that expect a particular flag value? – user541686 Jan 27 '23 at 03:46
  • @user541686: The intrinsic only gives access to the integer output, not the return value. But yes, if you use the intrinsic and depend on getting `32` or `64` for an input of 0, the failure mode for running on a CPU that doesn't support `tzcnt` is silent wrong answers, rather than the usual SIGILL (illegal instruction fault.) GCC/clang don't let you use intrinsics for instruction-sets you haven't told them the target supports, e.g. `-march=haswell`. On MSVC/ICC, it's up to you not to do that. – Peter Cordes Jan 27 '23 at 05:11
  • @user541686: I haven't seen compilers make asm that uses the FLAGS output from `bsf` or `tzcnt`, or code that intentionally runs `bsf` with input = 0. `__builtin_ctz` intentionally gives an undefined result for input=0. Gcc/clang *do* know that `tzcnt` is a drop-in replacement for `bsf` when the input=0 case doesn't matter or can't happen, e.g. using `rep bsf` so it'll run faster on AMD, same speed on Intel, unless you use `-march=` some specific Intel CPU. – Peter Cordes Jan 27 '23 at 05:14
  • @PeterCordes: Ah. The carry flag difference is what I was worried about on older CPUs, but yeah, it doesn't seem like it could break anything. Thanks. – user541686 Jan 27 '23 at 06:02
  • 1
    @PeterCordes - `clang` seems to be able to use both CF and ZF after `tzcnt`. [A couple of examples.](https://gcc.godbolt.org/z/qjbKPa9b5). – BeeOnRope Jan 27 '23 at 13:02
  • @BeeOnRope: Oh cool. I guess my memory was based mostly on GCC. Clang is also able to use the FLAGS output from `bsf`. https://gcc.godbolt.org/z/q5K1767TK (Even with tune=generic clang doesn't use `rep bsf` aka `tzcnt` when the input=0 case doesn't matter, so writing code that gets clang to use the FLAGS output of `bsf` doesn't lock you in to missing out on `tzcnt` performance on AMD, since clang would always miss that opportunity.) – Peter Cordes Jan 27 '23 at 14:45