Confusion about bsr and lzcnt

Question

I'm a bit confused about both instructions. First let's discard the special case when the scanned value is 0 and the undefined/bsr or bitsize/lzcnt result - this difference is clear and not part of my question.

Let's take the binary value 0001 1111 1111 1111 1111 1111 1111 1111

According to Intel's spec the result for lzcnt is 3

According to Intel's spec the result for bsr is 28

lzcnt counts, bsr returns the index or distance from bit 0 (which is the lsb).

How can both instructions be the same and how can lzcnt be emulated as bsr in case there's no BMI on the CPU available? Or is bit 0 in case of bsr the msb? Both "code operations" in Intel's spec are different too, one counts or indexes from the left, the other from the right.

Maybe someone can shed some light on this, I have no CPU without BMI/lzcnt instruction to test if the fallback to bsr works with the same result (as the special case of value 0 to scan never happens).

score 17 · Accepted Answer · answered Sep 05 '14 at 10:28

17

LZCNT gives the number of leading zero bits. BSR gives the bit index of the most significant 1 bit. So they do effectively the same thing for the non-zero case, except the result is interpreted differently. Therefore you can just subtract the BSR result from 31 to get the same behaviour as with LZCNT, i.e. LZCNT == (31 - BSR).

answered Sep 05 '14 at 10:28

Paul R

208,748
37
389
560

Now you confuse me even more. Sorry. If the result is different, then they do effectively not the same thing. They look for the same bit, but in a program I rely on the result. And if the result is different for lzcnt and lzcnt-as-bsr-emulation-on-non-bmi-cpu then I can't use it and must test for BMI. (special doesn't matter, scanned value is never 0). So why does lzcnt fallback to bsr at all? – hopperpl Sep 05 '14 at 10:42
There is no "fallback" - they are two different instructions - you would need to either use conditional assembly, or do a run-time check and maybe implement some kind of dispatcher, so that you either use `LZCNT` or `BSR` + `SUB`. – Paul R Sep 05 '14 at 10:59
7

Now I got it. On CPUs without BMI the lzcnt instruction does not #ud but executes bsr instead. With a different result than lzcnt. I misinterpreted the spec here. Disregarding the special case when the scanned value is zero, tzcnt and bsf behave the same, but lzcnt and bsr do not. And a lot of webpages say that both are. Which is obviously incorrect and the reason for my confusion. Thanks for clearing that up. – hopperpl Sep 05 '14 at 19:24
Sorry - I think I may have been wrong after all - I was just re-reading the Intel docs and it looks like there may be some internal mechanism for translating LZCNT to BSR when LZCNT is not available, but the manual does not make much sense on this. You'll just have to experiment. – Paul R Sep 08 '14 at 10:07
7

By the way (and late), you can also implement it as `LZCNT = BSR ^ 31`, which saves an instruction – harold Jul 01 '15 at 10:55
1

@harold This is actually how GCC's `__builtin_clz` is implemented. If you do `__builtin_clz(x) ^ 31`, it yields a single `BSR` since the XORs cancel out. – minmaxavg Oct 03 '18 at 21:17

BeeOnRope · Answer 2 · 2017-09-09T02:00:42.067

12

To be clear, there is no working fallback from lzcnt to bsr. What happened is that Intel used the previously redundant sequence rep bsr to encode the new lzcnt instruction. Using a redudant rep prefix for bsr was generally defined to be ignored, but with the caveat that it may decode differently on future CPUs¹.

So if you happen to execute lzcnt on a CPU that doesn't support it, it will execute as bsr. Of course, this fallback is not exactly intentional, and it gives the wrong result (as Paul R points out they look at the same bit but report it differently): it is just a consequence of the way the new instruction was encoded and how pointless rep prefixes were treated by prior CPUs. So the world fallback is pretty much entirely inappropriate for lzcnt and bsr.

The situation is more subtle for the case of tzcnt and bsf. It uses the same encoding trick: tzcnt has the same encoding as rep bsf, but here the "fallback" mostly works since tzcnt returns the same value as bsf for all inputs except zero. For zero inputs tzcnt returns 32, but bsf leaves the destination undefined.

You can't really use even this fallback though: if you never have zero inputs you might as well just use bsf, saving a byte and being compatible with a couple decades of CPUs, and if you do have zero inputs the behavior differs.

So the behavior is perhaps better classified as trivia than a fallback...

¹ Normally this would more or less be esoterica, but you could for example use rep prefixes are where they have no functional effect to lengthen instructions to help align subsequent code without inserting explicit nop instructions. Given the "may decode differently in the future" this would be dangerous when compiling code which may run on any future CPU.

edited Sep 09 '17 at 02:00

answered Apr 17 '17 at 00:47

BeeOnRope

60,350
16
207
386

I believe the usual "undefined" behaviour of BSF and BSR is to leave the destination unmodified when the source operand is 0. In that case, even when the source operand is guaranteed not be 0, it would be better to use TZCNT or LZCNT ^ 31 over BSF or BSR respectively because the newer instruction wouldn't have a dependency on the old value of the source register. Or at least in theory, apparently this doesn't work in practice: http://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter – Ross Ridge Apr 17 '17 at 01:44
Indeed, several of the two operand instructions that actually overwrite their first operand (rather than also using it as an input) have this false dependency (see also `popcnt`). Perhaps it will be fixed in a future architecture. AFAIK not all chips implemented the behavior as "source unchanged" - apparently some AMD chips set it to zero or something. @RossRidge – BeeOnRope Apr 17 '17 at 01:47
AMD's AMD64 manual ([March 2017 version](http://support.amd.com/TechDocs/24594.pdf)) explicitly documents the dest-unchanged behaviour.: *If the second operand contains 0, the instruction sets ZF to 1 **and does not change the contents of the destination register**.* Fun fact: @RossRidge: Skylake fixes the false-dep for `tz/lzcnt`, but not `popcnt`. (IIRC, there's a SKL erratum for popcnt having a false dep, so maybe they meant to. :P) – Peter Cordes Sep 09 '17 at 01:37
`rep nop` is `pause`. The padding prefixes usually used with NOP are `66` operand-size. I *think* Intel says something about CPUs ignoring prefixes that don't apply to an instruction. They point out that future CPUs could decode it differently, which is exactly what happens here, but the "compatibility" with old CPUs is fairly well defined. For this reason, some compilers (with `-mtune=generic`) will use `tzcnt` instead of `bsf` when the input is known non-zero, because `tzcnt` is much faster on AMD. I suspect that setting flags from the input operand instead of output is what hurts them. – Peter Cordes Sep 09 '17 at 01:41
@PeterCordes - good point. I updated the answer to indicate that Intel and AMD apparently allowed the redundant `rep` prefixes, but kept open the option of decoding them differently in the future - which made them effectively useless for compile-to-native code that might even run on another CPU (but perhaps still useful for JIT or AOT compiler code), since you could never be sure your binary wouldn't break. Lately it seems they are being stricter with their prefix behavior, usually defining exact semantics and #GPing if not followed... – BeeOnRope Sep 09 '17 at 02:02
... which probably helps decoding in addition to preventing people from intentionally using non-portable behavior at the machine-code level. – BeeOnRope Sep 09 '17 at 02:03
@BeeOnRope: Yeah, for fields within VEX. I think in hardware it's actually easier to just ignore bits you don't need. Modern transistor budgets mean they can spend them being strict now to make sure they don't get abused and result in needing to support some more crap and not be able to use them for future extensions. And yeah, there are few cases where ignored-prefixes are safe to use, except already-widespread stuff like `rep ret` padding for AMD pre-Bulldozer branch predictors. It's entrenched thanks to gcc enabling at `-mtune=generic`. https://stackoverflow.com/a/32347393/224132 – Peter Cordes Sep 09 '17 at 02:07
@PeterCordes I'm not really referring to fields within prefixes, but the prefixes and their order themselves. It is hugely beneficial for the decoders to know that a certain type of prefix which might change the interpretation of the rest of the instruction appears in a specific position, or that there can be only one prefix that modifies a certain behavior rather than several with some type of resolution rules. For other types of prefixes with more minor impacts, yeah it might be slightly easier to ignore multiples, but that effect is probably very small. – BeeOnRope Sep 09 '17 at 02:10
... the most efficient to decode, however, is probably when you put a bunch of constraints, but then leave anything other than that "undefined" rather than requiring a fault. Then you pretty much have complete implementation flexibility, including allowing invalid sequences to decode as "what you'd expect" or something else entirely, or triggering a fault, etc. Basically the C/C++ approach to hardware :) – BeeOnRope Sep 09 '17 at 02:13
Oh right, like only one REX or only one VEX. The historical behaviour comes from when prefixes were decoded one at a time. The last prefix of a type was the one that left its value in that part of the decoder. So yeah, for parallel decoding it's probably easier to fault on multiples than select the right REX prefix. Decoders already have to be able to signal #UD. I agree it's probably good they didn't leave a lot of stuff undefined but happens-to-work, because testing on current hardware is often all some people do. – Peter Cordes Sep 09 '17 at 02:15
Exactly. So we can say that the old rules were based on what made the old uarch easy/fast and the new rules are what made the new uarch easy/fast. I like the new rules for their their inherent strictness and lack of redundancy (probably a side effect, not really a design goal). – BeeOnRope Sep 09 '17 at 02:17
My claim that x86 manuals say non-applicable prefixes should be ignore may be wrong. Intel's vol.2 manual (in section 2.1.1 prefixes) currently says "*Use of repeat prefixes and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior.*", with similar language for other prefixes. Working on an answer for https://stackoverflow.com/questions/46188388/what-happens-when-you-use-a-memory-override-prefix-but-all-the-operands-are-regi which asks this directly. – Peter Cordes Sep 13 '17 at 04:33

Confusion about bsr and lzcnt

2 Answers2

Linked