15

Compiling the following code in MSVC 2013, 64-bit release build, /O2 optimization:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

I get the following code - which has a really cool optimization using a 64-bit register as a lookup table with the bt (bit test) instruction.

    mov     rcx, 17596481020928             ; 0000100100002400H
    npad    5
$LL82@myFunc:
    movzx   eax, BYTE PTR [rsi]
    cmp     al, 44                          ; 0000002cH
    ja      SHORT $LN81@myFunc
    movsx   rax, al
    bt      rcx, rax
    jae     SHORT $LN81@myFunc
    inc     rsi
    jmp     SHORT $LL82@myFunc
$LN81@myFunc:
    ; code after loop...

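In C++ terms, the constant is just a bitmap of the four character codes, and the loop is roughly equivalent to this sketch (my own paraphrase of the assembly, not compiler output; the function name skip is made up for illustration):

    #include <cstdint>

    // 1-bits at positions '\n' (10), '\r' (13), ' ' (32) and ',' (44)
    constexpr uint64_t mask =
        (1ULL << '\n') | (1ULL << '\r') | (1ULL << ' ') | (1ULL << ',');
    static_assert(mask == 0x0000100100002400ULL, "same constant the compiler loads into rcx");

    const char *skip(const char *s) {
        for (;;) {
            unsigned char c = *s;              // movzx eax, BYTE PTR [rsi]
            if (c > 44) break;                 // cmp al, 44 / ja
            if (!((mask >> c) & 1)) break;     // bt rcx, rax / jae
            ++s;                               // inc rsi
        }
        return s;
    }
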
But my question is: what is the purpose of the movsx rax, al after the first branch?

First we load a byte from the string and zero-extend it into eax (writing a 32-bit register also clears the upper 32 bits of rax):

movzx eax, BYTE PTR [rsi]

Then the cmp/ja pair performs an unsigned comparison between al and 44, and branches forwards if al is greater.

So now, we know 0 <= al <= 44 in unsigned numbers. Therefore, the highest bit of al could not possibly be set!
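
In other words, for any value in [0, 44] the sign bit of al is clear, so sign-extending and zero-extending it give identical results. A tiny sketch of that claim (my own illustration, not from the original code):

    #include <cassert>
    #include <cstdint>

    int main() {
        for (int v = 0; v <= 44; ++v) {
            int8_t b = static_cast<int8_t>(v);
            uint64_t sext = static_cast<uint64_t>(static_cast<int64_t>(b)); // what movsx rax, al computes
            uint64_t zext = static_cast<uint64_t>(static_cast<uint8_t>(b)); // what a movzx would compute
            assert(sext == zext); // identical for 0..44, so the movsx cannot change rax here
        }
    }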

Nonetheless, the next instruction is movsx rax, al. This is a sign-extended move. But since:

  • al is the lowest byte of rax
  • we already know the other 7 bytes of rax are zeroed
  • we just proved that al's highest bit could not possibly be set

this movsx must be a no-op.

Why does MSVC do it? I'm assuming it's not for padding, since in that case another npad would make the meaning clearer. Is it flushing data dependencies or something?

(By the way, this bt optimization really makes me happy. Some interesting facts: it runs in 0.6x the time of the 4 cmp/je pairs you might expect, it's way faster than strspn or std::string::find_first_not_of, and it only happens in 64-bit builds even if the characters of interest have values under 32.)
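
If you want to reproduce the timing comparison, a rough sketch along these lines works (entirely my own benchmark scaffolding, not from the question; numbers will vary by machine and compiler):

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <string>

    static const char *skip_loop(const char *s) {
        while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') ++s;
        return s;
    }

    int main() {
        std::string input(1 << 24, ' ');   // a long run of skippable characters...
        input.back() = 'x';                // ...ended by something that stops both versions

        auto t0 = std::chrono::steady_clock::now();
        const char *a = skip_loop(input.c_str());
        auto t1 = std::chrono::steady_clock::now();
        const char *b = input.c_str() + std::strspn(input.c_str(), " ,\r\n");
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::microseconds;
        printf("loop:   %lld us, stopped at '%c'\n",
               (long long)std::chrono::duration_cast<us>(t1 - t0).count(), *a);
        printf("strspn: %lld us, stopped at '%c'\n",
               (long long)std::chrono::duration_cast<us>(t2 - t1).count(), *b);
    }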

japreiss
  • It's been a while since I dove into asm (too long in fact), but "we already know the other 7 bytes are zeroed" - You know this *how*? If they weren't before the `movsx`, they certainly are after that instruction. Doesn't the `movzx eax` prior instruction only zero-extend to 32 bits, while the `movsx rax` instruction sign-extends to 64 bits (i.e. the remaining 4 bytes)? Apologies in advance if that isn't correct; as I said, it's been a while. – WhozCraig Sep 30 '14 at 15:52
  • I am 95% sure that all instructions with 32-bit destination registers implicitly zero the upper 32 bits in 64-bit code. Looking for an authoritative source now. – japreiss Sep 30 '14 at 16:04
  • @WhozCraig writing to a 32bit register automatically zeroes out the highest 32bits. (also, it doesn't even matter: only the bottom 6 bits of `rax` are relevant) – harold Sep 30 '14 at 16:05
  • This appears to have something to do with the lower bound at 0. Change *s == ' ' to *s == 'A' and the instruction disappears :) – Hans Passant Sep 30 '14 at 16:27
  • Interesting, Hans. So this loop is in a function `char const *advance(char const *s)` that does nothing else besides return `s` at the end. If I make your change, the instruction goes away in the standalone `PROC` - but in the place where it's inlined inside a larger algorithm, the `movsx` is still there! – japreiss Sep 30 '14 at 17:22
  • Have you benchmarked both versions? There are apparently a number of subtle CPU bugs which cause false dependencies, and sometimes an extra statement can kill that dependency, causing the longer code to run faster - the extra instruction runs in negative time! – MSalters Sep 30 '14 at 17:40
  • I'm not 100% sure but I believe that instruction is being inserted by the back-end for partial register stalling purposes. This is extremely architecture-dependent and compiler heuristics probably think this is the best way to avoid it. – Marco A. Oct 02 '14 at 22:36
  • @MarcoA. -- avoiding a partial reg stall would be the only logical explanation for this not being considered a missed-optimization bug, yes. Do you mind expanding this into an answer? (If you don't feel comfortable tackling that topic, I can write something up instead.) – LThode Jan 22 '15 at 14:41
  • I don't understand why [ICC 16 and 17 don't use this interesting trick](https://gcc.godbolt.org/) even though Clang and GCC (either for x86 or other architectures) can all apply this optimization. – phuclv Nov 08 '16 at 16:36

3 Answers

17

You surely recognize that this optimization was produced by very specific code in the optimizer that looked for the pattern. Just the generation of the bit-mask gives it away. Yes, nice trick.

There are two basic codegen cases here. The first one is the more general one, where charmax - charmin < 64 but charmax >= 64. The optimizer needs to generate different code from what you saw: it needs to subtract charmin first. That version does not have the MOVSX instruction. You can see it by replacing *s == ' ' with *s == 'A'.
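
In C-ish terms, the general case described here looks roughly like this (my own illustration with made-up characters 'A', 'M' and 'Z'; not actual compiler output):

    #include <cstdint>

    // charmin = 'A' (65), so indices are rebased before the bit test
    constexpr uint64_t mask =
        (1ULL << ('A' - 'A')) | (1ULL << ('M' - 'A')) | (1ULL << ('Z' - 'A'));

    bool in_set(unsigned char c) {
        unsigned idx = unsigned(c) - 'A';   // the extra SUB that the special case avoids
        if (idx > 'Z' - 'A') return false;  // one unsigned compare covers both bounds
        return (mask >> idx) & 1;           // the BT
    }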

Then there's the special case that you tested, where all character codes to test happen to be < 64. The Microsoft programmer did deal with this in his code: he made sure not to generate a silly SUB EAX,0 instruction. But he overlooked that generating the MOVSX wasn't necessary either. Surely it was missed by only checking for optimal code in the general case. And it is a general function call in the code, so it's easy to overlook; note how the instruction changes to MOVZX when you compile with /J (which makes char unsigned). Otherwise the extension is easily deemed necessary: there is no BT instruction that takes an 8-bit register as the 2nd operand, so the AL register load isn't enough by itself.

There could be a hypothetical post-optimizer optimizer that optimizes the optimized code generated by the optimizer, and that decided to keep the MOVSX to improve superscalar execution. I seriously doubt it exists.

Hans Passant
7

As Hans Passant already mentioned, your code is a special case of the optimization. I did not look into the compiler/optimizer sources, so I can only say what I expect to happen. However, here is my explanation for this.

Your code in C / C++:

while (*s == ' ' || *s == ',' || *s == '\r' || *s == '\n') {
    ++s;
}

is equivalent to:

while (*s == 32 || *s == 44 || *s == 13 || *s == 10) {
    ++s;
}

or:

while (*s == 10 || *s == 13 || *s == 32 || *s == 44) {
    ++s;
}

The optimizer detects that there is an "if" with multiple (>= 3) comparisons of the same character. Now the optimizer generates as many 64-bit patterns as needed (up to 4 patterns for an 8-bit char, 4 * 64 => all 256 possible values). Each of these bit patterns has an offset (= the test value to which the first bit in the pattern corresponds) that needs to be subtracted from the value under test. In your case only one 64-bit pattern is needed, for the values ranging from 10 to 44. So your code gets converted into some generic code like this:

#define ranged_bittest(pattern, val, min_val, max_val) \
  ((val) >= (min_val) && (val) <= (max_val) && BT_with_offset((pattern), (val), (min_val)))

while (ranged_bittest(<BITPATTERN>, *s, 10, 44)) {
    ++s;
}
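
(For concreteness: BT_with_offset isn't defined anywhere above; one way to read it, in my own words rather than anything from the compiler, is as a rebased bit test like this:)

    #include <cstdint>

    // Test bit (value - min_val) of the 64-bit pattern.
    static inline bool BT_with_offset(uint64_t pattern, unsigned value, unsigned min_val) {
        return (pattern >> (value - min_val)) & 1;
    }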

Now some other optimization pass picks up ranged_bittest(<BITPATTERN>, *s, 10, 44): it detects a bit test with start offset 10 and a pattern that can safely be shifted left by 10 bits (as the upper 10 bits of the pattern are zero). So ranged_bittest(<BITPATTERN>, *s, 10, 44) => ranged_bittest(<BITPATTERN> << 10, *s, 0, 44).

This evolves to (writing val for *s):

while (val >= 0 && val <= 44 && BT_with_offset(<SHIFTEDBITPATTERN>, val, 0)) {
    ++s;
}

The val >= 0 && gets optimized away (as it is assumed to be always true), and the "while" is translated into a "do" loop with breaks (which is better for branch prediction in most cases), resulting in:

do {
  if (val > 44) break;
  if (!BT_with_offset(<SHIFTEDBITPATTERN>, val, 0)) break;

  ++s;
} while (true);

For whatever reason, the optimization stops here (e.g. because there is a limit on how often further optimizations are applied to a code fragment, for compile-time performance reasons).

Now, in assembly, the line

if (!BT_with_offset(<SHIFTEDBITPATTERN>, val, 0)) break;

is translated to something like:

MOV <reg64_A>, const_pattern
MOV <reg64_B>, value
SUB <reg64_B>, const_offset
BT <reg64_A>, <reg64_B>

The assembly optimizer reduces:

MOV <reg64_B>, value
SUB <reg64_B>, 0

to just

MOVSX <reg64_B>, value

in your special case this is:

MOVSX rax, al

The assembly optimizer (somehow) does not have enough information to completely eliminate the sign extension. Maybe the assembly optimizer is quite "dumb" (it can't do any logical-expression optimizations, just simple replacements), or the full optimization level isn't activated at that stage.

It therefore can't remove the movsx rax, al, as it doesn't have any meta-information about the possible values of rax or al at this point in the code.

I don't KNOW if this is true, but this is my best guess for this case...

SDwarfs
4

What struck me most when I first saw this code is how poorly it is in fact optimized.
Yes, it's a neat trick to use a 64-bit register as a lookup table, but ...

  • Why still use INC as opposed to the better ADD reg, 1?
  • Why CMP ... JA ... MOVSX ... when we know that the BT instruction uses its source operand MOD 64 if its destination operand is a register (so there are 58 bits that need not be considered)?
  • Why not benefit from the fact that conditional branches are predicted to go backwards?
  • Why not reduce the number of branches?

I think a true assembler programmer would have written it more like this:

  mov     rcx, 0FFFFEFFEFFFFDBFFh  ;~0000100100002400h
  sub     rsi, 1
  npad    1
$LL82@myFunc:
  add     rsi, 1
  movzx   eax, BYTE PTR [rsi]   ;mov al,[rsi]
  test    al, 11000000b
  setz    bl
  test    bl, bl
  bt      rcx, rax
  ja      SHORT $LL82@myFunc

ja jumps if (CF or ZF) = 0

  • ZF=1 for AL=[64,255] and don't care about CF
  • ZF=0 for AL=[0,63] and CF=0 for AL={10,13,32,44}

For all ASCII codes in the range [64,255], the test al, 11000000b instruction gives a non-zero result (ZF=0). Because the setz bl / test bl, bl combo then flips ZF to 1, there's no longer any chance for the ja instruction to continue the loop.
Conversely, for all ASCII codes in the range [0,63], ZF ends up being 0, thus allowing ja to fully act on the CF produced by the bt rcx, rax instruction.
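
(A quick brute-force check of that flag logic; this is my own verification sketch, not part of the original answer. It simulates ZF and CF for all 256 byte values and compares the ja decision against the original loop condition:)

    #include <cstdint>
    #include <cstdio>

    int main() {
        const uint64_t inverted = ~0x0000100100002400ULL;  // 0-bits only at positions 10, 13, 32, 44
        for (int c = 0; c < 256; ++c) {
            bool zf = (c & 0xC0) != 0;              // ZF after test al,11000000b / setz bl / test bl,bl
            bool cf = (inverted >> (c & 63)) & 1;   // CF from bt rcx, rax (bit index taken mod 64)
            bool loops = !cf && !zf;                // ja taken: stay in the loop
            bool wanted = (c == ' ' || c == ',' || c == '\r' || c == '\n');
            if (loops != wanted) printf("mismatch at %d\n", c);
        }
        puts("no output above means the flag logic matches");
    }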

Perhaps we are expecting too much from our optimizing compilers?

Sep Roland
  • The `CMP/JA` can't be removed. For example, if `c == 'J' == 0x4A`, then the low 6 bits of `c` are `0xA == '\n'`, so we'd get the wrong answer. – japreiss Oct 07 '14 at 15:32
  • @japreiss: It can be done without the first CMP/JA. Therefore, "JB" must be used (as user3144770 did) AND the pattern needs to be inverted (which was not done). Let's assume "bt rcx, rax" loads bit "rax" of "rcx" into "CF" and sets CF to zero in case rax is out of range (i.e. > 63). Then "JB" (= JUMP if "CF == 1") breaks the loop for all "out of range" cases. All bits in the pattern then would need to be 0 except the bits 10, 13, 32 and 44. – SDwarfs Oct 09 '14 at 15:12
  • why would *"and sets CF to zero in case rax is out of range"* be a valid assumption? – japreiss Oct 09 '14 at 15:34
  • @japreiss: It depends on the definition of the instruction set. There are most probably exact definitions of what happens in those cases. But maybe "BT reg64_x, reg64_y" also behaves like "BT reg64_x, (reg64_y % 64)" (meaning only the lower 6 bits of reg64_y are used to address the bit within reg64_x). In that case the above code of course doesn't work... Should have written "maybe can be done" ;) – SDwarfs Oct 09 '14 at 16:08
  • PS: Had a look at some definition of "BT" for the 386 CPU telling "[...] If the bit base operand specifies a register, the instruction takes the modulo 16 or 32 (depending on the register size) of the bit offset operand, allowing any bit position to be selected in a 16- or 32-bit register, respectively (see Figure 3-1)[...]" (http://faydoc.tripod.com/cpu/bt.htm). So, you are correct. The above code is unfortunately wrong. – SDwarfs Oct 09 '14 at 16:13
  • PPS: But `MOV rbx, rax` `AND rbx, 192` `BT rcx, rax` `JA @loop` would probably work in combination with a negated pattern (all 1 but 0 for bits 10, 13, 32, 44). Because `MOV rbx, rax` `AND rbx, 192` set ZF=0 (zeroflag), if value <= 63. BT only changes the CF flag. And `JA @loop` jumps if (CF==0 and ZF==0). This would be two more instructions but none of them is a branch instruction. Still faster than the compiler's solution in the original question. – SDwarfs Oct 09 '14 at 16:40
  • actually Intel compiler also emit inc and dec sometimes – phuclv Feb 03 '15 at 05:34
  • The premise that `add 1` is better than `inc` is almost wrong nowadays. ["gcc did a pretty good job here. It could save one code byte by using inc edx instead of add edx, 1, because nobody cares about P4 and its false-dependencies for partial-flag-modifying instructions"](http://stackoverflow.com/a/40355466/995714) – phuclv Nov 08 '16 at 15:51
  • `INC` is shorter, so it might make a difference when cache misses are a concern. ["Some modern compilers (including clang-3.8, and Intel's ICC 13) do use inc when optimizing for speed (-O3), not just for size, when they don't need to check the carry flag afterwards. This saves 1 byte (64bit mode), or 2 bytes (inc r32 short form in 32bit mode), which makes a small percentage difference in total code size."](http://stackoverflow.com/a/36510865/995714) – phuclv Nov 08 '16 at 15:52
  • @SDwarfs: That AND/BT/JA idea is good on some CPUs (like Intel SnB-family). On earlier Intel (PPro to Nehalem), you will partial-flag stall from reading ZF and CF when the last flag-setting instruction only wrote CF. And BTW, you don't need MOV, you just need `test al, 192` (after a `movzx` load). See http://agner.org/optimize/. It's pretty reasonable for a compiler to emit a `cmp`/`jcc .out_of_loop` before the `bt` check. But Sep is right: a better structured loop could use a jcc as the loop branch instead of a JMP. (please fix the code in the answer so I can upvote :) – Peter Cordes Oct 11 '17 at 22:38
  • @PeterCordes I (finally) fixed the code in this answer. I used the suggestions by SDwarfs but found that the ZF was just the opposite of what was needed. – Sep Roland Oct 15 '17 at 17:21
  • Hmm. I think I'd go for `not eax` / `test al, 11000000b`. Or just `cmp / jnz` + `bt / jnc`, because modern CPUs can macro-fuse cmp/j[n]z into a single uop. (Haswell and later can run a predicted-not-taken branch and a predicted-taken branch in the same cycle on different ports, too, so the loop could bottleneck on just the front-end at 1.25 cycles per iteration for 5 fused-domain uops.) – Peter Cordes Oct 15 '17 at 22:19
  • Pre-Haswell, yeah combining into one `ja` might be worth it. NOT/TEST would make it 5 ALU uops (6 total uops) for Sandybridge/Ivybridge, which only has 3 ALU ports and one JCC per clock throughput. So this could run at 5/3 cycles per iter, vs. 2c for bottlenecking on 2 JCCs, or 2c for yours which bottlenecks on 6 ALU uops per iter. – Peter Cordes Oct 15 '17 at 22:21
  • I haven't looked at how this would run on AMD, and I don't know the cost of flag-merging on AMD. Which reminds me, Sandybridge would need a flag-merging uop here, which brings it up to 6 ALU uops (or your version up to 7), so it's not a win there unless it improves branch prediction. Silvermont I think always decodes `bt` to 2 uops, one of them being a flag-merge, so maybe worth it there because it doesn't macro-fuse jcc. – Peter Cordes Oct 15 '17 at 22:30
  • @PeterCordes the problem with the "two branch" idea is that the `< 64` branch is likely to be quite poorly predicted since the interesting ASCII/UTF-8 characters lie on both sides of that boundary. Of course, it depends on the input, but for "usual" text data it's probably going to be quite unpredictable and harmful for performance. Some possibilities I can think of are choosing the `bt` lhs from a 4-element LUT based on the two high bits of the char, or even a 256-byte LUT (asymptotically faster, larger) or 256-bit LUT (slower, smaller). – BeeOnRope Oct 16 '17 at 00:36
  • If you're willing to [check alignment](https://stackoverflow.com/q/37800739/149138) this is trivially vectorizable with 4 `cmpb` and some `or`. If you want to stick to GP regs, you could use the "any byte equal to N" [bithack](https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord) applied 4 times, against 8 bytes, which would likely be faster than the above (only a few ALU ops per byte with high IPC), although then you need additional code on loop exit to determine the actual byte within the word. – BeeOnRope Oct 16 '17 at 00:40
  • Another interesting type of idea is to search for some type of hash function which maps the 4 characters uniquely to some easily testable state. In an extreme example, if the 4 characters differed only in bits 1 and 2, like `0b11001, 0b11011, 0b11101, 0b11111`, such a hash function would simply be `v & 0b11111001`, which maps only those 4 bytes to `0b11001`, which you can simply check with a single test. Of course, the function for these 4 characters isn't so simple, but perhaps one can be found. – BeeOnRope Oct 16 '17 at 00:44
  • You can relax the requirements a bit too: false positives (i.e., bytes other than the chosen 4 mapping to the same value) are OK if you add an additional, slower check on exit. Such false positives could be very rare (especially if you know you're dealing with ASCII text and pick the false positive bytes to be characters that rarely appear) - so you'd only mostly pay the price of the 1-off accurate check on loop exit and not much mispredict penalty due to the false positives. – BeeOnRope Oct 16 '17 at 00:46
  • @BeeOnRope: Good point about mispredicts with separate branches, I'd forgotten what exactly this loop was doing. So probably `not` / `test` / `bt` / `ja` would do well for pure scalar. (Worth destroying `al` and having to reload it on loop exit if you expect many iterations.) SSE4.2 `pcmpistri` might do well here looking for 4 different values. – Peter Cordes Oct 16 '17 at 00:59
  • re: partial flags shenanigans: Broadwell and later will run this without any flag-merging uops: [What is a Partial Flag Stall?](https://stackoverflow.com/q/49867597) - `ja` simply has two separate input dependencies, one on CF and one on the other flags (SPAZO). I don't know about AMD. – Peter Cordes Jan 29 '23 at 16:58