11

My profiler has identified the following function as the hotspot.

#include <vector>

typedef unsigned short ushort;

bool isInteriorTo( const std::vector<ushort>& point , const ushort* coord , const ushort dim )
{
    for( unsigned i = 0; i < dim; ++i )
    {
        if( point[i + 1] >= coord[i] ) return false;
    }

    return true;  
}

In particular, one assembly instruction, MOVZX (Move with Zero-Extend), is responsible for the bulk of the runtime. The if statement is compiled into

mov     rcx, QWORD PTR [rdi]
lea     r8d, [rax+1]
add     rsi, 2
movzx   r9d, WORD PTR [rsi-2]
mov     rax, r8
cmp     WORD PTR [rcx+r8*2], r9w
jae     .L5

I'd like to coax the compiler out of generating this instruction, but I suppose I first need to understand why it is generated. Why the widening/zero-extension, considering that I'm working with the same data type?

(Find the entire function on godbolt compiler explorer.)

fuz
Olumide
  • Take a look at the gcc 7 output. I'll hazard a guess that the x64 ISA no longer supports moving into 16-bit registers (e.g. `mov dx, 1`), so it must zero-extend the value into a larger register. In your case that's a 64-bit register, but in gcc 7 it's a 32-bit register. It can then compare the lower 16-bit portion of the register with the 16 bits of memory. – djgandy Apr 19 '17 at 09:52
  • @djgandy You can still move into a 16-bit register (e.g. using `mov r9w, word ptr [rsi-2]`), but doing so causes a costly partial-register update, which is to be avoided. `movzx` overwrites the entire register, improving performance. – fuz Apr 19 '17 at 09:54
  • @fuz good to know, and no doubt that's why the compiler would avoid that method. – djgandy Apr 19 '17 at 09:56
  • `Movzx reg32,[mem16]` is a lot faster than `mov reg16,[mem16]`. You should thank the compiler. – Johan Apr 19 '17 at 10:00
  • It is not the instruction that is expensive, it is the memory access. It isn't cached well. Pretty inevitable when the vector is large; there is no simple button you can push other than the one that says "make it smaller". Accessing memory is in general one of the most expensive things a processor has to do, and how many dollars you spend on it matters. DDR4 came down in price surprisingly fast. – Hans Passant Apr 19 '17 at 10:03
  • Johan's comment is mostly correct. `movzx reg32, [mem16]` is actually probably going to be slightly slower than `mov reg16, [mem16]`, at least on Intel processors. But *overall*, `movzx` will be significantly faster because you won't pay the penalty of a partial register stall when you try to use the loaded value. This is why the compiler is generating the `movzx` instruction in the first place. But like others have pointed out, it is the memory access that explains your profiling results. There are a couple of other ways I'd optimize this code if writing it by hand, but not from a compiler. – Cody Gray - on strike Apr 19 '17 at 13:10
  • You can prove to yourself that it is *not* the zero-extension itself that is slow by changing the vector to hold 32-bit `unsigned int` values (sketched just below this comment thread). In this case, the compiler will emit a `MOV` instruction, but since you'll still have to perform the memory access, you'll still see this instruction as the hot-spot in your profiler. – Cody Gray - on strike Apr 19 '17 at 13:21
  • @Olumide, I have added a reference to the authoritative source - A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8. – Maxim Masiutin May 14 '17 at 14:20
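
As a quick aside on the experiment suggested in the comments above, here is a minimal sketch (my illustration, not part of the original question) of the 32-bit variant. With `unsigned` elements the compiler can use a plain `mov` load instead of `movzx`, yet the load, i.e. the memory access, remains the hot-spot:

    #include <vector>

    // Same predicate with 32-bit elements: the load needs no widening,
    // so the compiler emits a plain mov instead of movzx. If the
    // profiler still flags this load, the cost was the memory access
    // all along.
    bool isInteriorTo32(const std::vector<unsigned>& point,
                        const unsigned* coord, const unsigned dim)
    {
        for (unsigned i = 0; i < dim; ++i)
        {
            if (point[i + 1] >= coord[i]) return false;
        }
        return true;
    }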

2 Answers

20

Thank you for the good question!

Clearing Registers and Dependency Breaking Idioms

A Quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero.

Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
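
As a small source-level illustration of these idioms (my sketch, not from the manual; the instructions named in the comments are typical gcc/clang output at -O2 for x86-64 and are not guaranteed):

    // Widening a narrow value writes the full destination register,
    // breaking any dependence on its previous contents.
    unsigned widen(unsigned short s)
    {
        return s;   // typically compiled to: movzx eax, di
    }

    // Zeroing a register uses the xor idiom, which the CPU front end
    // recognizes as dependency-breaking.
    unsigned zero()
    {
        return 0;   // typically compiled to: xor eax, eax
    }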

movzx vs mov

The compiler knows that movzx is not costly and uses it as often as possible. It may take more bytes to encode movzx than mov, but it is not expensive to execute.

Counterintuitively, a program with movzx (which fills the entire register) actually works faster than one with just mov, which sets only the lower part of the register.

Let me demonstrate this conclusion on the following code fragment. It is part of the code that implements CRC-32 calculation using the slicing-by-N algorithm. Here it is:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]
    
    ; ... 6 more similar triplets of movzx, shr, xor skipped ...
    
    dec     <counter register>
    jnz     <start of loop>          ; repeat the whole loop again

Here is the second code fragment. We have cleared ecx in advance, and now, instead of `movzx ecx, bl`, we just do `mov cl, bl`:

    // ecx is already cleared here to 0

    mov     cl, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 2]

    mov     cl, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]
    
    ; ... and so on, as in example #1 ...

Now guess which of the two code fragments above runs faster. Did you think the speed would be the same, or that the movzx version would be slower? In fact, the movzx code is faster, because all CPUs since the Pentium Pro perform out-of-order execution of instructions and register renaming.

Register Renaming

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.

Let me just take the first 4 instructions from the first code fragment:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   ecx, bl
    

As you see, instruction 4 depends on instruction 2. Instruction 4 does not rely on the result of instruction 3.

So the CPU could execute instructions 3 and 4 in parallel (together), but instruction 3 reads (without modifying) the register that instruction 4 overwrites, so instruction 4 may only start executing after instruction 3 fully completes. Let us then rename the register ecx to edx after the first triplet to avoid this dependency:

    movzx   ecx, bl
    shr     ebx, 8
    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]

    movzx   edx, bl
    shr     ebx, 8
    xor     eax, dword ptr [edx * 4 + edi + 1024 * 2]

    movzx   ecx, bl
    shr     ebx, 8
    xor     eax, dword ptr [ecx * 4 + edi + 1024 * 1]

Here is what we have now:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   edx, bl
    

Now instruction 4 does not use any register needed by instruction 3, and vice versa, so instructions 3 and 4 can certainly execute simultaneously!

This is what the CPU does for us. When translating instructions into the micro-operations (micro-ops) that the out-of-order algorithm executes, the CPU renames the registers internally to eliminate these dependencies, so the micro-ops deal with renamed, internal registers rather than with the architectural ones as we know them. Thus we don't need to rename registers ourselves as I have just done in the above example; the CPU automatically renames everything for us while translating instructions into micro-ops.

The micro-ops of instructions 3 and 4 will execute in parallel, since the micro-ops of instruction 4 will deal with an entirely different internal register (exposed to the outside as ecx) than the micro-ops of instruction 3, so we don't need to rename anything ourselves.

Let me revert the code to the initial version. Here it is:

  1.     movzx   ecx, bl
    
  2.     shr     ebx, 8
    
  3.     mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.     movzx   ecx, bl
    

(Instructions 3 and 4 run in parallel because the ecx of instruction 3 is not the same ecx as that of instruction 4, but a different, renamed register; the CPU has automatically allocated, for the micro-ops of instruction 4, a new, fresh register from the pool of internally available registers.)

Now let us go back to movzx vs mov.

Movzx clears a register entirely, so the CPU knows for sure that we do not depend on any previous value that remained in the higher bits of the register. When the CPU sees a movzx instruction, it knows that it can safely rename the register internally and execute the instruction in parallel with previous instructions. Now take the first 4 instructions from our example #2, where we use mov rather than movzx:

  1.    mov     cl, bl
    
  2.    shr     ebx, 8
    
  3.    mov     eax, dword ptr [ecx * 4 + edi + 1024 * 3]
    
  4.    mov     cl, bl
    

In this case, instruction 4, by modifying cl, modifies bits 0-7 of ecx, leaving bits 8-31 unchanged. Thus the CPU cannot just rename the register for instruction 4 and allocate another, fresh register, because instruction 4 depends on the bits 8-31 left over from previous instructions. The CPU has to preserve bits 8-31 before it can execute instruction 4, so it cannot simply rename the register; it will wait until instruction 3 completes before executing instruction 4. Instruction 4 did not become fully independent: it depends on the previous value of ecx and on the previous value of bl, i.e. on two registers at once. If we had used movzx, it would have depended on just one register, bl. Consequently, instructions 3 and 4 do not run in parallel because of their interdependence. Sad but true.

That's why it is always faster to operate on complete registers. Even when we need to modify only a part of a register, it is quicker to alter the entire register (for example, with movzx), to let the CPU know for sure that the register no longer depends on its previous value. Modifying complete registers allows the CPU to rename the register and lets the out-of-order execution algorithm run the instruction together with other instructions, rather than executing them one by one.
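
Applied to the function from the question, this full-register operation is exactly what the compiler already achieves with movzx. As a sketch of my own (reusing the question's signature), you can make the widening explicit at the source level; it should produce essentially the same code, but documents the intent:

    #include <vector>

    typedef unsigned short ushort;

    bool isInteriorTo(const std::vector<ushort>& point,
                      const ushort* coord, const ushort dim)
    {
        for (unsigned i = 0; i < dim; ++i)
        {
            // Widening to unsigned mirrors the movzx the compiler emits:
            // the full register is written, so there is no
            // partial-register dependence on stale upper bits.
            const unsigned p = point[i + 1];
            const unsigned c = coord[i];
            if (p >= c) return false;
        }
        return true;
    }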

Maxim Masiutin
  • that's also the reason [most x64 instructions zero the upper part of a 32-bit register](https://stackoverflow.com/q/11177137/995714): modifying the whole register breaks the dependency chain. – phuclv May 29 '17 at 14:27
  • If `// ecx is already cleared here to 0` is done with `xor ecx,ecx`, both versions will run with the same performance. :P xor-zeroing sets an internal "upper-bytes-zero" flag. It's hard to make a simple example that demonstrates the problem. Maybe try setting the low byte of a register with a byte `mov` vs. `and` and `or`. Like `mov cl, 0x12` vs. `and ecx, 0xffffff00` / `or ecx, 0x12`. – Peter Cordes Jul 14 '17 at 23:00
  • (Sorry I'm picking on all your byte-register answers right now. I was looking for [this recent comment thread](https://stackoverflow.com/a/43812936/224132) after seeing the [KNL quote](https://stackoverflow.com/a/45101480/224132) from the Intel manual in your recent answer.) – Peter Cordes Jul 14 '17 at 23:03
15

The movzx instruction zero-extends a quantity into a register of larger size. In your case, a word (two bytes) is zero-extended into a dword (four bytes). The zero extension itself is usually free; the slow part is loading the memory operand WORD PTR [rsi-2] from RAM.

To speed this up, you can try to ensure that the datum you want to fetch from RAM is already in the L1 cache at the time you need it. You can do this by placing strategic prefetch intrinsics at appropriate places. For example, assuming that one cache line is 64 bytes, you could add a prefetch intrinsic to fetch array entry i + 32 every time you go through the loop, as sketched below.
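
Here is a minimal sketch of that suggestion (my code, not a drop-in fix: it assumes 64-byte cache lines, hence a distance of 32 two-byte elements, and uses the gcc/clang builtin `__builtin_prefetch`; forming a pointer past the end of the array is formally undefined behaviour even though the prefetch itself never faults, and see also the caveats in the comments below):

    #include <vector>

    typedef unsigned short ushort;

    bool isInteriorTo(const std::vector<ushort>& point,
                      const ushort* coord, const ushort dim)
    {
        const ushort* p = point.data();
        for (unsigned i = 0; i < dim; ++i)
        {
            // Purely advisory: ask the hardware to pull in the cache
            // line holding the element 32 positions ahead.
            __builtin_prefetch(p + i + 1 + 32);
            if (p[i + 1] >= coord[i]) return false;
        }
        return true;
    }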

You can also consider an algorithmic improvement such that less data needs to be fetched from memory, but that seems unlikely to be possible.

fuz
  • This is exactly correct. On modern Core i7 processors, MOVZX reg, mem has *identical* latency to MOV reg, mem. – icecreamsword Apr 19 '17 at 17:45
  • That's true on modern Intel processors, but *not* true historically. However, all the way back to the Pentium Pro, you had the significant penalty of partial register stalls, which meant that MOVZX was still a net performance win. The exception was that you could write the code to clear the entire register first (XOR reg, reg), and then load only a lower 16-bit or 8-bit alias. This didn't break the dependency on the PPro (which didn't really do dependency-breaking), but it did on later processors, and was often slightly faster than MOVZX, given that instruction's historically high latency. – Cody Gray - on strike Apr 20 '17 at 06:21
  • And not just latency, but MOVZX has an additional instruction prefix, and historically, the more prefixes, the longer it takes to decode an instruction, which means throughput is decreased also. Anyway, to be nitpicky, saying "modern Core i7 processors" doesn't really mean very much. There's nothing different in the microarchitecture of an i7 versus an i5 or i3 (or even a Pentium or Celeron). What actually matters is the microarchitecture, and even on Skylake, which elides (renames) reg-reg moves, MOVZX is *not* eliminated, so still has some cost, just no significant latency. – Cody Gray - on strike Apr 20 '17 at 06:25
  • @CodyGray: Zero-extending *loads* are a separate thing from the reg-reg form of movzx. `movzx r32, word [mem]` is a pure load, handled by the load port. It's not a micro-fused ALU-movzx + load. This is true according to Agner's tables even on P6 (Pentium II). It's the same as a MOV-load (*if* you avoid partial-reg stalls by only reading the r16 after the load). 0F escape bytes don't count as prefixes for the 1-prefix limit for the simple decoders in P6/PM (before Core2). Silvermont does count 0F, and Agner comments that this is unlike other Intel/AMD CPUs with prefix limits. – Peter Cordes Jul 14 '17 at 21:21
  • Also `movzx r32, r8-low` *is* eliminated on Skylake (unless src and dst are parts of the same register). – Peter Cordes Jul 14 '17 at 21:23
  • Right answer, but wrong suggestion, IMO. Simply adding software prefetch without unrolling to only prefetch once per cache line could make it slower. Hardware prefetch should do well for simple sequential access (and easily keep up with a `word` loop). Maybe the OP is working with multiple short vectors, and that's why they're getting cache misses. Or on pre-IvyBridge, prefetch didn't cross page boundaries, I think. Or maybe this is CPU bottlenecked, and the profiler counts had to go somewhere. – Peter Cordes Jul 14 '17 at 21:28
  • I distinctly remember profiling telling me that `MOVZX` was considerably slower than `MOV` on PPro, PII, and PIII. I blamed the prefix, but maybe that wasn't the culprit. Good point, though, about loads being different from reg-reg. I very likely have blurred those together in my mind. – Cody Gray - on strike Jul 16 '17 at 11:40
  • (Correction: the IvyBridge new feature Intel describes as "next-page prefetching" is *TLB* prefetch. Data prefetch still doesn't necessarily happen from cache lines in another page; the HW prefetcher in L2 works on physical addresses and doesn't want to assume that virtual pages are backed by contiguous physical RAM.) – Peter Cordes May 01 '21 at 10:02