Summary: 16-bit instructions are not the problem directly. The problem is reading wider registers after writing partial registers, causing a partial-register stall on Core2. This is much less of a problem on Sandybridge and later, since they merge much more cheaply. mov ax, bx causes an extra merge, but even the OP's "fast" version has some stalls.
See the end of this answer for an alternate scalar inner loop which should be faster than the other two answers, using shld to shuffle bytes between registers. Pre-shifting things left by 8b outside the loop puts the byte we want at the top of each register, which makes this really cheap. It should run at slightly better than one iteration per 4 clock cycles on 32-bit Core2, saturating all three execution ports with no stalls. It should run at one iteration per 2.5c on Haswell.
To actually do this fast though, look at auto-vectorized compiler output, and maybe pare that down or re-implement with vector intrinsics.
Contrary to the claims that 16-bit operand-size instructions are slow, Core2 can in theory sustain 3 insns per clock alternating mov ax, bx and mov ecx, edx. There is no "mode switch" of any kind. (As everyone has pointed out, "context switch" is a terrible choice of made-up name, because it already has a specific technical meaning.)
The problem is partial-register stalls when you read a register that you previously wrote only a part of. Instead of forcing a write to ax to wait on the old contents of eax being ready (a false dependency), Intel P6-family CPUs track dependencies for partial regs separately. Reading the wider reg forces a merge, which stalls for 2 to 3 cycles according to Agner Fog. The other big problem with using 16-bit operand size is with immediate operands, where you can get LCP stalls in the decoders on Intel CPUs for immediates that don't fit in an imm8.
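For illustration, here's a minimal pattern (not from the OP's code) that hits both penalties on Core2:
mov  al, [esi]           ; writes only AL, which P6-family renames separately from EAX
add  ebx, eax            ; reads the full EAX: forces a merge of AL into EAX, stalling ~2-3c on Core2
cmp  word [edi], 0x1234  ; operand-size prefix + 16-bit immediate: length-changing prefix (LCP) stall in the decoders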
SnB-family is much more efficient, just inserting an extra uop to do the merging without stalling while it does so. AMD and Intel Silvermont (and P4) don't rename partial registers separately at all, so they do have "false" dependencies on the previous contents. In this case we later read the full register, so it's a true dependency because we want the merging, and those CPUs have an advantage. (Intel Haswell/Skylake (and maybe IvB) don't rename AL separately from RAX; they only rename AH/BH/CH/DH separately, and reading high8 registers has extra latency. See this Q&A about partial registers on HSW/SKL for the details.)
Neither of the partial-reg stalls is part of a long dependency chain, since the merged reg is overwritten in the next iteration. Apparently Core2 just stalls the front-end, or even the whole out-of-order execution core? I've been meaning to ask a question about how expensive partial-register slowdowns are on Core2, and how to measure the cost on SnB. @user786653's oprofile answer sheds some light on it. (It also has some really helpful C reverse-engineered from the OP's asm, which helps make it clear what this function is really trying to accomplish.)
Compiling that C with a modern gcc can produce vectorized asm that does the loop 4 dwords at a time, in an xmm register. It does a much better job when it can use SSE4.1, though. (And clang doesn't auto-vectorize this at all with -march=core2, but it does unroll a lot, probably interleaving multiple iterations to avoid partial-register stuff.) If you don't tell gcc that dest is aligned, it generates a huge amount of scalar prologue/epilogue around the vectorized loop to reach a point where it is aligned. It turns the integer args into vector constants (on the stack, since 32-bit code only has 8 vector registers). The inner loop is:
.L4:
movdqa xmm0, XMMWORD PTR [esp+64]
mov ecx, edx
add edx, 1
sal ecx, 4
paddd xmm0, xmm3
paddd xmm3, XMMWORD PTR [esp+16]
psrld xmm0, 8
movdqa xmm1, xmm0
movdqa xmm0, XMMWORD PTR [esp+80]
pand xmm1, xmm7
paddd xmm0, xmm2
paddd xmm2, XMMWORD PTR [esp+32]
psrld xmm0, 16
pand xmm0, xmm6
por xmm0, xmm1
movdqa xmm1, XMMWORD PTR [esp+48]
paddd xmm1, xmm4
paddd xmm4, XMMWORD PTR [esp]
pand xmm1, xmm5
por xmm0, xmm1
movaps XMMWORD PTR [eax+ecx], xmm0
cmp ebp, edx
ja .L4
Notice that there's one store in the whole loop. All the loads are just vectors it calculated earlier, stored on the stack as locals.
There are several ways to speed up the OP's code. Most obviously, we don't need to make a stack frame, which frees up ebp. The most obvious use for it is to hold cr, which the OP spills to the stack. user786653's triAsm4 does this, except he uses the insane-troll-logic variation of it: he makes a stack frame and sets up ebp as usual, but then stashes esp in a static location and uses it as a scratch register!! This will obviously break horribly if your program has any signal handlers, but otherwise it's fine (except for making debugging harder).
If you're going to go so crazy that you want to use esp as a scratch, copy the function args to static locations too, so you don't need a register to hold any pointers to stack memory. (Saving the old esp in an MMX register is also an option, so you can do this in re-entrant functions used from multiple threads at once. But not if you copy the args somewhere static, unless it's to thread-local storage with a segment override or something. You don't have to worry about re-entrancy from within the same thread, because the stack pointer is in an unusable state. Anything like a signal handler that could re-enter your function in the same thread will instead crash. >.<)
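A minimal sketch of the MMX-stash idea (the register choice and structure here are mine, not from either answer):
movd    mm0, esp        ; stash the incoming stack pointer in an MMX reg (per-thread, unlike a static slot)
; ... use esp as an extra scratch integer register: no push/pop and no esp-relative addressing in here ...
movd    esp, mm0        ; restore before ret (or before anything that needs a valid stack)
emms                    ; leave MMX state clean in case the caller uses x87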
Spilling cr is actually not the best choice: instead of using two registers for looping (a counter and a pointer), we can just keep a dst pointer in a register. Form the loop boundary by calculating an end pointer (one past the end: dst + 4*cnt), and use a cmp with a memory operand as the loop condition.
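Stripped down, that loop structure looks like this (end_spill is a placeholder name; the full listing below does exactly this):
lea     ecx, [edx + ecx*4]   ; end = dst + 4*cnt, one past the last element
mov     [end_spill], ecx     ; keep the boundary in memory; the loop only ever reads it
.top:
; ... loop body stores one dword at [edx] ...
add     edx, 4
cmp     edx, [end_spill]
jb      .top                 ; unsigned condition, so cmp/jb can macro-fuse on Core2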
Comparing against an end pointer with cmp/jb is actually better on Core2 than dec/jge anyway. Unsigned conditions can macro-fuse with cmp, and until SnB only cmp and test can macro-fuse at all. (This is also true for AMD Bulldozer, but cmp and test can fuse with any jcc on AMD.) SnB-family CPUs can macro-fuse dec/jge. Interestingly, Core2 can only macro-fuse signed compares (like jge) with test, not cmp. (An unsigned compare is the correct choice for an address anyway, since 0x80000000 isn't special but 0 is, so I'm not using jb just as a risky optimization.)
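To make the fusion rules concrete (a hedged illustration, not taken from either loop):
dec   ecx              ; Core2 can't macro-fuse dec with any branch (SnB-family can fuse dec/jge)
jge   .top             ; -> two separate fused-domain uops on Core2
cmp   edx, ebx
jb    .top             ; -> cmp + unsigned-condition branch fuse into one uop on Core2 (32-bit mode only)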
We can't pre-shift cb and dcb down to the low byte, because they need to maintain more precision internally. However, we can left-shift the other two, so they're up against the left edge of their registers. Right-shifting them down to their destination position won't leave any garbage high bits from possible overflow.
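To make the pre-shift concrete, assuming the colour accumulators are 16.16 fixed point with the colour byte at bits 23:16 (which is what the shift counts in the code imply; the value 0x00CC4000 is just an example):
shl  esi, 8          ; cg: 0x00CC4000 -> 0xCC400000, the wanted byte is now the top byte
mov  eax, esi
shr  eax, 24         ; eax = 0x000000CC = { 0 0 0 cg }, no masking needed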
Instead of merging into eax, we could do overlapping stores: store 4B from eax, then store the low 2B from bx. That would save the partial-reg stall in eax, but generate one for merging bh into ebx, so that's of limited value. Possibly a 4B write and two overlapping 1B stores would actually be good here, but that's starting to be a lot of stores. Still, they might be spread over enough other instructions to not bottleneck on the store port.
user786653's triAsm3 uses masking and or instructions for merging, which looks like a sensible approach for Core2. For AMD, Silvermont, or P4, using 8b and 16b mov instructions to merge partial registers is probably actually good. You can also take advantage of it on Ivybridge/Haswell/Skylake if you only write the low8 or low16 to avoid merging penalties. However, I came up with multiple improvements that require less masking.
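For reference, the partial-register-write merge (cheap on AMD, Silvermont, P4, and IvB and later, but exactly the stall we're avoiding on Core2) would look roughly like this; it isn't used in the listing that follows:
mov   ecx, ebp       ; ecx = { x cr ?? ?? }: cr's byte already in place at byte 2, low half is fraction
mov   cx, ax         ; write the low 16 bits = { cg cb }
mov   [edx], ecx     ; reading the full ecx is free where partial regs aren't renamed, but costs a merge (partial-reg stall) on Core2
Here's my full version, using and/or merging in the inner loop: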
; use defines you can put [] around so it's clear they're memory refs
; %define cr ebp+0x10
%define cr esp+something that depends on how much we pushed
%define dcr ebp+0x1c ;; change these to be esp-relative too, since ebp gets repurposed below.
%define dcg ebp+0x20
%define dcb ebp+0x24
; esp-relative offsets may be wrong, just quickly did it in my head without testing:
; we push 3 more regs after ebp, which was the point at which ebp snapshots esp in the stack-frame version. So add 0xc (i.e. mentally add 0x10 and subtract 4)
; 32bit code is dumb anyway. 64bit passes args in regs.
%define dest_arg esp+14
%define cnt_arg esp+18
... everything else
tri_pjc:
push ebp
push edi
push esi
push ebx ; only these 4 need to be preserved in the normal 32bit calling convention
mov ebp, [cr]
mov esi, [cg]
mov edi, [cb]
shl esi, 8 ; put the bits we want at the high edge, so we don't have to mask after shifting in zeros
shl [dcg], 8
shl edi, 8
shl [dcb], 8
; apparently the original code doesn't care if cr overflows into the top byte.
mov edx, [dest_arg]
mov ecx, [cnt_arg]
lea ecx, [edx + ecx*4] ; one-past the end, to be used as a loop boundary
mov [dest_arg], ecx ; spill it back to the stack, where we only need to read it.
ALIGN 16
.loop: ; SEE BELOW, this inner loop can be even more optimized
add esi, [dcg]
mov eax, esi
shr eax, 24 ; eax bytes = { 0 0 0 cg }
add edi, [dcb]
shld eax, edi, 8 ; eax bytes = { 0 0 cg cb }
add ebp, [dcr]
mov ecx, ebp
and ecx, 0xffff0000
or eax, ecx ; eax bytes = { x cr cg cb} where x is overflow from cr. Kill that by changing the mask to 0x00ff0000
; another shld to merge might be faster on other CPUs, but not core2
; merging with mov cx, ax would also be possible on CPUs where that's cheap (AMD, and Intel IvB and later)
mov DWORD [edx], eax
; alternatively:
; mov DWORD [edx], ebp
; mov WORD [edx], eax ; this insn replaces the mov/and/or merging
add edx, 4
cmp edx, [dest_arg] ; core2 can macro-fuse cmp/unsigned condition, but not signed
jb .loop
pop ebx
pop esi
pop edi
pop ebp
ret
I ended up with one more register than I needed, after doing the omit-frame-pointer trick and putting the loop boundary in memory. You could either cache something extra in a register, or avoid saving/restoring one. Maybe keeping the loop boundary in ebx is the best bet: it basically saves one prologue instruction. Keeping dcb or dcg in a register would require an extra insn in the prologue to load it. (The shifts with a memory destination are ugly and slow, even on Skylake, but small code size. They're not in the loop, and Core2 doesn't have a uop cache. Load/shift/store separately is still 3 uops, so you can't beat it unless you're going to keep the value in a reg instead of storing it.)
shld is a 2-uop insn on P6 (Core2). Luckily, it's easy to order the loop so it's the fifth instruction, preceded by four single-uop instructions. It should hit the decoders as the first uop in the 2nd group of 4, so it doesn't cause a delay in the frontend. (Core2 can decode 1-1-1-1, 2-1-1-1, 3-1-1-1, or 4-1-1-1 uops-per-insn patterns. SnB and later redesigned the decoders, and added a uop cache that makes decoding not usually the bottleneck, and can handle only groups of 1-1-1-1, 2-1-1, 3-1, and 4.)
shld is horrible on AMD K8, K10, Bulldozer-family, and Jaguar: 6 m-ops, 3c latency, and one per 3c throughput. It's great on Atom/Silvermont with 32-bit operand-size, but horrible with 16b or 64b registers.
This insn ordering might decode with the cmp as the last insn of a group, and then the jb by itself, making it not macro-fuse. This might give an extra advantage to the overlapping-stores method of merging, more than just saving a uop, if front-end effects are a factor for this loop. (And I suspect they would be, given the high degree of parallelism and that the loop-carried dep chains are short, so work for multiple iterations can be happening at once.)
So: fused-domain uops per iteration: 13 on Core2 (assuming macro-fusion, which might not actually happen), 12 on SnB-family. So IvB should run this at one iteration per 3c (assuming none of the 3 ALU ports are a bottleneck: the mov r,r instructions don't need an ALU port on IvB thanks to mov-elimination, and neither does the store; add and the booleans can use any port; shr and shld are the only ones that can't run on a wide choice of ports, and there are only two shifts per three cycles). Core2 will take 4c per iteration just to issue it even if it manages to avoid any frontend bottlenecks, and even longer to run it.
Even on Core2, we're probably running fast enough that spilling/reloading cr to the stack every iteration would be a bottleneck if we were still doing that. It adds a memory round-trip (~5c) to a loop-carried dependency chain, making the total dep chain length 6 cycles (including the add).
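In other words, the loop-carried chain would have looked like this (cr_spill is a stand-in name for the stack slot):
mov   ebp, [cr_spill]   ; reload cr; waits on the previous iteration's store (store-forwarding, ~5c)
add   ebp, [dcr]        ; +1c ALU latency
mov   [cr_spill], ebp   ; spill again for the next iteration -> ~6c minimum per iteration from this chain alone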
Hmm, actually even Core2 might come out ahead by using two shld insns to merge. It also saves another register!
ALIGN 16
;mov ebx, 111 ; IACA start
;db 0x64, 0x67, 0x90
.loop:
add ebp, [dcr]
mov eax, ebp
shr eax, 16 ; eax bytes = { 0 0 x cr} where x is overflow from cr. Kill that by pre-shifting cr and dcr like the others, and using shr 24 here
add esi, [dcg]
shld eax, esi, 8 ; eax bytes = { 0 x cr cg}
add edx, 4 ; this goes between the `shld`s to help with decoder throughput on pre-SnB, and to not break macro-fusion.
add edi, [dcb]
shld eax, edi, 8 ; eax bytes = { x cr cg cb}
mov DWORD [edx-4], eax
cmp edx, ebx ; use our spare register here
jb .loop ; core2 can macro-fuse cmp/unsigned condition, but not signed. Macro-fusion works in 32-bit mode only on Core2.
;mov ebx, 222 ; IACA end
;db 0x64, 0x67, 0x90
Per-iteration: SnB: 10 fused-domain uops. Core2: 12 fused-domain uops, so this is shorter than the previous version on Intel CPUs (but horrible on AMD). Using shld saves mov instructions because we can use it to non-destructively extract the high byte of the source.
Core2 can issue the loop at one iteration per 3 clocks. (It was Intel's first CPU with a 4-uop-wide pipeline.)
From Agner Fog's table for Merom/Conroe (first gen Core2) (note that David Kanter's block diagram has p2 and p5 reversed):
- shr: runs on p0/p5
- shld: 2 uops for p0/p1/p5? Agner's table for pre-Haswell doesn't say which uops can go where.
- mov r,r, add, and: p0/p1/p5
- fused cmp-and-branch: p5
- store: p3 and p4 (these micro-fuse into 1 fused-domain store uop)
- each load: p2. (all of the loads are micro-fused with ALU ops in the fused domain).
According to IACA, which has a mode for Nehalem but not Core2, most of the shld uops go to p1, with less than 0.6 uops on average from each insn running on the other ports. Nehalem has essentially the same execution units as Core2, and all the instructions involved here have the same uop costs and port requirements on NHM and Core2, so the analysis should carry over. I don't want to check everything on my own for this answer to a 5-year-old question; it was fun answering, though. :)
Anyway, according to IACA, uops should distribute well between ports. It figures Nehalem can run the loop at one iteration per 3.7 cycles, saturating all three execution ports. Its analysis looks good to me. (Note that I had to drop the memory operand from the cmp to make IACA not give stupid results.) That's clearly needed anyway, since pre-SnB can only do one load per cycle: we'd bottleneck on port2 with four loads in the loop.
IACA doesn't agree with Agner Fog's testing for IvB and SnB (it thinks shld is still 2 uops, when it's actually one, according to my testing on SnB). So its numbers are silly.
IACA looks correct for Haswell, where it says the bottleneck is the frontend. It thinks HSW can run it at one per 2.5c. (The loop buffer in Haswell at least can issue loops in a non-integer number of cycles per iteration. Sandybridge may be limited to whole numbers of cycles, where the taken loop-branch ends an issue-group.)
I also found I needed to use iaca.sh -no_interiteration, or else it would think there was an inter-iteration loop-carried dependency and think the loop would take 12c on NHM.