Summary: 16-bit instructions are not the problem directly. The problem is reading wider registers after writing partial registers, causing a partial-register stall on Core2. This is much less of a problem on Sandybridge and later, since they merge much more cheaply. mov ax, bx causes an extra merge, but even the OP's "fast" version has some stalls.
See the end of this answer for an alternate scalar inner loop which should be faster than the other two answers, using shld to shuffle bytes between registers. Pre-shifting things left by 8b outside the loop puts the byte we want at the top of each register, which makes this really cheap. It should run at slightly better than one iteration per 4 clock cycles on 32-bit Core2, saturating all three execution ports with no stalls. It should run at one iteration per 2.5c on Haswell.
To actually do this fast though, look at auto-vectorized compiler output, and maybe pare that down or re-implement with vector intrinsics.
Contrary to the claims that 16-bit operand-size instructions are slow, Core2 can in theory sustain 3 insns per clock alternating mov ax, bx and mov ecx, edx. There is no "mode switch" of any kind. (As everyone has pointed out, "context switch" is a terrible choice of made-up name, because it already has a specific technical meaning.)
The problem is partial-register stalls when you read a register that you previously wrote only a part of. Instead of forcing a write to ax to wait on the old contents of eax being ready (a false dependency), Intel P6-family CPUs track dependencies for partial regs separately. Reading the wider reg forces a merge, which stalls for 2 to 3 cycles according to Agner Fog. The other big problem with using 16-bit operand size is with immediate operands, where you can get LCP stalls in the decoders on Intel CPUs for immediates that don't fit in an imm8.
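For illustration, here's a minimal pattern (not from the OP's code) that hits both penalties on Core2:
mov  al, [esi]           ; writes only AL, which P6-family renames separately from EAX
add  ebx, eax            ; reads the full EAX: forces a merge of AL into EAX, stalling ~2-3c on Core2
cmp  word [edi], 0x1234  ; operand-size prefix + 16-bit immediate: length-changing prefix (LCP) stall in the decoders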
SnB-family is much more efficient, just inserting an extra uop to do the merging without stalling while it does so. AMD and Intel Silvermont (and P4) don't rename partial registers separately at all, so they do have "false" dependencies on the previous contents. In this case we later read the full register, so it's a true dependency because we want the merging, and those CPUs have an advantage. (Intel Haswell/Skylake (and maybe IvB) don't rename AL separately from RAX; they only rename AH/BH/CH/DH separately, and reading high8 registers has extra latency. See this Q&A about partial registers on HSW/SKL for the details.)
Neither of the partial-reg stalls is part of a long dependency chain, since the merged reg is overwritten in the next iteration. Apparently Core2 just stalls the front-end, or even the whole out-of-order execution core? I've been meaning to ask a question about how expensive partial-register slowdowns are on Core2, and how to measure the cost on SnB. @user786653's oprofile answer sheds some light on it. (It also has some really helpful C reverse-engineered from the OP's asm, which helps make it clear what this function is really trying to accomplish.)
Compiling that C with a modern gcc can produce vectorized asm that does the loop 4 dwords at a time, in an xmm register. It does a much better job when it can use SSE4.1, though. (And clang doesn't auto-vectorize this at all with -march=core2, but it does unroll a lot, probably interleaving multiple iterations to avoid partial-register stuff.) If you don't tell gcc that dest is aligned, it generates a huge amount of scalar prologue/epilogue around the vectorized loop to reach a point where it is aligned. It turns the integer args into vector constants (on the stack, since 32-bit code only has 8 vector registers). The inner loop is:
.L4:
movdqa xmm0, XMMWORD PTR [esp+64]
mov ecx, edx
add edx, 1
sal ecx, 4
paddd xmm0, xmm3
paddd xmm3, XMMWORD PTR [esp+16]
psrld xmm0, 8
movdqa xmm1, xmm0
movdqa xmm0, XMMWORD PTR [esp+80]
pand xmm1, xmm7
paddd xmm0, xmm2
paddd xmm2, XMMWORD PTR [esp+32]
psrld xmm0, 16
pand xmm0, xmm6
por xmm0, xmm1
movdqa xmm1, XMMWORD PTR [esp+48]
paddd xmm1, xmm4
paddd xmm4, XMMWORD PTR [esp]
pand xmm1, xmm5
por xmm0, xmm1
movaps XMMWORD PTR [eax+ecx], xmm0
cmp ebp, edx
ja .L4
Notice that there's one store in the whole loop. All the loads are just vectors it calculated earlier, stored on the stack as locals.
There are several ways to speed up the OP's code. Most obviously, we don't need to make a stack frame, which frees up ebp. The most obvious use for it is to hold cr, which the OP spills to the stack. user786653's triAsm4 does this, except he uses the insane-troll-logic variation of it: he makes a stack frame and sets up ebp as usual, but then stashes esp in a static location and uses it as a scratch register!! This will obviously break horribly if your program has any signal handlers, but otherwise it's fine (except for making debugging harder).
If you're going to go so crazy that you want to use esp as a scratch, copy the function args to static locations too, so you don't need a register to hold any pointers to stack memory. (Saving the old esp in an MMX register is also an option, so you can do this in re-entrant functions used from multiple threads at once. But not if you copy the args somewhere static, unless it's to thread-local storage with a segment override or something. You don't have to worry about re-entrancy from within the same thread, because the stack pointer is in an unusable state. Anything like a signal handler that could re-enter your function in the same thread will instead crash. >.<)
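A minimal sketch of the MMX-stash idea (the register choice and structure here are mine, not from either answer):
movd    mm0, esp        ; stash the incoming stack pointer in an MMX reg (per-thread, unlike a static slot)
; ... use esp as an extra scratch integer register: no push/pop and no esp-relative addressing in here ...
movd    esp, mm0        ; restore before ret (or before anything that needs a valid stack)
emms                    ; leave MMX state clean in case the caller uses x87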
Spilling cr is actually not the best choice: instead of using two registers for looping (a counter and a pointer), we can just keep a dst pointer in a register. Form the loop boundary by calculating an end pointer (one past the end: dst + 4*cnt), and use a cmp with a memory operand as the loop condition.
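Stripped down, that loop structure looks like this (end_spill is a placeholder name; the full listing below does exactly this):
lea     ecx, [edx + ecx*4]   ; end = dst + 4*cnt, one past the last element
mov     [end_spill], ecx     ; keep the boundary in memory; the loop only ever reads it
.top:
; ... loop body stores one dword at [edx] ...
add     edx, 4
cmp     edx, [end_spill]
jb      .top                 ; unsigned condition, so cmp/jb can macro-fuse on Core2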
Comparing against an end pointer with cmp/jb is actually better on Core2 than dec/jge anyway. Unsigned conditions can macro-fuse with cmp, and until SnB only cmp and test can macro-fuse at all. (This is also true for AMD Bulldozer, but cmp and test can fuse with any jcc on AMD.) SnB-family CPUs can macro-fuse dec/jge. Interestingly, Core2 can only macro-fuse signed compares (like jge) with test, not cmp. (An unsigned compare is the correct choice for an address anyway, since 0x80000000 isn't special but 0 is, so I'm not using jb just as a risky optimization.)
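To make the fusion rules concrete (a hedged illustration, not taken from either loop):
dec   ecx              ; Core2 can't macro-fuse dec with any branch (SnB-family can fuse dec/jge)
jge   .top             ; -> two separate fused-domain uops on Core2
cmp   edx, ebx
jb    .top             ; -> cmp + unsigned-condition branch fuse into one uop on Core2 (32-bit mode only)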
We can't pre-shift cb and dcb down to the low byte, because they need to maintain more precision internally. However, we can left-shift the other two, so they're up against the left edge of their registers. Right-shifting them down to their destination position won't leave any garbage high bits from possible overflow.
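To make the pre-shift concrete, assuming the colour accumulators are 16.16 fixed point with the colour byte at bits 23:16 (which is what the shift counts in the code imply; the value 0x00CC4000 is just an example):
shl  esi, 8          ; cg: 0x00CC4000 -> 0xCC400000, the wanted byte is now the top byte
mov  eax, esi
shr  eax, 24         ; eax = 0x000000CC = { 0 0 0 cg }, no masking needed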
Instead of merging into eax, we could do overlapping stores: store 4B from eax, then store the low 2B from bx. That would save the partial-reg stall in eax, but generate one for merging bh into ebx, so that's of limited value. Possibly a 4B write and two overlapping 1B stores would actually be good here, but that's starting to be a lot of stores. Still, they might be spread over enough other instructions to not bottleneck on the store port.
user786653's triAsm3 uses masking and or instructions for merging, which looks like a sensible approach for Core2. For AMD, Silvermont, or P4, using 8b and 16b mov instructions to merge partial registers is probably actually good. You can also take advantage of it on Ivybridge/Haswell/Skylake if you only write the low8 or low16 to avoid merging penalties. However, I came up with multiple improvements that require less masking.
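For reference, the partial-register-write merge (cheap on AMD, Silvermont, P4, and IvB and later, but exactly the stall we're avoiding on Core2) would look roughly like this; it isn't used in the listing that follows:
mov   ecx, ebp       ; ecx = { x cr ?? ?? }: cr's byte already in place at byte 2, low half is fraction
mov   cx, ax         ; write the low 16 bits = { cg cb }
mov   [edx], ecx     ; reading the full ecx is free where partial regs aren't renamed, but costs a merge (partial-reg stall) on Core2
Here's my full version, using and/or merging in the inner loop: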
; use defines you can put [] around so it's clear they're memory refs
; %define cr ebp+0x10
%define cr esp+something that depends on how much we pushed
%define dcr ebp+0x1c ;; change these to be esp-relative too, since ebp gets repurposed below.
%define dcg ebp+0x20
%define dcb ebp+0x24
; esp-relative offsets may be wrong, just quickly did it in my head without testing:
; we push 3 more regs after ebp, which was the point at which ebp snapshots esp in the stack-frame version. So add 0xc (i.e. mentally add 0x10 and subtract 4)
; 32bit code is dumb anyway. 64bit passes args in regs.
%define dest_arg esp+14
%define cnt_arg esp+18
... everything else
tri_pjc:
push ebp
push edi
push esi
push ebx ; only these 4 need to be preserved in the normal 32bit calling convention
mov ebp, [cr]
mov esi, [cg]
mov edi, [cb]
shl esi, 8 ; put the bits we want at the high edge, so we don't have to mask after shifting in zeros
shl [dcg], 8
shl edi, 8
shl [dcb], 8
; apparently the original code doesn't care if cr overflows into the top byte.
mov edx, [dest_arg]
mov ecx, [cnt_arg]
lea ecx, [edx + ecx*4] ; one-past the end, to be used as a loop boundary
mov [dest_arg], ecx ; spill it back to the stack, where we only need to read it.
ALIGN 16
.loop: ; SEE BELOW, this inner loop can be even more optimized
add esi, [dcg]
mov eax, esi
shr eax, 24 ; eax bytes = { 0 0 0 cg }
add edi, [dcb]
shld eax, edi, 8 ; eax bytes = { 0 0 cg cb }
add ebp, [dcr]
mov ecx, ebp
and ecx, 0xffff0000
or eax, ecx ; eax bytes = { x cr cg cb} where x is overflow from cr. Kill that by changing the mask to 0x00ff0000
; another shld to merge might be faster on other CPUs, but not core2
; merging with mov cx, ax would also be possible on CPUs where that's cheap (AMD, and Intel IvB and later)
mov DWORD [edx], eax
; alternatively:
; mov DWORD [edx], ebp
; mov WORD [edx], eax ; this insn replaces the mov/and/or merging
add edx, 4
cmp edx, [dest_arg] ; core2 can macro-fuse cmp/unsigned condition, but not signed
jb .loop
pop ebx
pop esi
pop edi
pop ebp
ret
I ended up with one more register than I needed, after doing the omit-frame-pointer trick and putting the loop boundary in memory. You could either cache something extra in a register, or avoid saving/restoring one. Maybe keeping the loop boundary in ebx is the best bet: it basically saves one prologue instruction. Keeping dcb or dcg in a register would require an extra insn in the prologue to load it. (The shifts with a memory destination are ugly and slow, even on Skylake, but small code size. They're not in the loop, and Core2 doesn't have a uop cache. Load/shift/store separately is still 3 uops, so you can't beat it unless you're going to keep the value in a reg instead of storing it.)
shld is a 2-uop insn on P6 (Core2). Luckily, it's easy to order the loop so it's the fifth instruction, preceded by four single-uop instructions. It should hit the decoders as the first uop in the 2nd group of 4, so it doesn't cause a delay in the frontend. (Core2 can decode 1-1-1-1, 2-1-1-1, 3-1-1-1, or 4-1-1-1 uops-per-insn patterns. SnB and later redesigned the decoders, and added a uop cache that makes decoding not usually the bottleneck, and can handle only groups of 1-1-1-1, 2-1-1, 3-1, and 4.)
shld is horrible on AMD K8, K10, Bulldozer-family, and Jaguar: 6 m-ops, 3c latency, and one per 3c throughput. It's great on Atom/Silvermont with 32-bit operand-size, but horrible with 16b or 64b registers.
This insn ordering might decode with the cmp as the last insn of a group, and then the jb by itself, making it not macro-fuse. This might give an extra advantage to the overlapping-stores method of merging, more than just saving a uop, if front-end effects are a factor for this loop. (And I suspect they would be, given the high degree of parallelism and that the loop-carried dep chains are short, so work for multiple iterations can be happening at once.)
So: fused-domain uops per iteration: 13 on Core2 (assuming macro-fusion, which might not actually happen), 12 on SnB-family. So IvB should run this at one iteration per 3c (assuming none of the 3 ALU ports are a bottleneck: the mov r,r instructions don't need an ALU port on IvB thanks to mov-elimination, and neither does the store; add and the booleans can use any port; shr and shld are the only ones that can't run on a wide choice of ports, and there are only two shifts per three cycles). Core2 will take 4c per iteration just to issue it even if it manages to avoid any frontend bottlenecks, and even longer to run it.
Even on Core2, we're probably running fast enough that spilling/reloading cr to the stack every iteration would be a bottleneck if we were still doing that. It adds a memory round-trip (~5c) to a loop-carried dependency chain, making the total dep chain length 6 cycles (including the add).
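In other words, the loop-carried chain would have looked like this (cr_spill is a stand-in name for the stack slot):
mov   ebp, [cr_spill]   ; reload cr; waits on the previous iteration's store (store-forwarding, ~5c)
add   ebp, [dcr]        ; +1c ALU latency
mov   [cr_spill], ebp   ; spill again for the next iteration -> ~6c minimum per iteration from this chain alone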
Hmm, actually even Core2 might come out ahead by using two shld insns to merge. It also saves another register!
ALIGN 16
;mov ebx, 111 ; IACA start
;db 0x64, 0x67, 0x90
.loop:
add ebp, [dcr]
mov eax, ebp
shr eax, 16 ; eax bytes = { 0 0 x cr} where x is overflow from cr. Kill that by pre-shifting cr and dcr like the others, and using shr 24 here
add esi, [dcg]
shld eax, esi, 8 ; eax bytes = { 0 x cr cg}
add edx, 4 ; this goes between the `shld`s to help with decoder throughput on pre-SnB, and to not break macro-fusion.
add edi, [dcb]
shld eax, edi, 8 ; eax bytes = { x cr cg cb}
mov DWORD [edx-4], eax
cmp edx, ebx ; use our spare register here
jb .loop ; core2 can macro-fuse cmp/unsigned condition, but not signed. Macro-fusion works in 32-bit mode only on Core2.
;mov ebx, 222 ; IACA end
;db 0x64, 0x67, 0x90
Per-iteration: SnB: 10 fused-domain uops. Core2: 12 fused-domain uops, so this is shorter than the previous version on Intel CPUs (but horrible on AMD). Using shld saves mov instructions because we can use it to non-destructively extract the high byte of the source.
Core2 can issue the loop at one iteration per 3 clocks. (It was Intel's first CPU with a 4-uop-wide pipeline.)
From Agner Fog's table for Merom/Conroe (first gen Core2) (note that David Kanter's block diagram has p2 and p5 reversed):
- shr: runs on p0/p5
- shld: 2 uops for p0/p1/p5? Agner's table for pre-Haswell doesn't say which uops can go where.
- mov r,r, add, and: p0/p1/p5
- fused cmp-and-branch: p5
- store: p3 and p4 (these micro-fuse into 1 fused-domain store uop)
- each load: p2. (all of the loads are micro-fused with ALU ops in the fused domain).
According to IACA, which has a mode for Nehalem but not Core2, most of the shld uops go to p1, with less than 0.6 uops on average from each insn running on the other ports. Nehalem has essentially the same execution units as Core2, and all the instructions involved here have the same uop costs and port requirements on NHM and Core2, so the analysis should carry over. I don't want to check everything on my own for this answer to a 5-year-old question; it was fun answering, though. :)
Anyway, according to IACA, uops should distribute well between ports. It figures Nehalem can run the loop at one iteration per 3.7 cycles, saturating all three execution ports. Its analysis looks good to me. (Note that I had to drop the memory operand from the cmp to make IACA not give stupid results.) That's clearly needed anyway, since pre-SnB can only do one load per cycle: we'd bottleneck on port2 with four loads in the loop.
IACA doesn't agree with Agner Fog's testing for IvB and SnB (it thinks shld is still 2 uops, when it's actually one, according to my testing on SnB). So its numbers are silly.
IACA looks correct for Haswell, where it says the bottleneck is the frontend. It thinks HSW can run it at one per 2.5c. (The loop buffer in Haswell at least can issue loops in a non-integer number of cycles per iteration. Sandybridge may be limited to whole numbers of cycles, where the taken loop-branch ends an issue-group.)
I also found I needed to use iaca.sh -no_interiteration, or else it would think there was an inter-iteration loop-carried dependency and think the loop would take 12c on NHM.