Can a variable shift generate a partial register stall (or register-recombining µops) on `ecx`? If so, on which microarchitecture(s)?
I have tested this on Core2 (65nm), which seems to read only `cl`.
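```nasm
_shiftbench:
    push rbx                ; bl is written below and rbx is callee-saved
    mov edx, -10000000      ; loop counter, counts up to zero
    mov ecx, 5              ; full-width write to ecx
_shiftloop:
    mov bl, 5               ; replace bl by cl to see possible recombining
    shl eax, cl             ; variable shift, count taken from cl
    add edx, 1
    jnz _shiftloop
    pop rbx
    ret
```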
Replacing `mov bl, 5` with `mov cl, 5` made no difference, which it would have if register recombining were going on. The effect can be demonstrated by replacing `shl eax, cl` with `add eax, ecx`: in my tests the version with `add` experienced a 2.8x slowdown when writing to `cl` instead of `bl`.
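For reference, the `add` control experiment described above might look like the sketch below. It is the same loop with the two substitutions applied. The labels `_addbench` and `_addloop` are made up for illustration, and NASM syntax is assumed.

```nasm
; Control variant: partial write to cl followed by a full-width read of ecx.
; _addbench and _addloop are hypothetical labels, not from the original test.
_addbench:
    push rbx                ; kept so the bl variant can be swapped back in
    mov edx, -10000000      ; loop counter, counts up to zero
    mov ecx, 5              ; full-width write to ecx
_addloop:
    mov cl, 5               ; partial write; change cl to bl for the fast case
    add eax, ecx            ; reads all of ecx, so cl must be merged back in
    add edx, 1
    jnz _addloop
    pop rbx
    ret
```

Timing this loop with `mov cl, 5` against the same loop with `mov bl, 5` should show whether the microarchitecture pays for recombining the register. The `shl` version is then compared in the same way.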
Test results:
- Merom: no stall observed
- Penryn: no stall observed
- Nehalem: no stall observed
Update: the new `shrx` group of shifts in Haswell does show that stall. Their shift-count argument is not written as an 8-bit register, so that might have been expected, but the textual representation really doesn't say anything about such microarchitectural details.