Can a variable shift generate a partial register stall (or register recombining µops) on ecx? If so, on which microarchitecture(s)?

I have tested this on Core2 (65nm), which seems to read only cl.

_shiftbench:
    push rbx
    mov edx, -10000000
    mov ecx, 5
  _shiftloop:
    mov bl, 5   ; replace bl with cl to see possible recombining
    shl eax, cl
    add edx, 1
    jnz _shiftloop
    pop rbx
    ret

Replacing mov bl, 5 with mov cl, 5 made no difference. If register recombining were going on, it would have: replacing shl eax, cl with add eax, ecx demonstrates the effect (in my tests the add version ran 2.8x slower when writing to cl instead of bl).
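
The add control mentioned above can be sketched as follows. This assumes the same NASM-style syntax and setup as _shiftbench; the labels _addbench and _addloop are made up for illustration.

_addbench:
    push rbx
    mov edx, -10000000
    mov ecx, 5
  _addloop:
    mov cl, 5      ; partial write to ecx; use bl here for the baseline
    add eax, ecx   ; reads all of ecx, so the cl write has to be merged in
    add edx, 1
    jnz _addloop
    pop rbx
    ret

Timing this once with cl and once with bl gives the reference slowdown to compare the shl loop against.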


Test results:

  • Merom: no stall observed
  • Penryn: no stall observed
  • Nehalem: no stall observed

Update: the new shrx group of shifts (shlx, shrx, sarx) introduced with Haswell does show that stall. Their shift-count operand is written as a full register rather than an 8-bit one, so that might have been expected, although the assembly syntax by itself says nothing about such microarchitectural details.
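
A sketch of the corresponding shrx loop, assuming an assembler and CPU with BMI2 support. The labels are made up for illustration, and only the shift instruction differs from _shiftbench.

_shrxbench:
    push rbx
    mov edx, -10000000
    mov ecx, 5
  _shrxloop:
    mov cl, 5            ; use bl here for the baseline
    shrx eax, eax, ecx   ; count operand is written as ecx, not cl
    add edx, 1
    jnz _shrxloop
    pop rbx
    ret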

phuclv
harold
  • There is no opcode for `shl` by `ecx`. Why do you think there is? – interjay Oct 27 '12 at 20:06
  • @interjay it's a synonym, some assemblers allow that form. – harold Oct 27 '12 at 20:09
  • If it's a synonym, how do you expect it to have a different effect? – interjay Oct 27 '12 at 20:10
  • @interjay The original title of this post was misleading. The real question is in the second paragraph. I have changed the title to contain the real question. – rob mayoff Oct 27 '12 at 20:11
  • This seems like it's gonna be a nightmare to benchmark... Even if it does stall, you'll have to fight with the OOE to make sure it doesn't get hidden away. – Mysticial Oct 27 '12 at 20:11
  • @Mysticial I figured making it a loop-carried dependency chain would do the trick, seems to work with `add` anyway – harold Oct 27 '12 at 20:12
  • @harold: If rob is right with his edit, please remove all reference to `ecx` from your question as it's just confusing. – interjay Oct 27 '12 at 20:12
  • @interjay he's right, and I'm bad at titles sorry, but `ecx` is still relevant isn't it? If there would be a stall, it would effectively act as though it was shifting by `ecx` instead of `cl` – harold Oct 27 '12 at 20:14

1 Answer

As currently phrased (“Can a shift using the CL register …”) the question's title contains its own answer: with a modern processor, there is never a partial register stall on CL because CL can never be recombined from something smaller.

Yes, the processor knows that the amount you are shifting by is effectively contained in CL, or to be precise in its 5 or 6 least significant bits (5 for operands of up to 32 bits, 6 for 64-bit operands). One way it could have stalled on ECX is if the granularity at which it tracked instruction dependencies did not go below full registers. This worry is obsolete, though: the most recent Intel processor that would have considered the whole ECX register as a dependency was the Pentium 4. See Agner Fog's unofficial optimization manual, page 121. But then again, with the P4 this would not be called a partial register stall; the program could only be the victim of a false dependency (say, if CH was modified just before the shift).
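
To make the CH example concrete, here is a sketch of the instruction sequence in question (NASM-style syntax assumed, labels hypothetical). The shift reads only cl, so a core that tracks cl on its own has no reason to wait for the mov ch, while a core that tracks ecx only as a whole would make the shift wait on it. Whether that shows up as a measurable slowdown is not something this sketch establishes.

_chdepbench:
    mov edx, -10000000
    mov ecx, 5
  _chdeploop:
    mov ch, 1      ; writes only bits 8..15 of ecx; cl is unchanged
    shl eax, cl    ; the count comes from cl alone
    add edx, 1
    jnz _chdeploop
    ret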

Pascal Cuoq
  • Thanks. Unfortunately by now I had figured out both the answer and that this was the worst question I ever asked. Oh well.. – harold Nov 08 '12 at 23:23
  • @harold Don't beat yourself up, you came to StackOverflow after having made some effort to measure an empirical answer yourself, and you even checked that your measurements made sense by swapping in an instruction known to cause a partial register stall. Your question is a nice addition to the site if only for the methodology. – Pascal Cuoq Nov 08 '12 at 23:28