
I found that, contrary to their binary / bi-state nature, x86 CPUs are very slow at processing bit-manipulation instructions such as SHR, BT, BTR, ROL and the like.

For example, I've read somewhere that shifting / rotating by more than 1 position is considered slow (with high latency, performance penalties and all that scary stuff). It's even worse when the operands are in memory (aren't memory chips bi-state peripherals, too?)

shl eax,1  ;ok
shl eax,7  ;slow?

So what's making them slow? It's kind of ironic that binary machines like CPUs would be slow at bit manipulation, when such operations are supposed to come naturally to them. It gives the impression that a binary CPU has a hard time shifting bits in place!

EDIT: After taking a second look at the SHL entry in the manual, it does involve some heavy microcode logic!

From Intel's vol.2 manual for shl...

Operation
TemporaryCount = Count & 0x1F;
TemporaryDestination = Destination;
while(TemporaryCount != 0) {
    if(Instruction == SAL || Instruction == SHL) {
        CF = MSB(Destination);
        Destination = Destination << 1;
    }
    //instruction is SAR or SHR
    else {
        CF = LSB(Destination);
        if(Instruction == SAR) Destination = Destination / 2; //Signed divide, rounding toward negative infinity
        //Instruction is SHR
        else Destination = Destination / 2; //Unsigned divide
    }
    TemporaryCount = TemporaryCount - 1;
}
//Determine overflow
if((Count & 0x1F) == 1) {
    if(Instruction == SAL || Instruction == SHL) OF = MSB(Destination) ^ CF;
    else if(Instruction == SAR) OF = 0;
    //Instruction == SHR
    else OF = MSB(TemporaryDestination);
}
else OF = Undefined;

Unbelievable to see such simple boolean algebra turned into an implementation nightmare.

Peter Cordes
royalfinest
    Chip design is all about trade-offs. If something is slow in your book it's because something else was deemed more worthy of real estate on the chip. – 500 - Internal Server Error Dec 28 '18 at 15:52
  • compilers don't use `shl eax,1`, they'd use `add eax,eax` because it's just as short and can run on more execution ports. Fun fact: variable-count shifts *are* slightly slower on Intel CPUs (3 uops on Intel Sandybridge-family, or a weird potential stall on Intel P6-family if the flag result is used). See [INC instruction vs ADD 1: Does it matter?](https://stackoverflow.com/q/36510095) for more about the cumbersome x86 semantics where an immediate or variable count shift has to leave flags unmodified if count=0, so the instruction needs a dependency on the old flags. – Peter Cordes Dec 30 '18 at 06:06

2 Answers


That is just the pseudo-code for the instruction, specifying exactly what it does. The instruction is not actually implemented like this. In practice, all modern CPUs have barrel shifters or similar, allowing them to shift by arbitrary amounts in a single cycle. See for example Agner Fog's tables where it shows a latency of 1 for almost all bit-fiddling instructions.

A few bit-fiddling instructions are slower, here are some examples:

  • `bt`, `btr`, `bts`, and `btc` are slower when used with memory operands because of (a) the read-modify-write operation and (b) the bit-string indexing they do
  • `rcr` with a rotate amount of more than 1 is slow because this instruction is almost never needed and thus not optimised
  • `pdep` and `pext` are slightly slower on Intel and much slower on AMD, probably because their implementation is pretty involved and splitting it up makes the hardware easier.

On old processors (say, the 8086), the CPU would take as many cycles as the shift amount, performing one shift per cycle. This kind of implementation allows the ALU to be used for shifting without any extra hardware, reducing the number of gates needed for the processor. No modern CPU I know of has this performance behaviour.

fuz
  • `bt` doesn't have a read-modify-write operation, right? (I assume that's a typo?) – user541686 May 31 '21 at 08:24
  • @user541686 Indeed it doesn't, but (b) still applies of course. – fuz May 31 '21 at 08:35
  • With respect to (b) for `bt`, is it any slower than the usual alternatives (like `x & (1 << n)` and such)? – user541686 May 31 '21 at 09:55
  • @user541686 If a memory operand is used, yes. This is because `bt` supports indexing into bit strings and computing the address for that takes extra cycles. – fuz May 31 '21 at 10:39

Just a note.

shl eax,1 ; opcode: d1 e0
shl eax,7 ; opcode: c1 e0 07

are actually two different instructions with different opcodes, potentially handled by different logic blocks of the ALU. They share the same mnemonic in assembly, which can be confusing, but from the CPU's point of view they are distinct instructions with distinct opcodes and encodings.

ZarathustrA
    Some assemblers distinguish `shl eax` from `shl eax,1` for this reason. – fuz Dec 30 '18 at 13:53
  • Could you mention if their performance characteristics are different? (I assume this might matter for superscalar CPUs?) – user541686 May 31 '21 at 08:26
  • The author of the question claims that on his computer their performance differs significantly, and asks about the reason. I just highlighted a potential reason for that. From the assembly source's viewpoint, it is not at all evident that these are two different instructions. Note! Be aware that instruction execution performance is not part of the ISA specification and is processor-model specific! – ZarathustrA Jun 02 '21 at 06:54