I've found that, contrary to their binary (two-state) nature, x86 CPUs seem to be very slow at bit-manipulation instructions such as SHR, BT, BTR, ROL and the like.
For example, I read somewhere that shifting or rotating by more than one position is considered slow (high latency, performance penalties and other scary stuff). It's supposedly even worse when the operands are in memory. (Isn't memory a two-state device, too?)
shl eax,1 ;ok
shl eax,7 ;slow?
So what makes them slow? It's kind of ironic that binary machines like CPUs are slow at bit manipulation when such operations are supposed to come naturally. It gives the impression that a binary CPU has a hard time shifting bits in place!
EDIT: After a second look at the SHL entry in the manual, it does seem to involve some heavy microcode logic!
From the SHL entry in Intel's manual, vol. 2:
...
Operation
TemporaryCount = Count & 0x1F;
TemporaryDestination = Destination;
while(TemporaryCount != 0) {
    if(Instruction == SAL || Instruction == SHL) {
        CF = MSB(Destination);
        Destination = Destination << 1;
    }
    //Instruction is SAR or SHR
    else {
        CF = LSB(Destination);
        if(Instruction == SAR) Destination = Destination / 2; //Signed divide, rounding toward negative infinity
        //Instruction is SHR
        else Destination = Destination / 2; //Unsigned divide
    }
    TemporaryCount = TemporaryCount - 1;
}
//Determine overflow
if((Count & 0x1F) == 1) {
    if(Instruction == SAL || Instruction == SHL) OF = MSB(Destination) ^ CF;
    else if(Instruction == SAR) OF = 0;
    //Instruction == SHR
    else OF = MSB(TemporaryDestination);
}
else OF = Undefined;
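To check what that loop actually computes, here is a minimal C sketch of just the SHL branch for 32-bit operands. It is my own transcription, not Intel's code: the names shl_result, shl_loop and shl_direct are made up for illustration, -1 stands in for "OF undefined", and CF is simply left at 0 when the masked count is 0 (the real instruction leaves the flags untouched in that case).

```c
#include <stdint.h>

/* Outcome of a 32-bit SHL as described by the manual's pseudocode. */
typedef struct {
    uint32_t dest;  /* shifted destination value */
    int cf;         /* carry flag: last bit shifted out (0 if nothing was shifted) */
    int of;         /* overflow flag, or -1 where the manual says "Undefined" */
} shl_result;

/* Literal transcription of the loop: one bit position per iteration. */
static shl_result shl_loop(uint32_t dest, uint8_t count)
{
    shl_result r = { dest, 0, -1 };
    unsigned n = count & 0x1F;            /* TemporaryCount = Count & 0x1F */
    for (unsigned i = 0; i < n; i++) {
        r.cf = (int)((r.dest >> 31) & 1); /* CF = MSB(Destination) */
        r.dest <<= 1;                     /* Destination = Destination << 1 */
    }
    if (n == 1)                           /* OF is only defined for 1-bit shifts */
        r.of = (int)((r.dest >> 31) & 1) ^ r.cf;
    return r;
}

/* The same result computed with a single shift and no loop. */
static shl_result shl_direct(uint32_t dest, uint8_t count)
{
    shl_result r = { dest, 0, -1 };
    unsigned n = count & 0x1F;
    if (n != 0) {
        r.cf = (int)((dest >> (32 - n)) & 1); /* last bit that falls off the top */
        r.dest = dest << n;
    }
    if (n == 1)
        r.of = (int)((r.dest >> 31) & 1) ^ r.cf;
    return r;
}
```

For example, shl_loop(1, 7) and shl_direct(1, 7) both leave 128 in dest. The sketch only shows that the loop and a single << agree on the result; by itself it says nothing about how the hardware carries out the shift.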
It's unbelievable to see such simple Boolean algebra turned into an implementation nightmare.