Why do 16-bit programs run slower on x86 operating systems?

Question

I'm studying assembly, and the author of the material I'm reading says that programs compiled for 16-bit rotate more slowly on x86 operating systems, and that the same goes for x64: programs compiled for 32-bit run slower on x64.

Why does this happen? What happens in memory and in the processor that makes 16-bit or 32-bit programs rotate more slowly on 32-bit and 64-bit systems, respectively?

Short answer: because the CPU is optimized for the benefit of recently built code, at the expense of old, legacy code. — Seva Alekseyev, Nov 22 '13 at 16:23

score 2 · Accepted Answer · answered Nov 22 '13 at 15:27

I can speak to 16-bit programs running slower on 32-bit systems. When Intel went from 16 bits to 32 bits, they had to expand the instruction set to cope with the new 32-bit registers while maintaining binary compatibility with 16-bit programs.

To accomplish that, they added a prefix, 66h if I remember correctly, that, when applied to an instruction that uses 16-bit registers, makes that instruction use 32-bit registers instead.

For instance, a 16-bit instruction like MOV AX,BX, prefixed with 66h, turns into MOV EAX,EBX.

But this imposes a penalty on the new 32-bit instructions: each one carries an extra prefix byte, which costs at least one additional memory fetch cycle. To avoid this, Intel created the so-called 32-bit segments and 16-bit segments.

Basically, any piece of code must reside in a code segment. Before the 80386, all segments used 16-bit instructions, and all instructions were assumed to use 16-bit registers.

32-bit segments contain code as well, but there every instruction is assumed to use 32-bit registers, so in a 32-bit segment the opcode of MOV EAX,EBX is the same as the opcode of MOV AX,BX in a 16-bit segment.

This means a program doesn't have to use the 66h prefix for every 32-bit instruction, so the penalty is gone.

But what if I have to use 16-bit registers in a program contained in a 32-bit segment? Those instructions will need the 66h prefix.

So: instructions that use 16-bit registers are unprefixed in 16-bit segments and prefixed in 32-bit segments. Instructions that use 32-bit registers are unprefixed in 32-bit segments and prefixed in 16-bit segments.
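The prefix rule above can be sketched in a few lines of Python. This is only an illustration, not an assembler: `encode_mov` and its `segment_bits` parameter are hypothetical names, and the sketch handles nothing but register-to-register MOV using the `89 /r` opcode form.

```python
# Hypothetical helper (not from the original answer): encodes a
# register-to-register "MOV dst, src" with opcode 89 /r and adds the
# 66h operand-size prefix when the operand size differs from the
# default operand size of the code segment.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["e" + name for name in REG16]

OPERAND_SIZE_PREFIX = 0x66


def encode_mov(dst: str, src: str, segment_bits: int) -> bytes:
    """Return the machine-code bytes for MOV dst, src in a 16- or 32-bit segment."""
    if segment_bits not in (16, 32):
        raise ValueError("segment_bits must be 16 or 32")
    dst, src = dst.lower(), src.lower()
    if dst in REG16 and src in REG16:
        operand_bits, table = 16, REG16
    elif dst in REG32 and src in REG32:
        operand_bits, table = 32, REG32
    else:
        raise ValueError("both operands must be registers of the same size")

    # ModR/M byte: mod=11 (register direct), reg=source, r/m=destination.
    modrm = 0xC0 | (table.index(src) << 3) | table.index(dst)
    body = bytes([0x89, modrm])

    # Same opcode bytes in both segment types; only the prefix differs.
    if operand_bits != segment_bits:
        return bytes([OPERAND_SIZE_PREFIX]) + body
    return body


for seg in (16, 32):
    for dst, src in (("ax", "bx"), ("eax", "ebx")):
        code = encode_mov(dst, src, seg)
        print(f"{seg}-bit segment: MOV {dst.upper()},{src.upper()} -> {code.hex(' ')}")
```

The unprefixed bytes are identical in both cases; whichever operand size is not the segment's default pays for one extra byte.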

Besides, starting with the Pentium processor, there are two pipelines for executing instructions in parallel. To use them, instructions must belong to what Intel calls the "RISC nucleus": a subset of instructions that are no longer executed as a microprogram inside the CPU, but with hardwired logic. Guess what? Prefixed instructions, and code executing in a 16-bit segment using 16-bit registers, don't belong to this group and therefore cannot execute in parallel with another instruction. When a prefixed instruction enters one of the pipelines, the other is stalled, which hurts the performance of the CPU.

score 1 · Answer 2 · answered Nov 22 '13 at 20:25

About "programs rotate more slowly": well, programs don't "rotate", they "are executed". If you are talking about the bit rotation instruction, then it happens that the 8086 has two versions of it: one that takes an immediate operand specifying the number of bits to rotate, and another that takes the count from a register (CL, the low byte of CX/ECX).

The thing is that the 8086 doesn't allow any immediate value other than 1 (though the count in CL can be greater than 1). The 80186 and later processors accept any 8-bit immediate count. Besides, processors from the 80286 onward use only the lower 5 bits of the count operand, so the operation never exceeds 31 steps (it's pointless to rotate a 32-bit register more than 31 times). The 8086 doesn't impose this limit and can therefore spend more time on the operation.

I don't really know if this is what your book means by "rotating more slowly". I recall that the rotate operation can only be performed in one of the pipelines, not both, so two consecutive rotate instructions cannot be paired.
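The effect of masking the count can be shown with a small Python model. This is a sketch under a stated assumption: it treats the rotate as one loop iteration per bit position and counts iterations, which is a simplification and not a measurement of real cycle counts. `rol16` and `mask_count` are hypothetical names.

```python
def rol16(value: int, count: int, mask_count: bool = True) -> tuple[int, int]:
    """Rotate a 16-bit value left; return (result, number of single-bit steps).

    mask_count=True models processors that keep only the low 5 bits of the
    count; mask_count=False models the 8086, which uses the count as given.
    Assumption: one loop iteration per bit rotated (simplified model).
    """
    value &= 0xFFFF
    steps = (count & 0x1F) if mask_count else count
    for _ in range(steps):
        # Move the top bit around to the bottom.
        value = ((value << 1) | (value >> 15)) & 0xFFFF
    return value, steps


masked_result, masked_steps = rol16(0x1234, 255, mask_count=True)
raw_result, raw_steps = rol16(0x1234, 255, mask_count=False)
print(hex(masked_result), masked_steps)
print(hex(raw_result), raw_steps)
```

Both calls yield the same rotated value, because 32 is a multiple of the 16-bit operand width, but the masked version gets there in 31 steps where the unmasked one takes 255.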

score 0 · Answer 3 · edited May 23 '17 at 10:31

I'm not sure what you mean by rotate (the assembly operations?), but in general there could be several factors here:

CPU vendors don't put much effort into optimizing old legacy modes and ISA subsets. x87 is a good example: anything that doesn't really require that level of precision is better off using SSE/AVX for performance-critical tasks, and not just because of vectorization.

Every time the x86 vendors increased the register size, they kept the old register set and just added new names for the longer versions. Compatibility demands that old operations still work on the same registers, so you can now write to ah/al, ax, eax and rax in the same program. In some of these cases (namely the 8-bit and 16-bit partial registers), compatibility requires the CPU to keep the upper part of the register intact when only the lower part is written. That implicitly introduces a merge operation, which may cause slowdowns. Worse, it can introduce false dependencies, since each write to the 16-bit register has to merge in the upper part left over from earlier operations.

See also: "Why do most x64 instructions zero the upper part of a 32 bit register"
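The difference between a merging partial write and a zero-extending write can be modelled in Python. This is a sketch of the architectural result only, with hypothetical function names; it says nothing about how a given microarchitecture implements the merge.

```python
MASK64 = (1 << 64) - 1


def write_al(old_rax: int, value: int) -> int:
    """8-bit write: the upper 56 bits of the old RAX must be merged in."""
    return (old_rax & ~0xFF & MASK64) | (value & 0xFF)


def write_ax(old_rax: int, value: int) -> int:
    """16-bit write: the upper 48 bits of the old RAX must be merged in."""
    return (old_rax & ~0xFFFF & MASK64) | (value & 0xFFFF)


def write_eax(old_rax: int, value: int) -> int:
    """32-bit write in 64-bit mode: the upper 32 bits are zeroed.

    old_rax is accepted only to keep the signatures uniform; the result
    does not depend on it, so there is nothing to merge.
    """
    return value & 0xFFFFFFFF


rax = 0x1122334455667788
print(hex(write_al(rax, 0xEF)))     # upper bytes preserved
print(hex(write_ax(rax, 0xABCD)))   # upper bytes preserved
print(hex(write_eax(rax, 0xABCD)))  # upper half cleared
```

The 8-bit and 16-bit writes need the previous register value as an input, which is where the false dependency comes from; the 32-bit write does not.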
