About 16-bit programs running slower on 32-bit systems: I can tell you something about that.
When Intel went from 16 bits to 32 bits, they had to extend the instruction set to cope with the new 32-bit registers while keeping binary compatibility with existing 16-bit programs.
To accomplish that, they added a prefix byte, 66h if I remember correctly (the operand-size prefix), which, when placed in front of an instruction that uses 16-bit registers, makes that instruction use 32-bit registers instead.
For instance, a 16-bit instruction like MOV AX,BX, prefixed with 66h, turns into MOV EAX,EBX.
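To make the prefix concrete, here is a small Python sketch that builds the machine-code bytes by hand. It assumes the `89 /r` form of MOV (an assembler may just as well pick the `8B /r` form), and the helper names `modrm_reg_reg` and `encode_mov_reg_reg` are made up for this illustration.

```python
# Register numbers as used in the ModRM byte; the same number names
# AX or EAX depending on the operand size in effect.
REG = {"AX": 0, "CX": 1, "DX": 2, "BX": 3, "SP": 4, "BP": 5, "SI": 6, "DI": 7}

OPERAND_SIZE_PREFIX = 0x66
MOV_RM_R = 0x89  # MOV r/m, r (16- or 32-bit operands, depending on segment/prefix)


def modrm_reg_reg(dst, src):
    """ModRM byte for a register-to-register operation: mod=11, reg=src, rm=dst."""
    return 0xC0 | (REG[src] << 3) | REG[dst]


def encode_mov_reg_reg(dst, src, prefixed=False):
    """Encode MOV dst,src (hypothetical helper); optionally prepend the 66h prefix."""
    body = bytes([MOV_RM_R, modrm_reg_reg(dst, src)])
    return (bytes([OPERAND_SIZE_PREFIX]) if prefixed else b"") + body


plain = encode_mov_reg_reg("AX", "BX")                # MOV AX,BX in a 16-bit segment
wide = encode_mov_reg_reg("AX", "BX", prefixed=True)  # same bytes plus 66h: MOV EAX,EBX
print(plain.hex(" "))  # 89 d8
print(wide.hex(" "))   # 66 89 d8
```

The only difference between the two encodings is the leading 66h byte, which is exactly the extra byte the CPU has to fetch.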
But this imposes a penalty on the new 32-bit instructions, because the prefix is one more byte to fetch and decode, so each of them needs at least an extra fetch cycle to execute. To get rid of that penalty, Intel created the so-called 32-bit segments and 16-bit segments.
Basically, any piece of code must reside in a code segment. Before the 80386, all segments held 16-bit code, and all instructions were assumed to use 16-bit registers.
Intel's 32-bit segments contain code as well, but there every instruction is assumed to use 32-bit registers, so the opcode of MOV EAX,EBX in a 32-bit segment is the same as the opcode of MOV AX,BX in a 16-bit segment.
This means a program no longer has to put the 66h prefix on every 32-bit instruction. The penalty is gone.
But... what if I have to use 16-bit registers in a program that lives in a 32-bit segment? Then those instructions using 16-bit registers are the ones that need the 66h prefix.
So: instructions that use 16-bit registers are unprefixed in 16-bit segments and prefixed in 32-bit segments. Instructions that use 32-bit registers are unprefixed in 32-bit segments and prefixed in 16-bit segments.
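That rule boils down to "the prefix flips the segment's default". A minimal sketch in Python, where the function names `effective_operand_bits` and `describe` are my own invention, and the decoder only understands the register-to-register MOV encoded as `89 /r` (an assumption made to keep the example short):

```python
def effective_operand_bits(segment_bits, has_66_prefix):
    """Operand size actually used: the 66h prefix flips the segment's default."""
    if segment_bits not in (16, 32):
        raise ValueError("segment default must be 16 or 32")
    if has_66_prefix:
        return 32 if segment_bits == 16 else 16
    return segment_bits


def describe(code, segment_bits):
    """Decode a register-to-register MOV (opcode 89h, mod=11) as text."""
    names16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
    has_prefix = code[0] == 0x66
    opcode, modrm = code[1:3] if has_prefix else code[0:2]
    if opcode != 0x89 or modrm >> 6 != 0b11:
        raise ValueError("only MOV reg,reg (89 /r, mod=11) is handled here")
    bits = effective_operand_bits(segment_bits, has_prefix)
    widen = "E" if bits == 32 else ""          # AX becomes EAX with 32-bit operands
    dst = widen + names16[modrm & 7]           # r/m field is the destination
    src = widen + names16[(modrm >> 3) & 7]    # reg field is the source
    return f"MOV {dst},{src}"


for seg in (16, 32):
    for code in (bytes([0x89, 0xD8]), bytes([0x66, 0x89, 0xD8])):
        print(f"{seg}-bit segment, bytes {code.hex(' ')}: {describe(code, seg)}")
```

Running it shows the same two byte sequences swapping meaning between the two segment types: the unprefixed bytes give MOV AX,BX in a 16-bit segment and MOV EAX,EBX in a 32-bit one, and the prefixed bytes do the opposite.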
Besides, starting with the Pentium processor, we have two pipelines for executing instructions in parallel. To use both pipelines, the instructions entering them must belong to what Intel calls the "RISC nucleus": a subset of instructions that are no longer executed as microprograms inside the CPU, but by hardwired logic. Guess what? Prefixed instructions, and code executing in a 16-bit segment using 16-bit registers, don't belong to this group and therefore cannot execute in parallel with another instruction. When a prefixed instruction enters one of the pipelines, the other one is stalled, which hurts the performance of the CPU.