Oops, I misread the question. I was answering "which size of operations is most efficient in 64bit mode". See below for that answer. >.<
There aren't any CPUs where it's not worth using 64bit mode, if the CPU supports it at all. Atom/Silvermont might be on the edge, since they can slow down when too many prefix bytes are needed on an instruction, and REX counts. (So do the required prefix-bytes that are really part of the opcodes for SSE instructions.) As I understand it, 64bit is still a net win for them, but possibly not as big a win.
Low-memory systems can sometimes do better with a 32bit OS than a 64bit OS. Some of that is that 64bit OSes still need to have copies of the 32bit libraries, so they can run 32 or 64bit programs. Windows especially will tend to have both 32 and 64bit processes running at all times, so both versions of many libraries will actually be in memory at once, not just on disk. I haven't measured to see if Linux or Windows is worse about using more memory at a bare desktop when going from 32 to 64bit, but at least a Linux desktop won't have any 32bit processes that can't share the same 32bit libraries everything else is using. This paragraph is way off topic for SO, sorry.
In practice, 32bit mode is saddled with a worse ABI, and can't assume SSE2 as baseline, so those factors count against 32bit code.
Even an ideal ABI in x86-32 code that assumed AVX2 support would be hampered by the register scarcity (7 general purpose regs not including the stack pointer, and only 8 vector regs). 64bit mode has 15GP and 16 vector regs, and a new RIP-relative addressing mode mostly removes the overhead of making position-independent (library) code. The extra regs and better ABI are usually quoted as being worth about 15% performance.
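To illustrate the RIP-relative point, here's a minimal sketch (NASM syntax; the `counter`/`inc_counter` names are made up) of what position-independent data access looks like in each mode:

```nasm
; 64-bit PIC: one RIP-relative instruction, no extra setup.
default rel
section .data
counter: dd 0
section .text
global inc_counter
inc_counter:
    inc dword [counter]    ; addressed relative to RIP
    ret
; 32-bit PIC has no RIP-relative mode; compilers typically materialize EIP
; through a call and index the global offset table, roughly:
;    call __x86.get_pc_thunk.bx
;    add  ebx, _GLOBAL_OFFSET_TABLE_
;    inc  dword [ebx + counter@GOTOFF]
```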
These factors apply specifically to x86-32 vs. x86-64, not to 32 vs. 64bit in general (like for PowerPC or SPARC: on those systems it's common for simple programs (like `ls`) to be 32bit). Only programs that might need more than 4GiB of address space benefit enough from being 64bit to justify the burden of pointers that are twice as large. 64bit ARM has some design improvements over 32bit ARM, but AFAIK nowhere near the leap that x86 got from AMD64.
To put it another way: what makes x86-64 good is mostly not the widening of each register to 64b, it's the other architectural improvements and the chance to make a partial break with many years of backwards compatibility (esp. in software standards). The insn set improvements could have been better, but AMD probably wanted to keep decoding as similar as possible to share transistors. They could have deprecated more of the useless instructions and added new ones. A `setcc r/m32` would be really nice, and could have used two of the removed BCD opcodes. A `cmovcc r, imm32` would be neat, too. Two opcodes each would do it, combined with a 3bit field in the mod/rm byte, to give the 4 bits needed to encode all 16 cc conditions. Redefining the shift instructions to always write flags, instead of conditionally leaving flags unchanged depending on the shift count, would have made them cheaper, but again would have required more transistors because 32bit mode still has to be fast. So it's nowhere near a clean break with the x86 ISA's cruft, but that's not a major obstacle to high performance in modern chips.
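For context on why a `setcc r/m32` would be nice, here's a sketch (NASM syntax; the `is_less` function name is made up) of the idiom compilers use today: booleanizing a flag into a full register takes an extra zeroing instruction because `setcc` only writes an 8bit register.

```nasm
; int is_less(int a, int b)  - SysV x86-64: a in EDI, b in ESI
global is_less
is_less:
    xor   eax, eax     ; zero the full register first (off the flag critical path)
    cmp   edi, esi
    setl  al           ; setl only writes AL; the earlier xor supplied the upper bits
    ret
; A hypothetical setl r/m32 would make the xor unnecessary.
```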
Linux's x32 ABI is an attempt to provide the speedups of a modern ABI and 64bit-mode without the burden of 64bit pointers. It's a big win in code with pointer-heavy data structures. (Note that even though RAM is cheap, cache is not, so smaller data structures matter.)
64bit mode (including x32) allows much more efficient copying and computations with 64bit integers.
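As a quick sketch of that (NASM syntax, made-up `add64`/`a`/`b`/`c` names): a 64bit addition is a single `add` in 64bit mode, but an `add`/`adc` pair with twice the memory traffic in 32bit code.

```nasm
section .bss
a: resq 1
b: resq 1
c: resq 1
section .text
global add64
add64:                       ; c = a + b, in 64-bit mode
    mov rax, [rel a]
    add rax, [rel b]
    mov [rel c], rax
    ret
; The same addition in 32-bit code:
;   mov eax, [a]       ; low halves
;   mov edx, [a+4]     ; high halves
;   add eax, [b]
;   adc edx, [b+4]     ; carry propagates into the high half
;   mov [c], eax
;   mov [c+4], edx
```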
Anything that works with file sizes needs 64bit math. A lot of stuff uses 64bit integers these days, because they're the new "big enough and everyone supports them efficiently" size. File sizes had to be 64bit even back when 32bit systems were the norm, and now 64bit time values are replacing 32bit seconds-since-the-epoch, and so on. (That migration has to finish before 2038 to avoid 32bit wraparound.)
16bit mode isn't useful for anything in practice, but as I understand it modern CPUs still blaze along at full speed in 16bit mode. You're more likely to run into partial-register stalls in 16bit code, since it often uses byte registers. 16bit code for 386 also uses 32bit registers sometimes, producing more stalls (and probably length-changing prefixes for immediates bigger than 8b).
16-bit real mode running natively on a CPU can't use paging, so you never have TLB misses. (Running 16-bit code in virtual-8086 mode or 16-bit protected mode under a normal 32-bit OS will have paging enabled, though. Or even in real mode inside a VM.)
You can leave paging disabled in 32-bit protected mode, too, so this isn't really an advantage of 16-bit code. But 64-bit long mode requires paging to be enabled. You can map all of memory with a few 1GB hugepages so you'll have very few TLB misses, though.
Virtual memory / memory protection isn't something most people, especially developers, want to do without! So again, this isn't a practical advantage for 16-bit code.
## Previous answer: which operand-sizes are most efficient
32bit operand size is the fastest in 64bit code. There's a code-size advantage to using 32bit variables (except when an extra insn is needed to sign-extend array indices to 64bit so they can be used in addressing modes with pointers). 64bit is also cheap, but 16b and 8b can get ugly and be much worse than just the code-size difference.
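For example, here's roughly what a compiler does for `int get(int *arr, int idx)` in the x86-64 SysV calling convention (a sketch; the function name is made up):

```nasm
global get
get:
    movsxd rsi, esi            ; sign-extend the signed 32-bit index for the addressing mode
    mov    eax, [rdi + rsi*4]  ; load arr[idx]
    ret
; With an unsigned or known-non-negative index, a plain write to ESI already
; zero-extends into RSI, so the extra instruction can often be avoided.
```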
The same opcode is used for 16, 32, and 64bit operand sizes, with either an operand-size `0x66` prefix, no prefix, or a `REX` prefix with its W field set (aka `REX.W`). 8bit insns have separate opcodes, so they have the same code-size advantage.
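Concretely (a sketch; the byte sequences are one valid encoding for each, roughly what NASM emits), the same `add` opcode gets reused across operand sizes:

```nasm
add ax,  cx    ; 66 01 C8  - 0x66 operand-size prefix selects 16-bit
add eax, ecx   ;    01 C8  - 32-bit is the default operand size, no prefix
add rax, rcx   ; 48 01 C8  - REX.W selects 64-bit
add al,  cl    ;    00 C8  - 8-bit uses its own opcode, also prefix-free
```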
Other than that, usually all choices of operand size decode to the same number of uops (1 for most insns), with the same latency and throughput. Division is the major exception. 64bit integer division (128b/64b -> 64b) is slower even on current CPUs (esp. Intel's). Multiply is also different with different operand-sizes, esp. the one-operand N*N->2N bit form. e.g. Skylake:
- `mul r8`: 1 uop, 3c latency (only one output register: AX=AL*src)
- `mul r16`: 4 uops, 4c latency.
- `mul r32`: 3 uops, 4c latency.
- `mul r64`: 2 uops, 3c latency.
The results of 1-operand `mul` go in [E/R]DX:[E/R]AX, so maybe the outputs of the multipliers are wired up in a way that requires an extra uop to split the halves of a 64bit output into two regs. Even the 2 and 3 operand forms of `imul` (like `imul r16, r/m16, imm8`) take an extra uop at 16bit.
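A quick sketch (NASM syntax) of the widening one-operand form next to the two-operand form, to show where the split-into-two-registers output comes from:

```nasm
mov  rax, 0x123456789abcdef0
mov  rbx, 1000
mul  rbx            ; one-operand form: RDX:RAX = RAX * RBX (full 128-bit product)
; imul rax, rbx     ; two-operand form: keeps only the low 64 bits, one output reg
```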
If you poke around in Agner Fog's instruction tables (search for "r32" or "r64"), you'll find other examples of things that are faster with one operand size. e.g. on Silvermont: `shld r32, r32, imm` is 1 uop, 2c latency. At 16 and 64bit operand sizes, it's 10 uops, with 10c latency. That's a really extreme case, and shows that they only did the wiring for getting bits out at the top of 32b. (Or something; I'm not a HW designer!)
Some early 64bit-capable CPUs had some limitations in 64bit mode. e.g. Core2 (Intel's 64bit P6-family design) can only macro-fuse compare-and-branch in 32bit mode. That applies regardless of operand-size though, and depends on the mode.
64bit mode was really "bolted on" in P4, where `shl r32, imm` is 1c latency, but `shl r64, imm` is 7c latency: even some simple execution units were still only 32b wide. IIRC, that wasn't a problem for K8 Opteron. 64bit CPUs run 32bit code natively as well, even when the OS is 64bit (unlike IA-64, which had either slow ia32 hardware or pure emulation). Probably what you heard was a garbled 3rd-hand version of that. Although, as Paul Clayton points out, the slow x86 hardware on early Itanics sort of counts as "native".
8 and 16bit operand sizes tend to create partial-register stalls on Intel CPUs (pre IvB). Writing an 8b or 16b register doesn't clear the upper bits, so there's a dependency on the previous contents of the full register. Some CPUs just make such insns wait for the full reg to be ready. Intel P6 was designed back when 16bit code was still relevant (PPro was released in Nov 1995, so design obviously started before that; even Win95 still had significant amounts of 16bit code, I think). This may be why Intel P6 (and later SnB-family) does register renaming on the 8 and 16b partial registers. A read of a wider reg after a write of a partial reg causes a stall (or just insertion of a merging uop on SnB-family). Or on Haswell and later, no penalty at all: all the benefit of no false dependencies, but no penalty even for writing a reg like `ah` and then reading `eax`. (IvB had no penalty for cases other than the high8 registers.)
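A minimal sketch of the pattern being described (NASM syntax), plus the `movzx` idiom that avoids it:

```nasm
mov   al, [rsi]      ; writes only AL; bits 8-63 of RAX keep their old value
add   eax, ecx       ; reads the full EAX, so the old upper bits must be merged in
                     ; (a stall, a merging uop, or free, depending on the uarch)
; movzx writes the whole register, so there's nothing to merge:
;   movzx eax, byte [rsi]
;   add   eax, ecx
```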
This isn't a problem with mixing 32 and 64bit, because any write to a 32b register zeros the upper 32 of the full 64b register. This nicely avoids the false-dependency issue. When you do need to merge two 32b halves into a 64b reg, you can just AND/OR, or use `shld`.
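For instance (a sketch, NASM syntax), building a 64bit value in RAX from a high half in EDI and a low half in ESI:

```nasm
mov  eax, esi      ; writing EAX zero-extends into RAX: upper 32 bits are zero
shl  rdi, 32       ; EDI was also zero-extended when written, so this is clean
or   rax, rdi      ; RAX = (high << 32) | low, with no partial-register issues
```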
16bit instructions with 16bit-immediate operands (like `add ax, 1024`) also cause decoding stalls. The operand-size prefix changes the length of the rest of the instruction (from `add r, imm32` to `add r, imm16`), and Intel decoders don't like that.
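For example (a sketch; the bytes shown are one valid encoding per instruction), the imm16 form is the one that triggers the length-changing-prefix (LCP) stall, while the imm8 and 32bit forms don't:

```nasm
add  ax, 1024    ; 66 05 00 04     - imm16 after a 66 prefix: LCP stall in decode
add  eax, 1024   ; 05 00 04 00 00  - 32-bit form, no LCP
add  ax, 4       ; 66 83 C0 04     - sign-extended imm8 doesn't change length, no LCP
```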