Oops, I misread the question. I was answering "which size of operations is most efficient in 64bit mode". See below for that answer. >.<
There aren't any CPUs where it's not worth using 64bit mode, if the CPU supports it at all. Atom/Silvermont might be on the edge, since they can slow down when too many prefix bytes are needed on an instruction, and REX counts. (So do the required prefix-bytes that are really part of the opcodes for SSE instructions.) As I understand it, 64bit is still a net win for them, but possibly not as big a win.
Low-memory systems can sometimes do better with a 32bit OS than a 64bit OS. Some of that is that 64bit OSes still need to have copies of the 32bit libraries, so they can run 32 or 64bit programs. Windows especially will tend to have both 32 and 64bit processes running at all times, so both versions of many libraries will actually be in memory at once, not just on disk. I haven't measured to see if Linux or Windows is worse about using more memory at a bare desktop when going from 32 to 64bit, but at least a Linux desktop won't have any 32bit processes that can't share the same 32bit libraries everything else is using. This paragraph is way off topic for SO, sorry.
In practice, 32bit mode is saddled with a worse ABI, and can't assume SSE2 as baseline, so those factors count against 32bit code.
Even an ideal ABI in x86-32 code that assumed AVX2 support would be hampered by the register scarcity (7 general purpose regs not including the stack pointer, and only 8 vector regs). 64bit mode has 15GP and 16 vector regs, and a new RIP-relative addressing mode mostly removes the overhead of making position-independent (library) code. The extra regs and better ABI are usually quoted as being worth about 15% performance.
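To illustrate the RIP-relative point, here's a minimal sketch (NASM syntax; the `counter`/`inc_counter` names are made up) of what position-independent data access looks like in each mode:

```nasm
; 64-bit PIC: one RIP-relative instruction, no extra setup.
default rel
section .data
counter: dd 0
section .text
global inc_counter
inc_counter:
    inc dword [counter]    ; addressed relative to RIP
    ret
; 32-bit PIC has no RIP-relative mode; compilers typically materialize EIP
; through a call and index the global offset table, roughly:
;    call __x86.get_pc_thunk.bx
;    add  ebx, _GLOBAL_OFFSET_TABLE_
;    inc  dword [ebx + counter@GOTOFF]
```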
These factors apply specifically to x86-32 vs. x86-64, not to 32 vs. 64bit in general (like for PowerPC or SPARC: on those systems it's common for simple programs (like `ls`) to be 32bit). Only programs that might need more than 4GiB of address space benefit enough from being 64bit to justify the burden of pointers that are twice as large. 64bit ARM has some design improvements over 32bit ARM, but AFAIK nowhere near the leap that x86 got from AMD64.
To put it another way: what makes x86-64 good is mostly not the widening of each register to 64b, it's the other architectural improvements and the chance to make a partial break with many years of backwards compatibility (esp. in software standards). The insn set improvements could have been better, but AMD probably wanted to keep decoding as similar as possible to share transistors. They could have deprecated more of the useless instructions and added new ones. A `setcc r/m32` would be really nice, and could have used two of the removed BCD opcodes. A `cmovcc r, imm32` would be neat, too. Two opcodes each would do it, combined with a 3bit field in the mod/rm byte, to give the 4 bits needed to encode all 16 cc conditions. Redefining the shift instructions to always write flags, instead of conditionally leaving flags unchanged depending on the shift count, would have made them cheaper, but again would have required more transistors because 32bit mode still has to be fast. So it's nowhere near a clean break with the x86 ISA's cruft, but that's not a major obstacle to high performance in modern chips.
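For context on why a `setcc r/m32` would be nice, here's a sketch (NASM syntax; the `is_less` function name is made up) of the idiom compilers use today: booleanizing a flag into a full register takes an extra zeroing instruction because `setcc` only writes an 8bit register.

```nasm
; int is_less(int a, int b)  - SysV x86-64: a in EDI, b in ESI
global is_less
is_less:
    xor   eax, eax     ; zero the full register first (off the flag critical path)
    cmp   edi, esi
    setl  al           ; setl only writes AL; the earlier xor supplied the upper bits
    ret
; A hypothetical setl r/m32 would make the xor unnecessary.
```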
Linux's x32 ABI is an attempt to provide the speedups of a modern ABI and 64bit-mode without the burden of 64bit pointers. It's a big win in code with pointer-heavy data structures. (Note that even though RAM is cheap, cache is not, so smaller data structures matter.)
64bit mode (including x32) allows much more efficient copying and computations with 64bit integers.
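As a quick sketch of that (NASM syntax, made-up `add64`/`a`/`b`/`c` names): a 64bit addition is a single `add` in 64bit mode, but an `add`/`adc` pair with twice the memory traffic in 32bit code.

```nasm
section .bss
a: resq 1
b: resq 1
c: resq 1
section .text
global add64
add64:                       ; c = a + b, in 64-bit mode
    mov rax, [rel a]
    add rax, [rel b]
    mov [rel c], rax
    ret
; The same addition in 32-bit code:
;   mov eax, [a]       ; low halves
;   mov edx, [a+4]     ; high halves
;   add eax, [b]
;   adc edx, [b+4]     ; carry propagates into the high half
;   mov [c], eax
;   mov [c+4], edx
```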
Anything that works with file sizes needs 64bit math. A lot of stuff uses 64bit integers these days, because they're the new "big enough and everyone supports them efficiently" size. File sizes had to be 64bit even back when 32bit systems were the norm, and now 64bit time values are replacing 32bit seconds-since-the-epoch, and so on. (That migration has to finish before 2038 to avoid 32bit wraparound.)
16bit mode isn't useful for anything in practice, but as I understand it modern CPUs still blaze along at full speed in 16bit mode. You're more likely to run into partial-register stalls in 16bit code, since it often uses byte registers. 16bit code for 386 also uses 32bit registers sometimes, producing more stalls (and probably length-changing prefixes for immediates bigger than 8b).
16-bit real mode running natively on a CPU can't use paging, so you never have TLB misses. (Running 16-bit code in virtual-8086 mode or 16-bit protected mode under a normal 32-bit OS will have paging enabled, though. Or even in real mode inside a VM.)
You can leave paging disabled in 32-bit protected mode, too, so this isn't really an advantage of 16-bit code. But 64-bit long mode requires paging to be enabled. You can map all of memory with a few 1GB hugepages so you'll have very few TLB misses, though.
Virtual memory / memory protection isn't something most people, especially developers, want to do without! So again, this isn't a practical advantage for 16-bit code.
## Previous answer: which operand-sizes are most efficient
32bit operand size is the fastest in 64bit code. There's a code-size advantage to using 32bit variables (except when an extra insn is needed to sign-extend array indices to 64bit so they can be used in addressing modes with pointers). 64bit is also cheap, but 16b and 8b can get ugly and be much worse than just the code-size difference.
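For example, here's roughly what a compiler does for `int get(int *arr, int idx)` in the x86-64 SysV calling convention (a sketch; the function name is made up):

```nasm
global get
get:
    movsxd rsi, esi            ; sign-extend the signed 32-bit index for the addressing mode
    mov    eax, [rdi + rsi*4]  ; load arr[idx]
    ret
; With an unsigned or known-non-negative index, a plain write to ESI already
; zero-extends into RSI, so the extra instruction can often be avoided.
```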
The same opcode is used for 16, 32, and 64bit operand sizes, with either an operand-size `0x66` prefix, no prefix, or a `REX` prefix with its W field set (aka `REX.W`). 8bit insns have separate opcodes, so they have the same code-size advantage.
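Concretely (a sketch; the byte sequences are one valid encoding for each, roughly what NASM emits), the same `add` opcode gets reused across operand sizes:

```nasm
add ax,  cx    ; 66 01 C8  - 0x66 operand-size prefix selects 16-bit
add eax, ecx   ;    01 C8  - 32-bit is the default operand size, no prefix
add rax, rcx   ; 48 01 C8  - REX.W selects 64-bit
add al,  cl    ;    00 C8  - 8-bit uses its own opcode, also prefix-free
```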
Other than that, usually all choices of operand size decode to the same number of uops (1 for most insns), with the same latency and throughput. Division is the major exception. 64bit integer division (128b/64b -> 64b) is slower even on current CPUs (esp. Intel's). Multiply is also different with different operand-sizes, esp. the one-operand N*N->2N bit form. e.g. Skylake:
- `mul r8`: 1 uop, 3c latency (only one output register: AX=AL*src)
- `mul r16`: 4 uops, 4c latency.
- `mul r32`: 3 uops, 4c latency.
- `mul r64`: 2 uops, 3c latency.
The results of 1-operand `mul` go in [E/R]DX:[E/R]AX, so maybe the outputs of the multipliers are wired up in a way that requires an extra uop to split the halves of a 64bit output into two regs. Even the 2 and 3 operand forms of `imul` (like `imul r16, r/m16, imm8`) take an extra uop at 16bit.
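A quick sketch (NASM syntax) of the widening one-operand form next to the two-operand form, to show where the split-into-two-registers output comes from:

```nasm
mov  rax, 0x123456789abcdef0
mov  rbx, 1000
mul  rbx            ; one-operand form: RDX:RAX = RAX * RBX (full 128-bit product)
; imul rax, rbx     ; two-operand form: keeps only the low 64 bits, one output reg
```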
If you poke around in Agner Fog's instruction tables (search for "r32" or "r64"), you'll find other examples of things that are faster with one operand size. e.g. on Silvermont: `shld r32, r32, imm` is 1 uop, 2c latency. At 16 and 64bit operand sizes, it's 10 uops, with 10c latency. That's a really extreme case, and shows that they only did the wiring for getting bits out at the top of 32b. (Or something; I'm not a HW designer!)
Some early 64bit-capable CPUs had some limitations in 64bit mode. e.g. Core2 (Intel's 64bit P6-family design) can only macro-fuse compare-and-branch in 32bit mode. That applies regardless of operand-size though, and depends on the mode.
64bit mode was really "bolted on" in P4, where `shl r32, imm` is 1c latency, but `shl r64, imm` is 7c latency: even some simple execution units were still only 32b wide. IIRC, that wasn't a problem for K8 Opteron. 64bit CPUs run 32bit code natively as well, even when the OS is 64bit (unlike IA-64, which had either slow ia32 hardware or pure emulation). Probably what you heard was a garbled 3rd-hand version of that. Although, as Paul Clayton points out, the slow x86 hardware on early Itanics sort of counts as "native".
8 and 16bit operand sizes tend to create partial-register stalls on Intel CPUs (pre IvB). Writing an 8b or 16b register doesn't clear the upper bits, so there's a dependency on the previous contents of the full register. Some CPUs just make such insns wait for the full reg to be ready. Intel P6 was designed back when 16bit code was still relevant (PPro was released in Nov 1995, so design obviously started before that; even Win95 still had significant amounts of 16bit code, I think). This may be why Intel P6 (and later SnB-family) does register renaming on the 8 and 16b partial registers. A read of a wider reg after a write of a partial reg causes a stall (or just insertion of a merging uop on SnB-family). Or on Haswell and later, no penalty at all: all the benefit of no false dependencies, but no penalty even for writing a reg like `ah` and then reading `eax`. (IvB had no penalty for cases other than the high8 registers.)
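A minimal sketch of the pattern being described (NASM syntax), plus the `movzx` idiom that avoids it:

```nasm
mov   al, [rsi]      ; writes only AL; bits 8-63 of RAX keep their old value
add   eax, ecx       ; reads the full EAX, so the old upper bits must be merged in
                     ; (a stall, a merging uop, or free, depending on the uarch)
; movzx writes the whole register, so there's nothing to merge:
;   movzx eax, byte [rsi]
;   add   eax, ecx
```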
This isn't a problem with mixing 32 and 64bit, because any write to a 32b register zeros the upper 32 of the full 64b register. This nicely avoids the false-dependency issue. When you do need to merge two 32b halves into a 64b reg, you can just AND/OR, or use `shld`.
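For instance (a sketch, NASM syntax), building a 64bit value in RAX from a high half in EDI and a low half in ESI:

```nasm
mov  eax, esi      ; writing EAX zero-extends into RAX: upper 32 bits are zero
shl  rdi, 32       ; EDI was also zero-extended when written, so this is clean
or   rax, rdi      ; RAX = (high << 32) | low, with no partial-register issues
```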
16bit instructions with 16bit-immediate operands (like `add ax, 1024`) also cause decoding stalls. The operand-size prefix changes the length of the rest of the instruction (from `add r, imm32` to `add r, imm16`), and Intel decoders don't like that.
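For example (a sketch; the bytes shown are one valid encoding per instruction), the imm16 form is the one that triggers the length-changing-prefix (LCP) stall, while the imm8 and 32bit forms don't:

```nasm
add  ax, 1024    ; 66 05 00 04     - imm16 after a 66 prefix: LCP stall in decode
add  eax, 1024   ; 05 00 04 00 00  - 32-bit form, no LCP
add  ax, 4       ; 66 83 C0 04     - sign-extended imm8 doesn't change length, no LCP
```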