x86-32 (aka IA-32) is a 32-bit extension to 16-bit 8086, which was designed for easy porting of asm source from 8-bit 8080 to 8086. (Why are first four x86 GPRs named in such unintuitive order? on retrocomputing).
This history is why modern x86 has so much partial-register stuff, with direct support for 8 and 16-bit operand-size.
Most other architectures with 32-bit registers only allow narrow loads/stores, with ALU operations being only full register width. (But that's as much because they're RISC architectures (MIPS, SPARC, and even the slightly less-RISCy ARM), while x86 is definitely a CISC architecture.)
The 64-bit extensions to RISC architectures like MIPS still support 32-bit operations, usually implicitly zero-extending 32-bit results into the "full" registers the same way x86-64 does. (Especially if 64-bit isn't a new mode, but rather just new opcodes within the same mode, with semantics designed so that existing machine code will run the same way when addressing modes use the full registers but all the legacy opcodes still only write the low 32 bits.)
So the situation you observe on x86-32 (with narrow operations on partial registers being supported) is present in all architectures that exist as a wider extension to an older architecture, whether it runs in a new mode (where machine code decodes differently) or not. It's just that x86 ancestry goes back to 16-bit within x86, and back to 8-bit as an influence on 8086.
Motorola 68000 has 32-bit registers, but according to Wikipedia "the main ALU" is only 16-bit. (Maybe some 32-bit operations are slower or microcoded, but 32-bit add/and instructions are definitely supported. I don't know the details behind why Wikipedia says that.)
Originally 68000 was designed to work with a 16-bit external bus, so 16-bit loads/stores were more efficient on those early CPUs. I think later 68k CPUs widened the data buses, making 32-bit load/store as fast as 16-bit. Anyway, I think m68k is another example of a 32-bit architecture that supports a lot of 16-bit operations. Wikipedia describes it as "a 16/32-bit CISC microprocessor".
With the addition of caches, twice as many 16-bit integers fit in a cache line as 32-bit integers, so for sequential access 16-bit only costs half as much average / sustained memory bandwidth. Talking about "bus width" gets more complicated when there's cache, so there's a bus or internal data path between the load/store units and cache, and between cache and memory. (And in a multi-level cache, between different levels of cache).
Deciding whether to call an architecture (or a specific implementation of that architecture) 8 / 16 / 32 / 64-bit is pretty arbitrary. The marketing department is likely to pick the widest thing they can justify, and use that in descriptions of the CPU. That might be a data bus or a register width, or address space or anything else. (Many 8-bit CPUs form 16-bit addresses by concatenating two 8-bit registers, although most of them don't try to claim to be 16-bit. They might be advertised as 8/16-bit, though.)
32-bit x86 is considered 32-bit because that's the maximum width of a pointer, or a "general purpose" integer register. 386 added several major new things: 32-bit integer registers / operand size (accessible with prefixes from real mode), and 32-bit protected mode with virtual memory paging, where the default address and operand sizes are 32 bits.
The physical CPUs that can run IA-32 machine code today have vastly wider buses and better memory bandwidth than the first generation 386SX CPUs, but they still support the same IA-32 architecture (plus extensions).
These days, essentially all new x86 CPUs can also run in x86-64 mode. When running in IA-32 mode, a modern x86 CPU will only be using the low 32 bits of its 64-bit physical integer registers (like for instructions that use 32-bit operand-size in 32-bit or 16-bit mode).
But besides the integer registers, there are the 80-bit x87 registers (which can be used as 64-bit integer-SIMD MMX registers), and also the XMM / YMM / ZMM registers (SSE / AVX / AVX512).
SSE2 is baseline for x86-64, and can be assumed in most 32-bit code these days, so at least 128-bit registers are available, and can be used for 64-bit integer add/sub/shift even in 32-bit mode with instructions like paddq.
Modern CPUs also have at least 128-bit connections between the vector load/store units and cache, so load/store/copy bandwidth when data fits in L1d cache is not limited by the external dual/triple/quad-channel DDR3/DDR4 DRAM controllers (which do burst transfers of 8x 64-bits = one 64-byte cache line over 64-bit external buses).
Instead, CPUs have large fast caches, including a shared L3 cache so data written by one core and read by another doesn't usually have to go through memory if it's still hot in L3. See some details on how cache can be that fast for Intel IvyBridge, which has only 128-bit load/store paths even though it supports 256-bit AVX instructions. Haswell widened the load/store paths to 256-bit as well. Skylake-AVX512 widened the registers and data paths to 512-bit for L1d cache, and the connection between L1d and L2.
But on paper, x86 (since P5 Pentium and later) only guarantees that aligned loads/stores up to 64 bits are atomic, so implementations with SSE are allowed to split 128-bit XMM loads/stores into two 64-bit halves. Pentium III and Pentium M actually did this. But note that i586 Pentium predated x86-64 by a decade, and the only way it could load/store 64 bits was with x87 fld or fild. Pentium MMX could do 64-bit MMX movq loads/stores. Anyway, this atomicity guarantee includes uncached stores (e.g. for MMIO), which was possible (cheaply, without a bus-lock) because the P5 microarchitecture has a 64-bit external bus, even though it's strictly 32-bit other than the FPU.
Even pure integer code benefits from the wide data paths because it increases bandwidth for integer code with loads/stores that hit in L3 or especially L2 cache, but not L1d cache.
All these SIMD extensions to x86 make it vastly more powerful than a purely 32-bit integer architecture. But when running in 32-bit mode, it's still the same mode introduced by 386, and we call it 32-bit mode. It's as good a name as any, but don't try to read too much into it.
In fact, don't read anything into it except for integer / pointer register widths. The hardware it runs on typically has 64-bit integer registers, and 48-bit virtual address space. And data buses + caches of various huge widths, and complex out-of-order machinery to give the illusion of running in-order while actually looking at a window of up to 224 uops to find instruction-level parallelism. (Skylake / Kaby Lake / Coffee Lake ROB size).