3

In x86 Assembler, given that you have

  • Immediate addressing mode for constant values
  • Register addressing mode for registers
  • Direct addressing mode for memory addresses,

why do you need Indexed and Base Pointer addressing modes? Each could be replaced by a loop as far as I know.

Also Indirect mode doesn't seem to be overly useful either, since you can simply use Direct mode instead to reference the memory address. What is the purpose of first accessing a register which then contains a pointer to a memory address?

In short, which addressing modes are really necessary?

Peter Cordes
AdHominem
  • There are many things that can be replaced with other things. Why do you need `inc` or `sub` if you have `add`? Indirect mode, however, is not easily replaced unless you use self-modifying code. Try to implement pointers without it. – Jester Feb 05 '16 at 11:28

2 Answers

8

Although in theory 'addressing mode' can also be used to refer to the operand type, that usage is a bit confusing, since no address is involved. The Intel manual uses 'addressing mode' to refer to memory addressing only, and I will use that definition.

In assembly, an operand can be:

  • An immediate value
  • A register
  • A value in memory (the operand here is the address)
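For example (NASM syntax; the values, register, and address are arbitrary):

    mov eax, 1234        ; immediate operand: the value is encoded in the instruction
    mov eax, ebx         ; register operand
    mov eax, [0x1000]    ; memory operand: the address 0x1000 is what's encoded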

In the x86 architecture, the "addressing mode" is only for the last type of operand: memory operands (addresses), and refers to the methods available to calculate the addresses. All the addressing modes can be summarized in a single configurable form:

    address = REG_base + REG_index*n + offset

REG_base, REG_index, n and offset are all configurable, and can all be omitted (but you need at least one, obviously).

  • address = offset is called direct or absolute addressing (the offset is encoded in the instruction just like an immediate).
  • address = REG_base is called register indirect addressing.
  • address = REG_base + REG_index is called base plus index addressing.
  • Similarly, you can add an offset (offset) and a scale (n) to form the remaining combinations.
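For illustration, here is each form as a concrete instruction (NASM syntax; the registers and numbers are arbitrary):

    mov eax, [0x1000]           ; direct/absolute: address = offset
    mov eax, [rbx]              ; register indirect: address = REG_base
    mov eax, [rbx + rsi]        ; base plus index
    mov eax, [rbx + rsi*4 + 8]  ; base + scaled index + offset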

Strictly speaking, you only need one mode to do everything: register indirect addressing (address = REG). With that, if you need to access memory, you can calculate any address you want in a register and use it to do the access. It can also replace direct register operands by using memory instead, and immediate operands by constructing values with arithmetic. However, for a practical instruction set, you would still want immediate operands to load addresses efficiently, and register operands are needed if you don't want pointer-only registers.
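As a minimal sketch of that idea (NASM syntax; immediate and register operands are kept for brevity, only the memory addressing is restricted), here is mov eax, [rbx + rsi*4 + 8] rewritten to use register indirect as the only memory addressing mode:

    mov rcx, rsi        ; copy the index into a scratch register
    shl rcx, 2          ; rcx = rsi*4 (apply the scale)
    add rcx, rbx        ; rcx = rbx + rsi*4 (add the base)
    add rcx, 8          ; rcx = rbx + rsi*4 + 8 (add the offset)
    mov eax, [rcx]      ; the one remaining mode: [register]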

All the other addressing modes besides register indirect are there for convenience, and they are indeed really convenient:

  • Direct addressing saves you a register if you just have to access a fixed variable in memory.
  • Base + offset is really useful for accessing object members: you can keep the base address in a register and access individual members with fixed offsets, with no intermediate calculation or extra register to hold the member address (see the sketch after this list).
  • Similarly, indexed addressing is used for accessing arrays: you can just change an index register to access any value in the array.
  • With a scale, you can index arrays of multi-byte elements (e.g. int) with no additional registers or calculations.
  • A combination of everything can be used to access an array member inside an object, while still preserving the base pointer for access to the object's other members.
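A short sketch of the member and array patterns above (NASM syntax; the struct layout is hypothetical, rbx holds the object's address, and rsi is an array index):

    ; assuming rbx points at struct { int32_t id; int32_t values[8]; }
    mov eax, [rbx]              ; id: base only
    mov eax, [rbx + 4]          ; values[0]: base + offset
    mov eax, [rbx + 4 + rsi*4]  ; values[rsi]: base + offset + index*scale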

These addressing modes don't require much calculation from the CPU: only additions and shifts. Considering x86 can do a multiplication every cycle, these operations are trivial, but still very convenient.

ElderBug
  • In the context of how computer architecture is typically taught, immediate and register are considered addressing modes (even though they do not address *memory*). Also, I do not think any x86 implementation has single-cycle *latency* for multiplication; x86 implementations typically pipeline multiplication so a new (independent) multiplication can be started each cycle, but there is a difference between doing a multiplication every cycle and doing a multiplication in a cycle. – Paul A. Clayton Feb 06 '16 at 01:27
  • @PaulA.Clayton Arguably, modes that don't involve addresses are not considered addressing modes. At least I think they shouldn't be. There are no addresses involved, and it creates misunderstandings like this question. And x86 does multiplication in a single cycle; the other cycles for instruction read, loads and stores aren't for multiplying. This is relevant here because you don't have these costs when doing additional in-instruction calculations, like address computation. – ElderBug Feb 06 '16 at 01:58
  • @ElderBug: Most current x86 CPUs can multiply with one-per-clock throughput, but none of them can do it with one cycle **latency**. If the input of a multiply is the output of the previous multiply, you can only sustain one per 3 clocks (Intel SnB-family). See http://agner.org/optimize/ for insn tables. The lowest-latency x86 multiply in those tables is Via Nano3000 (aka Isaiah), with 2c latency for `mul r8`, or for `imul r32, r32`. Most x86 CPUs can shift in a single cycle, but not shift-and-add. There are obvious code-size/density / register-scarcity benefits to addressing modes as well – Peter Cordes Feb 06 '16 at 07:45
  • @PeterCordes What you're saying isn't wrong, but as I already said, latency is mostly irrelevant when doing additional calculations in an instruction (for simple cases). Most of the latency doesn't come from the calculation itself. What I mean is that you could have an addressing mode with arbitrary multiplication without affecting the latency (or minor effect). Also, most x86 CPUs can actually do simple shift-and-add in a single cycle using `lea` :). – ElderBug Feb 06 '16 at 14:34
  • I meant "without `lea`", which wouldn't do that anyway if there wasn't an addressing mode for it. Re-reading it now, it does look obviously wrong, so what I wrote doesn't match what I meant. :( Re: latency of mul: Yes, the 3c latency *does* come from the calculation itself. `add` has 1c latency: you can `add eax, eax` with a throughput (limited by latency) of 1 `add` per clock, but only `imul eax, eax` with a throughput of one `imul` per 3 clocks, on Intel SnB. And addressing mode like `[reg1*reg2]` would probably add 2 clocks to the latency measured with pointer-chasing, if you used it. – Peter Cordes Feb 06 '16 at 15:01
  • Point is, if multiplication was as low-latency as addition in hardware, CPU vendors would make `imul` faster. It's a common enough instruction, and probably on the critical path in a lot of code. Similarly, vector integer multiply is a lot slower than vector integer add. There are forwarding networks to forward the output of one execution unit directly to the input of another (or itself). Write and re-read of results into the register file is bypassed. Instruction fetch isn't anywhere near being part of `imul` latency unless you're on a really ancient CPU that's not even pipelined. – Peter Cordes Feb 06 '16 at 15:14
  • @PeterCordes I'm pretty sure recent x86 can execute multiplication with low latency. Modern x86 CPUs have 4 integer ALU ports, and port 1 seems to include low-latency multiplication. Basic MUL instructions only use ALU port 1, and still have a throughput of 1 per cycle. This wouldn't be possible if port 1 didn't execute MUL in a single cycle, and this means the calculation is indeed the minor part of the latency. Obviously this seems non-trivial, since only 1 ALU port out of 4 includes it, but still. – ElderBug Feb 06 '16 at 15:50
  • @ElderBug: Sustaining one multiply per clock despite 3c latency means the execution unit is fully pipelined, *not* that it's low latency but for some reason sits on its hands for 2 cycles when `add` doesn't. 3c latency gives enough time for all the gate-delays required for the shifts and adds that make up a hardware multiplier. This is the same as FP add or FP multiply: you need 10 accumulators to saturate both FMA execution ports on Haswell (Broadwell can be weird about that apparently: http://stackoverflow.com/q/34309707/224132.) – Peter Cordes Feb 06 '16 at 16:10
  • @PeterCordes So you're saying there is gate-level pipelining in addition to the instruction-level pipeline? That would explain why high-latency instructions are confined to only some ports. But that would mean there is no early termination for simple MUL? I guess 2 cycles is good enough... – ElderBug Feb 06 '16 at 17:20
  • @ElderBug: 2 cycles *beyond add*. It's 3 cycles on Intel. And yes, computing a single mul takes many stages of feeding the output of one adder into the input of another adder. Even low-latency adders are non-trivial. ([64 bits of simple ripple-carry is slow.](https://en.wikipedia.org/wiki/Adder_(electronics))) This process is synchronized with the system clock at a couple mid-way points, with latches, so that a new multiply can start working its way through the gates. AMD only partially pipelines their mul unit: 4c latency, one per 2c tput. – Peter Cordes Feb 06 '16 at 17:57
  • Other high-latency instructions, like FP add, mul, and FMA, are conceptually identical. Even FP div is partly pipelined on recent Intel CPUs. Intel Skylake manages a *very* impressive `divps` throughput of one per 3c, with 11c latency. (or one per 5c for 256b vectors). Lane-crossing shuffles have 3c latency for similar reasons, it's just that the computation isn't as impressive. Making them higher latency allows implementation with far fewer transistors than doing it with a giant 1-to-N muxer for each bit. (This is a bit hand-wavey; I'm not really a hardware designer.) – Peter Cordes Feb 06 '16 at 18:01
  • And no, mul latency isn't data dependent. Variable latency is only used for very-high-latency ops like div or sqrt, because it's *far* easier for the uop scheduler to avoid creating write-back conflicts with fixed latencies. (If two operations produce a result from the same port in the same cycle, one has to retry or wait. Handling this hazard is non-trivial. SnB-family CPUs could probably do some 3c-latency operations in 2 cycles, but Intel standardizes latencies to make the scheduler lower power / more efficient. Even though it means some insns prob have higher lat than needed.) – Peter Cordes Feb 06 '16 at 18:06
1

You only need one memory addressing mode, and that's what many classic RISC machines do in practice, such as MIPS, SPARC, and RISC-V. They chose immediate($register), i.e. a register base plus a constant displacement, because there's room for an immediate in fixed-width instructions. (Some, like PowerPC, have separate instructions for indexed addressing; otherwise you have to do math in registers to compute the address you want.)


x86 can't do much without registers, so I don't think you can get rid of the register "addressing mode". Some very different architectures might not use registers at all, and just have a stack or memory-to-memory instructions. IDK how they implement pointers; maybe such architectures can do memory[memory] (C array notation).

Immediate isn't needed for computation to be possible. You can construct any value using multiple registers. Start with a zero (xor eax, eax), inc it to get a 1, left-shift it to whatever position you want, inc that to set the low bit, left-shift that, etc. So it takes at worst 2*popcount(N) instructions to get N into a register. Note that immediate shift counts won't be available, though, so the obvious method of repeated shifting by one (shl eax, 1 has a separate shift-by-one encoding with no immediate byte, or you can just use add eax, eax) means the instruction count depends on the position of the highest set bit: about log2(N) + popcount(N) for the obvious shift-and-inc method.
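As a sketch, here is the constant 10 (binary 1010) built with no immediate operands at all (NASM syntax):

    xor eax, eax    ; eax = 0
    inc eax         ; eax = 1       (the highest set bit)
    add eax, eax    ; eax = 0b10    (shift in a zero)
    add eax, eax    ; eax = 0b100
    inc eax         ; eax = 0b101   (set this bit)
    add eax, eax    ; eax = 0b1010 = 10

That's log2(10) = 3 doublings plus popcount(10) = 2 incs, plus the initial zeroing.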

Absolute (what you call direct) memory addressing is not the most useful addressing mode. We can emulate it by constructing addresses with a sequence of instructions (see above), and using [register]. If we're trying to cut down, we want to ditch it. As Jester pointed out, keeping absolute addressing as our only form would be terribly inconvenient (or maybe impossible?) to use.
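For instance, a sketch of emulating the absolute load mov eax, [0x1000] with only [register] (NASM syntax; the %rep directive just expands to twelve doublings, since 0x1000 = 1 << 12):

    xor ebx, ebx        ; ebx = 0 (writing ebx also zeroes the upper half of rbx)
    inc ebx             ; ebx = 1
    %rep 12
    add ebx, ebx        ; doubled twelve times: ebx = 0x1000
    %endrep
    mov eax, [rbx]      ; register indirect does the actual access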

Index is obviously available for performance, not necessity: you can shift and add with separate instructions.

Displacements are also just for performance, so we can get rid of them and force code to add any displacement manually. See the Immediate paragraph for how.
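For example, an array sum can step a pointer instead of using a scaled index and a displacement (a sketch in NASM syntax; the register assignments are arbitrary, with rsi as the array pointer and rcx as the element count):

    xor eax, eax        ; sum = 0
    sum_loop:
    add eax, [rsi]      ; only plain [register] addressing is used
    add rsi, 4          ; step the pointer manually instead of scaling an index
    dec rcx
    jnz sum_loop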


I believe x86 would still be arbitrarily programmable with just register and [register] addressing modes.

With register, [register], and immediate, performance shouldn't be dramatically worse than full x86, although register pressure would be worse, since you'd need spare regs to compute addresses more often, and you'd of course need more instructions.

If you were going to have only one memory addressing mode, a more useful choice than [reg] is [reg+disp] like RISCs use. That would sometimes save instructions (and spare registers) by not having to compute an address in another register, and still allow LEA to copy-and-add a constant, although not its current ability to shift-and-add two regs.
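A couple of examples of what [reg+disp] alone buys you (NASM syntax; the registers and offsets are arbitrary):

    mov eax, [rbp - 8]   ; load a local at a fixed offset, no separate add needed
    lea rdx, [rcx + 16]  ; lea still works as copy-and-add-a-constant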

For x86's current machine-code format, that takes an extra byte (or 4 bytes) of machine code. Signalling reg direct vs. [reg]-indirect (mem) vs. [reg+disp8] vs. [reg+disp32] uses 2 bits in the ModRM byte.


If implicit access to memory doesn't count as an addressing mode, you can of course emulate [register] with lodsd and stosd, but you wouldn't be able to do atomic read-modify-write operations. That feels like a cheat, though.
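For example, a load and a store through the implicit pointer registers (NASM syntax, 64-bit mode):

    lodsd               ; eax = [rsi], then rsi += 4
    stosd               ; [rdi] = eax, then rdi += 4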

There's also the stack (push/pop): I don't know if a stack+registers machine is Turing-complete, but it certainly isn't programmable in the usual sense. Of course, if you modify e/rsp, you can again emulate [register], but with less choice of operand-size than lodsb/w/d/q / stosb/w/d/q.
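A sketch of that trick (NASM syntax; note it destroys the real stack pointer, so this is illustration only):

    mov rsp, rbx        ; point the stack pointer at the data
    pop rax             ; rax = [rbx], with rsp += 8 as a side effect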

x86 has quite a lot of space to store things in registers if you include the 16 ymm registers. Although I can't think of a way to move data between integer registers and the high 128b of a ymm without using either memory or immediate operands (for vextractf128), so in practice you have more like sixteen 16B vector-register slots for stashing local state other than the stack. Still, it's limited size, which probably means that 8 GP registers in the 32-bit 386 ISA vs. all the integer/mmx/ymm registers in the 64-bit AVX2 ISA isn't relevant for whether the machine is Turing-complete with only push/pop, registers, and no modification of the stack pointer other than by push/pop.

Peter Cordes