Why mov ax, [[num] + val] isn't the same as breaking it to instructions

Question

I need to use a pointer to an array and put the third value in ax. My solution:

mov bx, [chrs_ptr]
add bx, 2
mov ax,[bx]

But I couldn't figure out why mov ax, [[chrs_ptr] + 2] gives me the pointer value.

Is that even a valid assembly instruction? I don't think []s can nest... — user202729, Aug 16 '19 at 10:13
Also note that if your array contains words, the third value is at offset 4 not 2 as each item is 2 bytes. If your array contains bytes, then offset 2 is fine but then don't load a word. — Jester, Aug 16 '19 at 10:24
@Jester The array is bytes but I eventually print al so I think that's fine — Adam Katav, Aug 16 '19 at 10:48
@AdamKatav What are you using to assemble this, and what are the produced opcodes? If I had to guess, I'd assume that the assembler is dropping one of the pairs of brackets. Do you get assembly warnings? — Thomas Jager, Aug 16 '19 at 11:11

score 7 · Accepted Answer · answered Aug 16 '19 at 10:44

Because assembly language is different from most other programming languages.

Most programming languages aim for generality: their syntax provides building blocks that can be composed into expressions far beyond the trivial use of any single feature.

Assembly is a set of "human-readable" mnemonics for the machine code of a particular processor. That is, the instructions available in a particular assembler are (mostly) not designed by the assembler's creator; they are defined by the hardware engineers who designed the CPU itself. So the instructions of your assembler depend on the target processor, and they map essentially 1:1 from the source form to actual CPU instructions (with a few exceptions: some assemblers do support a few pseudo-instructions, but generally it's 1:1).

And thus emu8086, which emulates the 8086 CPU (and also a few later ones from that family, the 80286 and maybe even the 80386? I'm never sure what emu8086 supports, as I don't use it), has only the Intel instructions designed into those processors.

There is, for example, the MOV r16,r/m16 instruction in 16-bit real mode, which you are using in the line mov ax,[bx], but there is no instruction like MOV r16,memory-by-indirection-from-other-memory with a +2 offset. So when programming in assembly, you are expected to know the target instruction set and to write your solution with the instructions that are available.
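As a rough sketch of what "using the available instructions" can look like here: the constant offset can be folded into the second load as a displacement, so two instructions are enough. This assumes, as stated in the question and its comments, that chrs_ptr is a word variable holding the offset of a byte array in the data segment; the syntax is MASM/emu8086 style.

```asm
mov bx, [chrs_ptr]   ; first memory read: load the pointer stored in chrs_ptr into BX
mov al, [bx+2]       ; second memory read: load the byte at (pointer + 2), the third byte
```

Loading into AL rather than AX matches a byte array; a word load from the same address would also pull the fourth byte into AH.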

You may wonder why some instruction does only "that" but not also "this", which may seem like an obvious thing to do. But most of the time the ISA (Instruction Set Architecture) tends to be a very good compromise between what is enough to write practical code with and what is practical to design in the circuit, with reasonable power consumption and timing per machine cycle, and, for early CPUs, what was practical to design by hand and validate in one's head.

Your proposal of mov ax,[[chrs_ptr]+2] would require two memory reads within a single instruction's execution, and reading memory is far from a trivial task (it often stalls the CPU for multiple machine cycles until the memory chip is ready to deliver the value from the particular cell), so you would immediately raise the complexity of CPU circuitry design a lot. Like a lot.

Or you may ask why the assembler doesn't break it down into the three native processor instructions for you.

But usually when you need assembler, you actually need that 1:1 mapping from source instruction to native machine-code instruction of the target CPU. If you don't need that, then why use assembler at all? There are many higher-level languages, like C or C++, that still give you enough low-level control over the machine but lift this kind of micromanagement off you. So there's very little incentive to enrich the assembler "language" itself, and when you do see such an effort, it usually ends up as just another programming language, too tainted to be used in place of an actual assembler when you really need to write those few CPU instructions by hand and tune them down to every single byte.

*you would immediately raise the complexity of CPU circuitry design a lot*. That argument breaks down when you consider the 8086 instruction `cmpsb`, which compares two memory operands. Really this is the same instruction-encoding limitation as [Why isn't movl from memory to memory allowed?](//stackoverflow.com/q/33794169) - a ModRM byte can't encode it in machine code. Some ISAs do have memory-indirect addressing modes, e.g. early machines with small regs like PDP-8 had to keep wider pointers in mem. https://en.wikipedia.org/wiki/Addressing_mode#Memory_indirect — Peter Cordes, Aug 16 '19 at 11:04
`cmpsb` does read at least two independent values (and `movsb` also goes to memory controller twice, and even the write depends on the value read). Then again I guess those string instructions are implemented in quite few gates, compared to other parts of CPU... And yes, there are many different designs, like for example Commodore 64's 6502 using first 256 bytes of memory as special variables-like area. So yes, it's a bit too inaccurate wording and argument, let's say the Intel engineers just didn't find that design attractive enough to make it happen. — Ped7g, Aug 16 '19 at 11:18
The difference is that decoding `cmpsb` doesn't require decoding two ModRM bytes: ModRM decoding is probably not internally microcoded, so supporting more complex addressing modes *would* increase complexity. That's one reason why string instructions are implicitly DS:SI and/or ES:DI, with no explicit addr mode to decode. 8086 presumably implemented them with microcode, making it pretty cheap. IMO the big price for memory-indirect addressing modes would be code-size, as well as decode complexity. One ModRM byte is already very limited in addr modes; maybe you'd drop `disp8` to gain a "mode". — Peter Cordes, Aug 16 '19 at 11:34

Peter Cordes · Answer 2 · 2019-08-16T14:37:02.187

There's no way to encode a memory-indirect addressing mode in 8086 machine code. The ModRM byte for 16-bit addressing modes can encode only a few possibilities: any subset of [bx|bp + si|di + disp0/8/16].
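To make that subset concrete, here is an illustrative sample in MASM/emu8086-style syntax. It is not an exhaustive list, and chrs_ptr is assumed to be a data label as in the question. The commented-out lines are combinations that have no 16-bit ModRM encoding.

```asm
; Encodable: at most one base (BX or BP), at most one index (SI or DI),
; plus an optional 8- or 16-bit displacement
mov ax, [bx]            ; base only
mov ax, [si+4]          ; index + displacement
mov ax, [bx+si]         ; base + index
mov ax, [bp+di+2]       ; base + index + displacement
mov ax, [chrs_ptr+2]    ; displacement only (absolute address)

; Not encodable in 16-bit addressing:
; mov ax, [ax]          ; AX cannot be used in an address
; mov ax, [cx+2]        ; neither can CX
; mov ax, [bx+bp]       ; two base registers
; mov ax, [si+di]       ; two index registers
```

None of the encodable forms reads memory to compute the address; the address is always built from registers and a constant.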

It's basically the same reason you can't mov word ptr [si], [di] (Why isn't movl from memory to memory allowed?)
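The usual workaround for the memory-to-memory case is to go through a register. A minimal sketch, assuming AX is free to clobber and both pointers refer to the default DS segment:

```asm
mov ax, [di]     ; load the word from memory at DS:DI into a register
mov [si], ax     ; store that word to memory at DS:SI
```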

The limitations in asm source come from machine code, unlike in a high level language where you can just invent syntax that compiles to more instructions if necessary.

I think you're saying that emu8086's assembler really does accept mov ax,[[chrs_ptr]+2] without error. That's because EMU8086's built-in assembler is bad and doesn't always report syntax errors for broken code.

Presumably it assembles it the same as mov ax, [chrs_ptr+2], ignoring the extra square brackets. Actually since it uses MASM/TASM syntax, it's also the same as mov ax, chrs_ptr+2

Unfortunately even MASM/TASM don't warn about mov ax, [[chrs_ptr] + 2]. As Ross Ridge points out, square brackets don't mean a memory ref unless there's a register name inside square brackets. Brackets are just ignored otherwise. See Confusing brackets in MASM32
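A sketch of what that implies, based on the behaviour described above and in Ross Ridge's linked answer, and assuming chrs_ptr is a label on a word variable. I haven't checked the emitted bytes in every assembler, so treat the equivalence as something to confirm with a disassembler.

```asm
; MASM/TASM syntax: with only a label inside, the brackets add nothing,
; so all three are expected to assemble to the same load of the word at chrs_ptr+2
mov ax, chrs_ptr+2
mov ax, [chrs_ptr+2]
mov ax, [[chrs_ptr]+2]

; To get the address of the label rather than the memory contents
mov ax, offset chrs_ptr
```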

IIRC, emu8086's assembler has other misfeatures like add [bx], 1 assuming an operand-size of "word" or byte, I forget which, instead of erroring on ambiguous operand size.
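With a memory destination and an immediate source, nothing in the operands implies a size, so it is safer to spell it out. In MASM/emu8086-style syntax that looks like:

```asm
add byte ptr [bx], 1   ; read-modify-write the single byte at DS:BX
add word ptr [bx], 1   ; read-modify-write the 16-bit word at DS:BX
```

The two forms assemble to different opcodes, which is why a good assembler treats the unqualified add [bx], 1 as an error instead of guessing.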

It's basically terrible, avoid it if you possibly can.

Or if not, check your code for syntax errors with another assembler like MASM or TASM that will warn instead of just assembling impossible code into machine code that does something. (Except in this case where even they fail to disallow this confusing syntax.)

Some people like MASM-style syntax for other reasons, but I would not recommend it. NASM is nice, especially if you don't really care about obsolete 16-bit segmented development outside of simple bootloaders and .com programs.

Also, good assemblers have always had support for listing files, which I often use to validate my source when I suspect something was not assembled precisely as I expected, or when I have assumptions about how the machine code will look and need to verify them. (Another option is to check in a debugger, but listing files work better in some cases.) That said, emu8086 didn't have listing files for a long time IIRC; were they added just a few years back in some recent version? — Ped7g, Aug 16 '19 at 11:21
@Ped7g. I've never used emu8086, just read about it on SO. (Or did I run it once to try something? I forget.) Anyway, looking at a listing of hex machine code + source only helps if you know machine code well enough to realize that `add [si], 1` is using the wrong operand-size and you forgot a `word ptr` or `byte ptr`. Easier with `mov` if you know that `mov r/m16, imm8` doesn't exist. When I want to validate source, I look at disassembly, not listings. (`objdump` or in a debugger, and/or `objconv` or ndisasm if I don't trust GNU binutils like for x87 `fsubr` AT&T bug in Intel mode) — Peter Cordes, Aug 16 '19 at 11:38
@RossRidge Are you sure? The `chrs_ptr` in `mov bx, [chrs_ptr]` isn't a register. — user202729, Aug 16 '19 at 16:12
@user202729 `chrs_ptr` is presumably a label though, so MASM uses it as memory reference whether or not you put it in brackets. The statement `mov bx, [chrs_ptr]` is the same as `mov bx, chrs_ptr`. https://stackoverflow.com/questions/25129743/confusing-brackets-in-masm32/25130189#25130189 — Ross Ridge, Aug 16 '19 at 19:34
