Assembly: What is the purpose of movl data_items(,%edi,4), %eax in this program

Question

This program (from Jonathan Bartlett's Programming From the Ground Up) cycles through all the numbers stored in memory with .long and puts the largest number in the EBX register for viewing when the program completes.

.section .data
data_items:
    .long 3, 67, 34, 222, 45, 75, 54, 34, 44, 33, 22, 11, 66, 0

.section .text
.globl _start

_start:
    movl $0, %edi
    movl data_items (,%edi,4), %eax
    movl %eax, %ebx
start_loop:
    cmpl $0, %eax
    je loop_exit
    incl %edi
    movl data_items (,%edi,4), %eax
    cmpl %ebx, %eax
    jle start_loop
    movl %eax, %ebx
    jmp start_loop
loop_exit:
    movl $1, %eax
    int $0x80

I'm not certain about the purpose of (,%edi,4) in this program. I've read that the commas are for separation, and that the 4 reminds the computer that each number in data_items is 4 bytes long. Since we've already declared that each number is 4 bytes with .long, why do we need to do it again here? Also, could someone explain in more detail what purpose the two commas serve in this situation?

*why do we need to do it again here?* Because this is assembly language. Everything is just bytes. If you wanted to load 4 bytes that overlap two adjacent array elements, you can, because x86 supports unaligned loads. It's up to you to write each instruction on its own to do exactly what it needs to do. – Peter Cordes, Jan 10 '18 at 02:59
To extend Peter's comment: the machine code and the CPU do not care about your source, so the fact that the bytes were defined with the `.long` directive is not visible to the machine. Both `.long 0x12345678` and `.byte 0x78, 0x56, 0x34, 0x12` produce exactly the same 4-byte sequence, so at runtime there is no way to tell how those bytes were defined or what their purpose was. It is up to the running code to decide how to access and use them. (If you want "types" while programming, use a high-level language like C++; assembly has different goals, exposing the machine as-is.) – Ped7g, Jan 10 '18 at 10:38
This is still interesting, though: I removed the 4 and the commas and the program still works fine, but when I remove the parentheses I get an "invalid operand" error. – Katz_Katz_Katz, Jan 26 '18 at 14:29

Matteo Italia · Accepted Answer · 2018-01-10T23:06:56.357

5

In AT&T syntax, memory operands have the following syntax¹:

displacement(base_register, index_register, scale_factor)

The base, index and displacement components can be used in any combination, and every component can be omitted; the commas, however, must be retained if you omit the base register, otherwise the assembler could not tell which component you are leaving out. The scale factor must be 1, 2, 4 or 8, and defaults to 1 when omitted.

All this data gets combined to calculate the address you are specifying, with the following formula:

effective_address = displacement + base_register + index_register*scale_factor

(which incidentally is almost exactly how you would specify this in Intel syntax).

So, armed with this knowledge we can decode your instruction:

movl data_items (,%edi,4), %eax

Matching the syntax above, you see that:

data_items is the displacement;
base_register is omitted, so is not put into the formula above;
%edi is index_register;
4 is scale_factor.

So, you are telling the CPU to move a long from the location data_items+%edi*4 to the register %eax.

The *4 is necessary because each element of your array is 4 bytes wide, so to turn the index (in %edi) into an offset in bytes from the start of the array you have to multiply it by 4.

Since we've already declared that each number is 4 bytes with .long, why do we need to do it again here?

Assemblers are low-level tools that know nothing about types.

.long is not an array declaration; it is just a directive to the assembler to emit the bytes corresponding to the 32-bit representation of its parameters;
data_items is not an array; it is just a symbol that gets resolved to some memory location, exactly like the other labels; the fact that you placed a .long directive after it is of no particular significance to the assembler.

Notes

¹ Technically, there would also be the segment specifier, but given that we are talking about 32-bit code on Linux I'll omit segments entirely, as they would only add confusion.

edited Jan 10 '18 at 23:06

answered Jan 10 '18 at 01:34


`offset_register` is not a very good name. It's actually the index, in Intel terminology. It's the register that's scaled by the scale factor, even if that factor is `1`. `offset` already has a specific technical meaning here (the part of the address that's added to the segment base). – Peter Cordes Jan 10 '18 at 02:53
Perhaps they were trying to avoid "index" because they want to talk about array indexing with base + disp32 addressing modes like `data_items(%edi)`, but that's *not* an "indexed addressing mode" in Intel terminology. There's no SIB byte in the encoding, and the difference [matters for micro-fusion](https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes) on Intel SnB-family CPUs. – Peter Cordes Jan 10 '18 at 02:55
@PeterCordes yep, I was a bit surprised by that choice of names as well, but I was lazy and stuck with it =). When I get home I'll make it conform to the Intel manuals. – Matteo Italia Jan 10 '18 at 20:12
Yeah, if I wasn't lazy I'd edit the Wikibook. But basically I try to limit my efforts to correcting mistakes on SO. I'd never have time for anything else (or get burned out faster) if I really tried to fix the whole web. – Peter Cordes Jan 10 '18 at 20:13
@PeterCordes: aaah, now it has become a challenge; fixed here, fixed [there](https://en.wikibooks.org/w/index.php?title=X86_Assembly%2FGAS_Syntax&type=revision&diff=3360455&oldid=3313138); hopefully they'll accept the edit. – Matteo Italia Jan 10 '18 at 22:57
I greatly appreciate everyone's help with this. It's a good insight that my reference to "declaring" is something that applies to C/C++ when assigning a variable, and that this is completely different from assembler directives. So when we use the directive .long or .byte, it initializes the byte or bytes; is it being initialized as an object in this case? – Katz_Katz_Katz Jan 11 '18 at 01:45
@Katz_Katz_Katz: honestly I cannot understand what you mean by your last sentence, can you rephrase it? – Matteo Italia Jan 11 '18 at 08:35
Honestly, I'm having a little bit of trouble understanding "initialization". Wikipedia says "...initialization is the assignment of an initial value for a data object or variable." So I figured that since we are not assigning variables with .long, we are creating a data object. Is this the case? Sorry if I was unclear... – Katz_Katz_Katz Jan 11 '18 at 17:53
IMO these are just distinctions that create confusion. At assembly level, memory is just bytes. There you simply have reserved a chunk of bytes that the loader makes sure are initialized to those values (=the byte representation of those values) when your program starts, and the symbol `data_items` gets resolved to the address of the start of such block. Then, in your code you can use it to access and manipulate that data however you prefer (including interpreting it as different data types from those you used to initialize it, which are completely accidental). – Matteo Italia Jan 11 '18 at 18:12
