I decided to toy around with the source code to show you some principles of assembly, why I dislike your usage of the term "variable", and why Jester politely said "The assignment isn't terribly clear" (it is actually a bit ambiguous; as you can see, I opted for a different interpretation in my example).
First, the source I used (I named it `so_nasm_syntax_equ.asm`):
SECTION .data
message: ; these compile to machine code bytes
DB "You already know what the next", 0Ah
DB "variable will be, don't you?", 0
length EQU ($ - message) ; these don't compile to machine code, they define
length5 EQU (length + 5) ; only constants for assembler during compilation
length5var:
DB length5 ; this will compile as single byte in .data section
length5var2:
DB length+5 ; with value of that constant (plus another 5 here)
; meanwhile "length5var" is another constant for assembler, having as value memory
; address of target location where that byte containing the value will land.
SECTION .text
global _start
_start:
inc byte [length5var] ; one way to add 1 to the length5var
add byte [length5var],1 ; another way to add 1 to the length5var
; one more way to add one to length5var (this time using two instructions)
mov eax,1 ; also demonstrating the aliasing of al/ax/eax/rax
add [length5var],al ; being single register, just of different bit size
; call sys_exit(0) to terminate correctly
mov eax,1
xor ebx,ebx
int 80h
; you can't add to constant anything
; this will try to increment value in memory at address 0x41, leading to crash
inc byte [length5] ; as that memory address doesn't belong to the .data
rett ; warnings test
; vs
ret
To compile it I used (on a 64-bit "neon" Linux distro):
nasm -w+all so_nasm_syntax_equ.asm -l so_nasm_syntax_equ.lst -f elf32
ld -m elf_i386 so_nasm_syntax_equ.o -o so_nasm_syntax_equ
Output of compilation:
so_nasm_syntax_equ.asm:33: warning: label alone on a line without a colon might be in error
And the listing file produced, which I will finally interleave with comments/explanations:
1 SECTION .data
The first number on each line is the line number.
2 message: ; these compile to machine code bytes
3 00000000 596F7520616C726561- DB "You already know what the next", 0Ah
4 00000009 6479206B6E6F772077-
5 00000012 68617420746865206E-
6 0000001B 6578740A
7 0000001F 7661726961626C6520- DB "variable will be, don't you?", 0
8 00000028 77696C6C2062652C20-
9 00000031 646F6E277420796F75-
10 0000003A 3F00
The 8-digit hex number after the line number is the "address" (offset within the section). The hex digit pairs that follow are the final machine code, i.e. the byte values stored in the executable file. The OS later loads them into memory, prepares the environment, and finally executes the program by jumping to the entry point. A trailing "-" after the byte values just marks that the machine code for that source line is not finished and continues on the next listing line.
Note how line 2, `message:`, itself didn't produce any machine code. All it does is create the symbol `message`, which is available to the assembler during compilation (and also to the linker, when you declare a particular symbol as global). The value of the symbol `message` here is `0x00000000`, the memory address offset of the first byte, which has the value `0x59`, equal to the letter `'Y'` in UTF-8 encoding (and also in ASCII encoding).
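To double-check how the hex pairs in the listing map to characters, here is a small Python sketch. Python is used here only as a calculator and is not part of the original build; the variable names are mine. It decodes the bytes shown on listing line 3:

```python
# Bytes copied from listing line 3 (without the trailing "-" continuation mark).
first_chunk = bytes.fromhex("596F7520616C726561")

first_byte = first_chunk[0]             # the byte stored at offset 0x00000000
first_char = chr(first_byte)            # the same byte read as an ASCII character
decoded = first_chunk.decode("ascii")   # all nine bytes read as ASCII text

print(hex(first_byte), first_char, repr(decoded))
```

The label `message` contributes nothing to these bytes; it only names the offset at which the first one lands.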
You can't deduce anything else from the `message` symbol: not how many bytes are defined after it, not that a `DB` directive was used after it, etc. The `message` itself is just a memory address, nothing more. That's why I don't like the word "variable" in assembly. Variables in C/C++, for example, are much more: not only do they point to the first byte of the allocated space, but the compiler is also aware of the variable's type and total allocated size, and uses that in expressions. The assembler has none of that: `message = 0x00000000`, and that's all there is to it.
11 length EQU ($ - message) ; these don't compile to machine code, they define
12 length5 EQU (length + 5) ; only constants for assembler during compilation
Here I defined two more constants with the `EQU` directive, so now `length = 0x3C` and `length5 = 0x41`. They will never reach the binary by themselves; they are visible only to NASM while it compiles the remaining lines of this source code.
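The arithmetic behind those two constants can be reproduced by hand. The following is only a sketch in Python (not something NASM runs; the names `here` and `message_offset` are my own stand-ins for `$` and the label value):

```python
# The two DB lines from the source, as the raw bytes NASM emits for them.
message = (
    b"You already know what the next" + bytes([0x0A])
    + b"variable will be, don't you?" + bytes([0])
)

message_offset = 0                     # value of the label `message` in .data
here = message_offset + len(message)   # what `$` evaluates to on the EQU line

length = here - message_offset         # length  EQU ($ - message)
length5 = length + 5                   # length5 EQU (length + 5)

print(hex(length), hex(length5))
```

Both results are plain integers known at assembly time, which is why they occupy no space in the output unless a `DB` (or an instruction operand) uses them.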
13 length5var:
14 0000003C 41 DB length5 ; this will compile as single byte in .data section
Here the `length5` constant is used to define the value of a single byte, pointed at by another symbol, `length5var = 0x0000003C`.
15 length5var2:
16 0000003D 41 DB length+5 ; with value of that constant (plus another 5 here)
Here another byte is defined, this time using the constant `length` and an arithmetic expression (constant + 5), which can be evaluated during compilation, so it again produces the same `0x41` value in the executable. I also defined another label ahead of it, so the `length5var2` constant is equal to `0x0000003D`.
17 ; meanwhile "length5var" is another constant for assembler, having as value memory
18 ; address of target location where that byte containing the value will land.
19
20 SECTION .text
21 global _start
22 _start:
Here the value of `_start` is defined as offset `0x00000000` into the `.text` section (`message` has the same offset in this listing, but it relates to the `.data` section, and the two will end up with different values once each section is assigned its final address).
`_start` is also made `global`, so the linker can find it in the `.o` file and use it during linking (to mark the correct entry point of the app for the OS loader).
23
24 00000000 FE05[3C000000] inc byte [length5var] ; one way to add 1 to the length5var
25 00000006 8005[3C000000]01 add byte [length5var],1 ; another way to add 1 to the length5var
These should be self-explanatory; just note how the `length5var` address `0x0000003C` is part of the instruction's machine code (still in its pristine `0x0000003C` form; it gets relocated to the correct final address before the code executes).
26 ; one more way to add one to length5var (this time using two instructions)
27 0000000D B801000000 mov eax,1 ; also demonstrating the aliasing of al/ax/eax/rax
28 00000012 0005[3C000000] add [length5var],al ; being single register, just of different bit size
Here is another way of adding 1 to the memory value, this time using the register `al` as the source of the value `1` for the addition. Because `al` is used in the `add` instruction, the assembler can deduce the memory operand size, so I don't have to write `byte` ahead of `[length5var]`: `al` is byte-sized. `al` is of course equal to `1`, because I load the whole 32-bit `eax` with the value `1`, and `al` is an alias for the lowest 8 bits of `eax`, which are therefore equal to `1` too.
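The register aliasing can be modelled with simple bit masks. This is only an illustrative Python sketch; the helper `sub_registers` is hypothetical and exists just for this example:

```python
def sub_registers(eax):
    """Return (ax, al, ah): the narrower views of a 32-bit eax value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF          # low 16 bits
    al = eax & 0xFF            # low 8 bits
    ah = (eax >> 8) & 0xFF     # bits 8..15
    return ax, al, ah

ax, al, ah = sub_registers(1)  # state after `mov eax,1`

# The byte at length5var is 0x43 at this point (0x41 plus the two earlier additions).
byte_in_memory = 0x43
byte_in_memory = (byte_in_memory + al) & 0xFF   # add [length5var],al

print(ax, al, ah, hex(byte_in_memory))
```

Writing `eax` changes all of these views at once, because they are the same physical register seen at different widths.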
29
30 ; call sys_exit(0) to terminate correctly
31 00000018 B801000000 mov eax,1
32 0000001D 31DB xor ebx,ebx
33 0000001F CD80 int 80h
This terminates the program, which is actually the only effect visible from outside (terminating correctly without a crash). To see those `inc`/`add` instructions in action, you can use a debugger and single-step over them, verifying that the memory value went from `0x41` to `0x42` (and from 0x42 to 0x43 with the second addition, etc.).
34
35 ; you can't add to constant anything
36 ; this will try to increment value in memory at address 0x41, leading to crash
37 00000021 FE0541000000 inc byte [length5] ; as that memory address doesn't belong to the .data
But if you try to use the `length5` constant in the same way, NASM will just substitute the `length5` symbol with the numeric value `0x41` and compile it as `inc byte [0x41]`, which is understood as absolute addressing: an attempt to access memory at the absolute address `0x41` (it will not be relocated).
Actually, this shows that `message` and `length5` are not the same kind of symbolic constant. `message` compiles to "address 0x00", but NASM is aware that it belongs to the `.data` section and generates relocation data for it as needed, while `length5` is just the pure numeric value `0x41`. When you use it as an address, it will not be relocated, and the absolute address `0x41` will be accessed (which would cause a crash if the app weren't already terminated by the preceding `int 80h`).
The values in machine code that are subject to relocation are marked by `[]` in the listing; compare the machine code of that second `inc` with the previous one. The opcode `FE05` is the same and the encoded value `0x3C` vs `0x41` differs, but the `[]` marks (which are not part of the machine code, they only flag that spot for the reader of the listing file) mean that NASM will emit an accompanying relocation entry, so the later stages know which bytes of code to patch with the actual final address.
So if you check this binary disassembled in a debugger while it is ready to be executed, the first `inc` opcode `FE05[3C000000]` will look like `FE05E4900408` (in my case the binary was loaded such that `length5var` ended up at address `0x80490e4` in memory). The second `inc` opcode is still `FE0541000000` (no relocation was applied to this one).
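What the patching amounts to can be sketched in a few lines of Python. This is a toy model, not how any real linker is implemented: `apply_reloc` is a hypothetical helper, and `DATA_BASE` is my reading of the `.data` address `a8 90 04 08` that appears in the hex dump further below.

```python
import struct

DATA_BASE = 0x080490A8   # assumed final address of .data (read from the hex dump)

def apply_reloc(code, offset, section_base):
    """Add section_base to the 32-bit little-endian value stored at offset."""
    (addend,) = struct.unpack_from("<I", code, offset)
    patched = bytearray(code)
    struct.pack_into("<I", patched, offset, (addend + section_base) & 0xFFFFFFFF)
    return bytes(patched)

inc_var = bytes.fromhex("FE053C000000")     # listed as FE05[3C000000]
inc_const = bytes.fromhex("FE0541000000")   # listed without [] marks

inc_var_final = apply_reloc(inc_var, 2, DATA_BASE)   # operand starts after FE 05
inc_const_final = inc_const                          # no relocation entry, left as-is

print(inc_var_final.hex(), inc_const_final.hex())
```

Only the operand that carried a relocation entry changes; the constant `0x41` stays an absolute address.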
38
39 rett ; warnings test
40 ; vs
41 00000027 C3 ret
This is just a test of the warning about labels without colons. Used properly, it can help you catch typos in instructions, as long as you distinguish labels from instructions by always putting a colon after labels. So `rett:` will not produce the warning, and you can see in the source that it is not an instruction but a label.
Without the warning, `rett` is silently turned into a label, and if it was a typo (instead of the `ret` instruction), the `0xC3` machine code for the `ret` instruction will be missing from the code, producing unexpected behaviour. The example above produces only a single `0xC3` opcode, for the correct `ret`.
And to complete this tour through the basics of assembly language usage, this is what the executable's binary content looks like after applying `strip so_nasm_syntax_equ` to remove some unneeded symbol (debug) info:
$ strip so_nasm_syntax_equ
$ hd -v so_nasm_syntax_equ
00000000 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 |.ELF............|
00000010 02 00 03 00 01 00 00 00 80 80 04 08 34 00 00 00 |............4...|
00000020 00 01 00 00 00 00 00 00 34 00 20 00 02 00 28 00 |........4. ...(.|
00000030 04 00 03 00 01 00 00 00 00 00 00 00 00 80 04 08 |................|
00000040 00 80 04 08 a8 00 00 00 a8 00 00 00 05 00 00 00 |................|
00000050 00 10 00 00 01 00 00 00 a8 00 00 00 a8 90 04 08 |................|
00000060 a8 90 04 08 3e 00 00 00 3e 00 00 00 06 00 00 00 |....>...>.......|
00000070 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000080 fe 05 e4 90 04 08 80 05 e4 90 04 08 01 b8 01 00 |................|
00000090 00 00 00 05 e4 90 04 08 b8 01 00 00 00 31 db cd |.............1..|
000000a0 80 fe 05 41 00 00 00 c3 59 6f 75 20 61 6c 72 65 |...A....You alre|
000000b0 61 64 79 20 6b 6e 6f 77 20 77 68 61 74 20 74 68 |ady know what th|
000000c0 65 20 6e 65 78 74 0a 76 61 72 69 61 62 6c 65 20 |e next.variable |
000000d0 77 69 6c 6c 20 62 65 2c 20 64 6f 6e 27 74 20 79 |will be, don't y|
000000e0 6f 75 3f 00 41 41 00 2e 73 68 73 74 72 74 61 62 |ou?.AA..shstrtab|
000000f0 00 2e 74 65 78 74 00 2e 64 61 74 61 00 00 00 00 |..text..data....|
00000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000120 00 00 00 00 00 00 00 00 0b 00 00 00 01 00 00 00 |................|
00000130 06 00 00 00 80 80 04 08 80 00 00 00 28 00 00 00 |............(...|
00000140 00 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 |................|
00000150 11 00 00 00 01 00 00 00 03 00 00 00 a8 90 04 08 |................|
00000160 a8 00 00 00 3e 00 00 00 00 00 00 00 00 00 00 00 |....>...........|
00000170 04 00 00 00 00 00 00 00 01 00 00 00 03 00 00 00 |................|
00000180 00 00 00 00 00 00 00 00 e6 00 00 00 17 00 00 00 |................|
00000190 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
000001a0
At offset `00000080` you can see the `inc` opcode has already been relocated by the linker to the target address in the `.data` section, while the other `inc`, at offset `000000a1`, is left intact, still having the machine code `fe 05 41 00 00 00`.
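If you want to see why the code sits at file offset `0x80`, the first two rows of the dump can be decoded as an ELF32 header. The sketch below is a hand-rolled Python reading of those 32 bytes only; `TEXT_VADDR` and `TEXT_FILE_OFFSET` are values I read by eye from the first program header at offset `0x34`, so treat them as assumptions.

```python
import struct

# First 0x20 bytes of the hex dump (offsets 0x00-0x1f).
header = bytes.fromhex(
    "7f454c46010101000000000000000000"
    "02000300010000008080040834000000"
)

magic = header[:4]
# ELF32 fields after the 16-byte e_ident: e_type, e_machine, e_version, e_entry, e_phoff
e_type, e_machine, e_version, e_entry, e_phoff = struct.unpack_from("<HHIII", header, 16)

TEXT_VADDR = 0x08048000     # assumed: p_vaddr of the first program header
TEXT_FILE_OFFSET = 0        # assumed: p_offset of the first program header

entry_file_offset = e_entry - TEXT_VADDR + TEXT_FILE_OFFSET

print(magic, hex(e_entry), hex(e_phoff), hex(entry_file_offset))
```

The computed file offset of the entry point matches the row where the `fe 05 e4 90 04 08` bytes of the first `inc` begin.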
Looks like this is quite instructive even for me, as I keep mixing up which part of the machine code is patched by the linker and which by the OS when loading the binary. Normally you don't need this while programming in assembly; the important part is to understand that those addresses and symbols are compile-time constants, and when you want dynamic memory management, you have to write all the code around that yourself, storing and using the memory address values dynamically.