Displaying all ascii characters in linux console (NASM assembly)

Question

I read a tutorial on NASM, and it has a code example that displays the entire ASCII character set. I understand pretty much everything except why we are pushing ecx and popping ecx, as I don't see how it relates to the rest of the code. Ecx has the value of 256 since we want all chars, but I have no idea where and how it's used. What exactly is happening when we push and pop ecx? Why are we moving the address of achar to dx? I don't see us using dx for anything. I understand that we need to increment the address of achar, but I'm confused about how the increment relates to ecx and dx. I would appreciate some insight.

   section  .text
       global _start        ;must be declared for using gcc

    _start:                 ;tell linker entry point
       call    display
       mov  eax,1           ;system call number (sys_exit)
       int  0x80            ;call kernel

    display:
       mov    ecx, 256

    next:
       push    ecx
       mov     eax, 4
       mov     ebx, 1
       mov     ecx, achar
       mov     edx, 1
       int     80h

       pop     ecx  
       mov  dx, [achar]
       cmp  byte [achar], 0dh
       inc  byte [achar]
       loop    next
       ret

    section .data
    achar db '0'

`ecx` is used as the loop counter, and since it's loaded with `achar` for the system call, its value needs to be preserved. `push`/`pop` is a way to do that. As for `mov dx, [achar]` that indeed seems unneeded. — Jester, Jan 01 '18 at 01:57
That `dx` instruction is completely unnecessary and so out of place. That is a 16-bit register. Everything else is a 32-bit register. Heh, I found the tutorial you were using, it is at https://www.tutorialspoint.com/assembly_programming/assembly_procedures.htm. Tutorials Point tutorials are written by volunteers and weird things like this are not uncommon. — Ray Toal, Jan 01 '18 at 02:05
It's counting down. Consult a reference about what the `loop` instruction does. TL;DR: it's basically `dec ecx; jnz` — Jester, Jan 01 '18 at 02:06
Not only that, ASCII has only 128 characters, not 256, and don't get me started on the indentation..... — Ray Toal, Jan 01 '18 at 02:06
@RayToal what are they comparing the bytes for? Seems redundant to me. — Asperger, Jan 01 '18 at 02:10
In the online compiler of Tutorials Point it doesn't even compile. The code seems wrong. — Asperger, Jan 01 '18 at 02:13
Never heard of "Extended ASCII"; someone just made that up, or it is some vendor-specific thing. Anyway, I agree with you that the `cmp` is meaningless. It might be a good exercise to write this program yourself; I think the author was working on a different approach and left some cruft in there. — Ray Toal, Jan 01 '18 at 02:14
@Jester thanks, I'm doing my reading here now http://asmtutor.com/#lesson4 since Tutorials Point seems unreliable — Asperger, Jan 01 '18 at 02:14
@RayToal at least things make sense now. Unfortunately it doesn't compile, but I think that's due to a bug in the Tutorials Point online compiler. Other examples work — Asperger, Jan 01 '18 at 02:20
Tutorialspoint definitely has some bogus stuff. There's *some* good stuff, but unless you already know the subject, you can't always tell the difference between a good and bad tutorial. (Especially when it appears decently written, but has some misconceptions or poor ways of doing things.) — Peter Cordes, Jan 01 '18 at 02:52
@Asperger - it's always good practice to output a final `'\n'`. (to not mess up your next prompt -- and to provide a POSIX compliant program). Just load `0xa` as the final character you print before calling exit. — David C. Rankin, Jan 01 '18 at 03:01
@DavidC.Rankin thank you very much. I will do that. Also a terminating 0h, right? — Asperger, Jan 01 '18 at 03:35
No, no need for a *nul-terminating* character -- it's not a C-string. — David C. Rankin, Jan 01 '18 at 03:46
I believe the author chickened out after realizing the output does not resemble the 16-bit DOS version enough (it does not print the special box-drawing characters, etc.), and the unfinished work-in-progress code somehow got published as a tutorial, creating a bit of a "trap" for students. Well, at least kudos to you for making it out and asking valid questions about it. :) — Ped7g, Jan 01 '18 at 11:19
@Ped7g took me a while to get out of that trap haha. The official documentation is surprisingly good though. Compared to higher-level languages it's pure gold. I only noticed after this post. — Asperger, Jan 01 '18 at 15:31

Ped7g · Accepted Answer · 2018-01-02T09:29:57.313

I understand pretty much everything

Well, then you are quite ahead of me... (although judging from your later comments, you became aware of some other nonsensical things in that code :) ).

why are we pushing ecx and popping ecx as I don't see how it relates to the rest of the code. Ecx has the value of 256 since we want all chars but no idea where and how it's used.

It is used by the `loop` instruction (which is not a good idea; see "Why is the loop instruction slow?"). `loop` decrements `ecx` and jumps if the result is non-zero, i.e. it's a count-down loop mechanism.

Since the `int 0x80` service call needs `ecx` for the memory address, the counter is saved and restored by the `push`/`pop` pair around it. A more performant way would be to keep the counter in a spare register, for example `esi`, and do `dec esi` followed by `jnz next`. An even more performant way would be to reuse the character value itself: if the output started with the value zero (not the digit zero), the zero flag after `inc byte [achar]` could be used as the loop condition, because the byte wraps around from 255 to 0 once all 256 values have been output.
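The spare-register idea can be sketched as follows. This is only an illustration, not the tutorial's or the answer's exact code: it assumes the same 32-bit Linux `int 0x80` interface used throughout this page (which preserves every register except `eax`), and it counts the 95 printable characters rather than 256 so that the output stays readable.

```nasm
; Sketch only: the loop counter lives in esi, so ecx stays free
; for the buffer address that sys_write expects.
section .text
    global _start

_start:
    mov     esi, 126-' '+1      ; 95 printable ASCII characters (32..126)
next:
    mov     eax, 4              ; sys_write
    mov     ebx, 1              ; file descriptor STDOUT
    mov     ecx, achar          ; address of the byte to print
    mov     edx, 1              ; length: one byte
    int     0x80                ; only eax is changed by the call
    inc     byte [achar]        ; advance to the next character value
    dec     esi                 ; count down ...
    jnz     next                ; ... and repeat until esi reaches zero

    mov     byte [achar], `\n`  ; finish with a new line
    mov     eax, 4              ; ebx, ecx and edx are still set
    int     0x80

    mov     eax, 1              ; sys_exit
    xor     ebx, ebx            ; exit status 0
    int     0x80

section .data
achar: db ' '                   ; first printable ASCII character
```

No `push`/`pop` is needed here because nothing in the loop body overwrites `esi`.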

achar db '0'

It's not clear to me why "display all ASCII characters" starts at the digit zero (value 48); I would start at the value zero. But that has another caveat: Linux console I/O encoding is set by the environment, and on any common Linux installation it is UTF-8 nowadays. In UTF-8 the only printable single-byte characters are the values 32-126 (identical to ordinary 7-bit ASCII, which is why this part of the example works well), while the values 0-31 and 127 are non-printable control characters, also identical to 7-bit ASCII. Values 128-255 occur in UTF-8 only as parts of multi-byte characters (example: ř is the two bytes 0xC5 0x99). Output as single bytes they are invalid sequences, because the remaining bytes of the code point's encoding are missing.
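As a small illustration of the multi-byte point, the sketch below writes the two bytes 0xC5 0x99 in a single `write` call, so a UTF-8 terminal receives the complete sequence for ř. The labels `rhacek` and `rhacek_len` are made-up names for this example, and what actually appears on screen depends on the terminal's encoding and font.

```nasm
; Sketch: output U+0159 (ř) as its two-byte UTF-8 sequence plus a new line.
section .text
    global _start

_start:
    mov     eax, 4              ; sys_write
    mov     ebx, 1              ; file descriptor STDOUT
    mov     ecx, rhacek         ; address of the byte sequence
    mov     edx, rhacek_len     ; 3 bytes: 0xC5, 0x99 and the new line
    int     0x80

    mov     eax, 1              ; sys_exit
    xor     ebx, ebx            ; exit status 0
    int     0x80

section .data
rhacek:     db 0xC5, 0x99, `\n` ; both UTF-8 bytes must be sent together
rhacek_len  equ $ - rhacek      ; length computed by the assembler
```

Sending only one of the two bytes would leave an incomplete sequence, which is the situation the original loop creates for every value above 127.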

In the age of DOS you could have written code that stored full 8-bit values from 0 to 255 directly into VGA text-mode video memory, and each value had a distinct graphical representation. You could give the VGA a custom font or use a known code page for particular characters. This is also sometimes referred to as "extended ASCII", but the common DOS installation had a different set from the one at the link in your comments, with many more box-drawing characters. This included the \r and \n values, which for the VGA are just more font glyphs, not carriage-return and line-feed control characters (that meaning is created by the BIOS/DOS service call, which moves the internal cursor to the next line instead of outputting the \n character, and discards the char from the output).

It's impossible to recreate this with Linux console I/O (unless the font in use contains all the weird DOS glyphs and you output their correct UTF-8 encodings instead of single-byte values).

The conclusion is that the example starts with the value '0' (48) and outputs correct printable ASCII characters up to the value 126. After 126 it outputs "something", and as those single bytes are not valid UTF-8 on their own, I would technically call it "bogus" output with undefined behaviour; you will probably get different results for different Linux versions and console settings.

Also a NASM style note: put a colon after labels, i.e. `achar: db '0'`. That will save you when you use an instruction mnemonic as a label by accident, like `loop:` or `dec: db 'd'`.

   mov  dx, [achar]

`dx` is not used any further, so this instruction is useless.

   cmp  byte [achar], 0dh

The flags from this compare are not used any further either, so it is useless too.

So the adjusted example can look like this:

section  .text
    global _start       ;makes symbol "_start" global (visible for linker)

_start:                 ;linker's default entry point
    call    display
    mov     eax,1       ;system call number (sys_exit)
    xor     ebx,ebx     ;exit status 0 (ebx still holds 1 from the write calls)
    int     0x80        ;call kernel

; displays all valid printable ASCII characters (32-126), and new-line after.
display:
    mov     byte [achar], ' '   ; first valid printable ASCII
next:
    mov     eax, 4
    mov     ebx, 1
    mov     ecx, achar
    mov     edx, 1
    int     0x80
    inc     byte [achar]
    cmp     byte [achar], 126
    jbe     next        ; repeat until all chars are printed
    ; that will output all 32..126 printable ASCII characters

    ; display one more character, new line (reuse of registers)
    mov     byte [achar], `\n`  ; NASM uses backticks for C-like meta chars
    mov     eax, 4      ; ebx, ecx and edx are already set from loop above
    int     0x80
    ret

section .bss
achar: resb 1           ; reserve one byte for character output

But it would make more sense to prepare the whole output in memory first and output it in one go, like this:

section  .text
    global _start       ;makes symbol "_start" global (visible for linker)

_start:                 ;linker's default entry point
    call    display
    mov     eax,1       ;system call number (sys_exit)
    xor     ebx,ebx     ;exit status 0 (ebx still holds 1 from the write call)
    int     0x80        ;call kernel

; displays all valid printable ASCII characters (32-126), and new-line after.
display:
    ; prepare in memory string with all ASCII chars and new-line
    mov     al,' '      ; first valid printable ASCII
    mov     edi, allAsciiChars
    mov     ecx, edi    ; this address will be used also for "write" int 0x80
nextChar:
    mov     [edi], al
    inc     edi
    inc     al
    cmp     al, 126
    jbe     nextChar
    ; add one more new line at end
    mov     byte [edi], `\n`
    ; display the prepared "string" in one "write" call
    mov     eax, 4      ; sys_write, ecx is already set
    mov     ebx, 1      ; file descriptor STDOUT
    lea     edx, [edi+1]; edx = edi+1 (memory address beyond last char)
    sub     edx, ecx    ; edx = length of generated string
    int     0x80
    ret

section .bss
allAsciiChars: resb 126-' '+1+1 ; reserve space for ASCII characters and \n

All examples were tested with NASM 2.11.08 on 64-bit Linux ("KDE neon" distro based on Ubuntu 16.04) and built with the commands:

nasm -f elf32 -F dwarf -g test.asm -l test.lst -w+all
ld -m elf_i386 -o test test.o

with output:

$ ./test
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

May I ask why some of the registers you chose are 16-bit and others are 32-bit versions? Is this some sort of memory or performance optimisation? — Asperger, Jan 01 '18 at 15:57
Oh, one other question. You move edi into the ecx register. You are passing the value, right, since edi is not inside square brackets? I'm referring to the second code example, where you prepare everything in memory first. — Asperger, Jan 01 '18 at 16:00
@Asperger I'm not using 16-bit registers anywhere. `al` is the lowest 8-bit part of `rax` (`eax`, `ax`, `ah:al`). I'm using `al` to work with single-byte elements (8 bits), since 7-bit ASCII is stored like that, with the top 8th bit set to zero (i.e. values `0x00..0x7F`; if you can read hexadecimal in your head and convert it to bits, you can easily see that the top 8th bit, `0x80`, is clear). It is decided by the data structure (an ASCII-encoded "string"); if I were preparing an array of 16- or 32-bit elements, I would rather use 16-bit `ax` or 32-bit `eax`, etc., although you can build anything more complex from bytes. — Ped7g, Jan 01 '18 at 18:35
@Asperger `mov edi, allAsciiChars` loads register `edi` with the 32-bit value that represents the memory address of the reserved buffer in the `.bss` section. I reserve `96` bytes there, and the first of them is at address `allAsciiChars`. A memory address in 32-bit mode is a 32-bit integer value, like `1234`. Then with `mov ecx, edi` I copy this address value into the `ecx` register. If I did `mov ecx, [edi]` instead, I would load the first 4 bytes of memory content from that buffer, which contains zeroes at that time, as the `.bss` section is zeroed by Linux when it loads and initializes the binary, before starting the code. — Ped7g, Jan 01 '18 at 18:39
Where did you get the 96 bytes from? The reserved buffer is 126 bits or 15 bytes? Not sure if I understand. — Asperger, Jan 01 '18 at 19:06
@Asperger `resb 126-' '+1+1` is "reserve bytes", and the number of reserved bytes is the expression. The expression is evaluated during assembly as `(126-32+1+1)` (the space character has the value `32` in ASCII), which equals `96`. By writing it as (last-first+1) characters I avoided doing the math myself while writing the source and let the assembler calculate it (like `('Z'-'A'+1) == 26`, because the English alphabet has 26 letters). The second `+1` reserves space for the additional newline character. And 1 byte = 8 bits, so the buffer is also 768 bits long. — Ped7g, Jan 01 '18 at 19:17
Wow, you are really great at explaining things. You can't imagine how thankful I am for your detailed answers. Assembly is really fun but challenging. Do you happen to be at CodeMentor.io? — Asperger, Jan 01 '18 at 19:42
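For readers following the buffer-size discussion in the comments above, the same arithmetic can be spelled out with named assemble-time constants. This is just a sketch: the constant names are invented for illustration and are not part of the answer's code.

```nasm
; Sketch: name the pieces of the size expression with assemble-time constants.
FIRST_CHAR  equ ' '                         ; 32, first printable ASCII character
LAST_CHAR   equ '~'                         ; 126, last printable ASCII character
CHAR_COUNT  equ LAST_CHAR - FIRST_CHAR + 1  ; 95 printable characters
BUF_SIZE    equ CHAR_COUNT + 1              ; 96 bytes, one extra for the new line

section .bss
allAsciiChars: resb BUF_SIZE                ; same 96 bytes as resb 126-' '+1+1
```

The `equ` lines emit no bytes; they only give the assembler names for values it would otherwise compute inline.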
