Where are the null-terminated strings when converting from C to assembly?

Question

I made two programs to output two strings, one in assembly and the other one in C. This is the program in assembly:

.section .data
string1:
.ascii "Hola\0"
string2:
.ascii "Adios\0"

.section .text
.globl _start
_start:

pushl $string1
call puts
addl $4, %esp

pushl $string2
call puts
addl $4, %esp

movl $1, %eax
movl $0, %ebx
int $0x80

I build the program with

as test.s -o test.o
ld -dynamic-linker /lib/ld-linux.so.2 -o test test.o -lc

And the output is as expected

Hola
Adios

This is the C program:

#include <stdio.h>
int main(void)
{
    puts("Hola");
    puts("Adios");
    return 0;
}

And I get the expected output, but when I convert this C program to assembly with gcc -S (the OS is 32-bit Debian), the generated assembly source does not include the null character in either string, as you can see here:

    .file   "testc.c"
    .section    .rodata
.LC0:
    .string "Hola"
.LC1:
    .string "Adios"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    leal    4(%esp), %ecx
    .cfi_def_cfa 1, 0
    andl    $-16, %esp
    pushl   -4(%ecx)
    pushl   %ebp
    .cfi_escape 0x10,0x5,0x2,0x75,0
    movl    %esp, %ebp
    pushl   %ecx
    .cfi_escape 0xf,0x3,0x75,0x7c,0x6
    subl    $4, %esp
    subl    $12, %esp
    pushl   $.LC0
    call    puts
    addl    $16, %esp
    subl    $12, %esp
    pushl   $.LC1
    call    puts
    addl    $16, %esp
    movl    $0, %eax
    movl    -4(%ebp), %ecx
    .cfi_def_cfa 1, 0
    leave
    .cfi_restore 5
    leal    -4(%ecx), %esp
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Debian 4.9.2-10) 4.9.2"
    .section    .note.GNU-stack,"",@progbits

My two questions are:

1) Why doesn't the gcc-generated assembly code append the null character at the end of both strings? I thought that C did this automatically.

2) If I skip the null characters in my hand-made assembly code, I get this output:

HolaAdios
Adios

I understand why I get the "HolaAdios" part at the first line, but why does the program end successfully after the "Adios" part if it is not null-terminated?

Your assembler code uses the `.ascii` directive for the strings, while the GCC generated code uses `.string`. Read [the GNU AS documentation](https://sourceware.org/binutils/docs/as/) for more information about the directives. — Some programmer dude, Aug 09 '16 at 11:20
As for your second question, think about what the data is after your strings. It could be *anything*. Not having the terminator will simply lead to undefined behavior, and that it seemingly works is just one of the possibilities of UB. — Some programmer dude, Aug 09 '16 at 11:22
Since you're skipping the libc's initialization in your assembly program, calling `puts()` is probably undefined behavior. — EOF, Aug 09 '16 at 11:31
"why does the program end successfully after the "Adios" part if it is not null-terminated?" - **mere luck (could have just as well ended "unsuccessfully")**. — barak manos, Aug 09 '16 at 11:36
Ok, thanks. I didn't even notice the .string directive. Now I see that also the .asciz directive will do the job. — saga.x, Aug 09 '16 at 11:44
@EOF, it's actually fine on Linux; the dynamic linker calls initializer functions, so [you can use libc functions from `_start`](http://stackoverflow.com/questions/36861903/assembling-32-bit-binaries-on-a-64-bit-system-gnu-toolchain/36901649#36901649) without manually calling the init functions if you don't statically link libc. But you're right that it's not a portable practice, and generally not recommended. — Peter Cordes, Aug 09 '16 at 13:07

score 6 · Accepted Answer · answered Aug 09 '16 at 11:33


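.string always appends a null terminator (unlike .ascii, which emits only the bytes you write), as the GNU as manual documents. That is why the gcc output needs no explicit \0.
As for the second question, you can check it yourself: puts just keeps reading until it sees a null byte. Zero bytes are very common in memory, so there was most likely one right after your last string (probably padding after the end of the section), and that is why it appeared to work.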

answered Aug 09 '16 at 11:33

Armitage.apk


score 0 · Answer 2 · edited May 23 '17 at 12:22

Just to add a bit more detail:

Your second string is zero-terminated by chance, because there's nothing after it in your .data section. You dynamically link glibc, which also has a .data section which gets mapped into your process's address space. It's a private mapping, but I think it is mapped, not copied, so it's page-aligned. The rest of the page holding your executable's data segment is padded with zeros. (The ABI may not guarantee this, but Linux has to do something to avoid leaking kernel data).

When your executable is loaded into memory, the data segment is loaded separately from the text segment. See this answer about the difference between sections (which the linker cares about) and executable segments (which the program loader cares about).

Note that gcc puts string constants in the .rodata section, which the linker places in the text segment of the executable along with the .text section. It is read-only, so it can be shared between multiple processes running the same executable. Sections are aligned by default, with padding, so even if you put your strings in .rodata without zero terminators, there would usually be zero padding after the second one.

The exception is when the section's contents happen to end exactly on an alignment boundary (e.g. the length is a multiple of 16, or something): then there is no padding to rely on.

BTW, you can confirm that there weren't any non-printing garbage characters after the string, using strace ./string-test. You can see: write(1, "Adios\n", 6) = 6

.string is a synonym for .asciz. The manual uses different wording to describe how each one processes backslash escape sequences and appends a zero byte, but they do the same thing. The GNU assembler has a lot of synonyms for compatibility with many different Unix vendor-supplied assemblers, so it can take a while to realize there's actually no difference when gcc uses .zero but clang uses .skip, or something like that.

I build the program with...

The commands you used will only work on a 32-bit system. On a 64-bit host, you'd build a 64-bit binary which still uses the 32-bit system call ABI. (And the 32-bit dynamic linker path, so it wouldn't even work by accident, even though static data addresses are in the low 32 bits, so could be passed to the 32-bit wrapper for sys_write.)

Also, I'd recommend calling your source file test.S: a capital S is the usual extension for hand-written asm source. Running gcc -m32 -nostartfiles test.S -o test assembles and links it the same way as you were doing manually.

See this Q&A for the full details on building asm on Linux: Assembling 32-bit binaries on a 64-bit system (GNU toolchain)

See also the x86 tag wiki for lots of interesting links.
