gcc assembly string representation

Question

char a[] = "abc";   // movl    $6513249, -12(%rbp)
char ab[] = "ab";   // movw    $25185, -15(%rbp)
char abc[] = "a";   // movw    $97, -17(%rbp)

The C code above is represented in assembly (gcc -S code.c) as:

movl    $6513249, -12(%rbp)
movw    $25185, -15(%rbp)
movw    $97, -17(%rbp)

97 is 'a' in decimal, but why is "ab" 25185 and "abc" 6513249?

I think you omitted `movb $0, -13(%rbp)`, didn't you? It is part of the second string, which is 3 bytes long. — prl, Oct 25 '17 at 21:34

Jean-François Fabre · Accepted Answer · 2017-10-26T07:00:59.423

5

Let's take the hex value of the 32-bit integer on the first line:

>>> hex(6513249)
'0x636261'

which is the byte sequence 0x63 0x62 0x61, i.e. "cba" in ASCII.

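As the processor is little-endian, the least significant byte (0x61, 'a') is stored at the lowest address, so memory ends up holding a, b, c in that order. It's just an optimized way to initialize a small string with a single 32-bit move instead of a tedious byte-by-byte copy.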

The nul-termination is not handled by every one of these stores. The 32-bit value 6513249 is 0x00636261, so its top byte supplies the terminator for "abc", and movw $97 writes 0x0061, i.e. 'a' followed by nul. But movw $25185, -15(%rbp) only sets a and b and doesn't nul-terminate; that is done elsewhere, in code you're not showing. Note that there's room for the nul-termination byte: the first string is at offset -12 and the second string is at offset -15, which makes it 3 bytes long.

edited Oct 26 '17 at 07:00

answered Oct 25 '17 at 21:06

Jean-François Fabre


I also found this link useful: https://stackoverflow.com/questions/19323806/copying-a-4-element-character-array-into-an-integer-in-c – Leandro Oct 25 '17 at 21:10
yeah on a big endian architecture, the values would have been different. Some intel assemblers like MASM use big endian for strings (since it's manual, it's easier _not to have to swap_), which means that the assembled code has the bytes swapped compared to the asm source. Destabilizing... – Jean-François Fabre Oct 25 '17 at 21:12

The buffer is automatic, so it is not already zeroed, but 6513249 as a long is 0x00636261, so the null terminator is there. – prl Oct 25 '17 at 21:31
@prl true for that part, but what about the `ab` string where the compiler uses a 16-bit move? – Jean-François Fabre Oct 26 '17 at 05:01

The OP didn't show the surrounding code. gcc uses a separate instruction to store the zero instead of overlapping stores or leaving padding out to 4B. https://godbolt.org/g/wJir9S. It's a missed optimization: `push imm32` or `mov r/m64, imm32` could store the 1B and 2B strings with explicit `0` after one and implicit `0` after the other (from sign-extension to 64 bits, because the last character is below 127). The 3-byte string could overlap the last 3 bytes of that store with another 4B store. The stack is 16B-aligned, so you could do this with no chance of crossing a cache line. – Peter Cordes Oct 26 '17 at 06:58
@PeterCordes I'll edit the zero buffer part. I could only assume here (and it was an extension to the answer to the question: damn this godbolt site is cool!) – Jean-François Fabre Oct 26 '17 at 06:59

Reported as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729. Silly compilers. (clang didn't do any better, and icc is worse: it copies from string literals even for tiny strings.) – Peter Cordes Oct 26 '17 at 08:16
