Given this simple C program:

#define _XOPEN_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <alloca.h>

int main(){
    char *buf = alloca(600);
    snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
    puts(buf);
}

Compiling it with $ cc -S -fverbose-asm a.c generates:

    .text
    .section    .rodata
.LC0:
    .string "hi!, %d, %d, %d\n"
    .text
    .globl  main
    .type   main, @function
main:
    pushq   %rbp    #
    movq    %rsp, %rbp  #,
    subq    $16, %rsp   #,
# a.c:7:    char *buf = alloca(600);
    movl    $16, %eax   #, tmp102
    subq    $1, %rax    #, tmp89
    addq    $608, %rax  #, tmp90
    movl    $16, %ecx   #, tmp103
    movl    $0, %edx    #, tmp93
    divq    %rcx    # tmp103
    imulq   $16, %rax, %rax #, tmp92, tmp94
    subq    %rax, %rsp  # tmp94,
    movq    %rsp, %rax  #, tmp95
    addq    $15, %rax   #, tmp96
    shrq    $4, %rax    #, tmp97
    salq    $4, %rax    #, tmp98
    movq    %rax, -8(%rbp)  # tmp98, buf
# a.c:8:    snprintf(buf,600,"hi!, %d, %d, %d\n", 1,2,3);
   ...

On what basis does gcc number those temporary variables (tmp102, tmp89, tmp90, ...)?

Also, can someone explain why the alloca code works on %rax (addq $608, %rax) instead of adjusting %rsp directly (subq $608, %rsp)? Adjusting the stack pointer is what alloca is for, according to the man page: The alloca() function allocates size bytes of space in the stack frame of the caller.

Peter Cordes
autistic456
  • You are compiling without optimisations. Certainly, the code is not going to be optimised. See how `rax` is used to hold the number of elements, then multiplied with 16 to hold the number of bytes and then subtracted from `rsp` to actually allocate the data. The numbers of the temporary variables are essentially random and come from whatever number they have in the intermediate representation. – fuz Jun 09 '20 at 20:55
  • Holy crap, `gcc -O0` asm for alloca is hilariously inefficient! It seems to be using `x / 16 * 16` instead of `x & -16` as part of rounding the allocation size up to a multiple of 16 to maintain stack alignment. I guess the canned sequence of GIMPLE code for the builtin function was written that way for some reason, and at `-O0` GCC didn't do constant propagation through it. But anyway, that's why it's using RAX. – Peter Cordes Jun 09 '20 at 20:55
  • @PeterCordes I hardly understood a single word. Can you please elaborate on what gcc does badly without optimizations (I will try to understand), what the rounding operation is here, and how it should use `x & -16`? A full answer would be better. – autistic456 Jun 09 '20 at 21:00
  • @fuz what do you mean by `intermediate representation`? How can variables have an intermediate representation when the majority of them are immediates? – autistic456 Jun 09 '20 at 21:02
  • For anyone trying to answer: the best approach would be to explain, step by step, each manipulation of `%rax` that eventually ends with `subq %rax, %rsp`, so I can see every operation gcc performs to allocate that space on the stack. – autistic456 Jun 09 '20 at 21:05
  • https://en.wikipedia.org/wiki/Intermediate_representation – Nate Eldredge Jun 09 '20 at 21:07
  • @NateEldredge, but that is just surmise, just speculation, that the number comes from the "intermediate representation". It could be because of the size of the variable, the value of the variable, or another fact. You can link to Wikipedia on the general concept of "intermediate representation", but that is not an answer to my question – autistic456 Jun 09 '20 at 21:12
  • I know it does not answer your question; that is why I posted it as a comment and not as an answer. I simply wanted to help you find some more information about the term "intermediate representation" that Peter used. – Nate Eldredge Jun 09 '20 at 21:12
  • To see non-terrible asm, look at what happens with optimization enabled. https://godbolt.org/z/JbdKQ_. It still does some rounding to ensure stack alignment before a function call, partly because 600 isn't a multiple of 16, but it does it at runtime instead of optimizing it into code like it would make for `char buf[600]`. (Which you should compare with) – Peter Cordes Jun 09 '20 at 21:12
  • @PeterCordes is the reason for `subq $616, %rsp` rather than `$600` the 16-byte alignment required by the ABI? – autistic456 Jun 09 '20 at 21:19

1 Answer

How can variables have an intermediate representation when the majority of them are immediates?

In an SSA (Static Single Assignment) internal representation of the program logic (like GCC's GIMPLE), every temporary value has a separate name. I'd assume the numbers come from auto-numbered SSA variables when there isn't a C variable name directly associated. But I'm not familiar enough with GCC internals to give any more details. If you're really curious, you could always look through the GCC source code yourself. But I'm fairly confident that auto-numbered SSA variables explain it, and that makes total sense.

Numeric literals don't actually get any name with -fverbose-asm. For example, in the optimized GCC output (from Godbolt) we see this as part of putting args in registers:

...
        movl    $3, %r9d        #,
        movl    $2, %r8d        #,
        xorl    %eax, %eax      #
...

re: alloca: It is eventually offsetting RSP, with subq %rax, %rsp, after rounding the allocation size up to a multiple of 16.

This rounding maintains stack alignment. (Please at least try to google it yourself. When you're missing a lot of background knowledge and concepts, you can't expect answers to fully explain everything from the ground up. When you don't understand the details of something, start by searching on technical terms that get used.)

BTW, that's amazingly inefficient asm from gcc -O0! It seems to be using x / 16 * 16 instead of x & 0xFFFF...F0 as part of rounding the allocation size up to a multiple of 16. (If you single-step with a debugger, you can see that the div and imul sequence is doing that.)

I guess the canned sequence of logic for the builtin function was written that way for some reason, and at -O0 GCC didn't do constant propagation through it. But anyway, that's why it's using RAX.

Perhaps the alloca logic is written in GIMPLE, or maybe in RTL code that doesn't get expanded until after some transformation passes. That would explain why it's optimized so poorly even though it's all part of a single statement. gcc -O0 is expected to produce slow code, but a 64-bit div to divide by 16 is especially bad compared to a very cheap AND with an immediate operand. It's also very strange to see a multiply by a power of 2 as an immediate operand in asm; in normal cases the compiler would optimize that into a shift.
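Since the question asked for a step-by-step walk through the RAX manipulation, here is a minimal C sketch that mirrors the -O0 sequence one instruction at a time. The function names are invented for illustration, and the constant 608 is copied from the asm as-is rather than derived; treat this as a reading aid, not as what GCC does internally.

```c
#include <assert.h>
#include <stdint.h>

/* One C statement per instruction of the -O0 size computation in the
   question. Names are hypothetical; 608 is taken verbatim from the asm. */
uint64_t alloca_size_O0(void)
{
    uint64_t rax = 16;   /* movl  $16, %eax                      */
    rax -= 1;            /* subq  $1, %rax          -> 15        */
    rax += 608;          /* addq  $608, %rax        -> 623       */
    rax /= 16;           /* divq  %rcx (RCX=16, RDX=0) -> 38     */
    rax *= 16;           /* imulq $16, %rax, %rax   -> 608       */
    return rax;          /* subq  %rax, %rsp subtracts this much */
}

/* The pointer stored in buf: addq $15 / shrq $4 / salq $4 on a copy of RSP. */
uint64_t buf_from_rsp_O0(uint64_t rsp)
{
    uint64_t rax = rsp;  /* movq %rsp, %rax */
    rax += 15;           /* addq $15, %rax  */
    rax >>= 4;           /* shrq $4, %rax   */
    rax <<= 4;           /* salq $4, %rax   */
    return rax;
}

/* The cheap equivalent of x / 16 * 16 for unsigned x: clear the low 4 bits. */
uint64_t round_down_16(uint64_t x)
{
    return x & ~(uint64_t)15;
}

/* Self-check: the div/imul result matches the single AND. */
void check_O0_rounding(void)
{
    assert(alloca_size_O0() == 608);
    assert(round_down_16(623) == 623 / 16 * 16);
}
```

So RAX only holds the rounded size (and later the aligned pointer); RSP is still the register that actually gets moved, by `subq %rax, %rsp`.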

To see non-terrible asm, look at what happens with optimization enabled, e.g. on Godbolt. (See also: How to remove "noise" from GCC/clang assembly output?) With optimization it just does sub $616, %rsp. But then it wastes instructions at runtime aligning a pointer into that space (to guarantee the space will be 16-byte aligned), even though RSP's alignment is statically known after that.

# GCC10.1 -O3 -fverbose-asm with alloca
...
        subq    $616, %rsp           # reserve 600 + 16 bytes
        leaq    15(%rsp), %r12
        andq    $-16, %r12           # get a 16-byte aligned pointer into it
        movq    %r12, %rdi           # save the pointer for later instead of recalc before next call
        call    snprintf        #

Silly compiler, the alignment of %rsp is statically known at that point, no (x+15) & -16 needed. Note that -16 = 0xFFFFFFFFFFFFFFF0 in 64-bit 2's complement, so it's a handy way to express AND masks that clear some low bits.
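As a small illustration of the mask, the following C sketch models the lea/and pair on plain integers. The function names and the sample addresses are made up for this example; it only demonstrates the arithmetic, not anything about GCC's code generation.

```c
#include <assert.h>
#include <stdint.h>

/* Models: leaq 15(%rsp), %r12 ; andq $-16, %r12
   Rounds addr up to the next multiple of 16. Name is hypothetical. */
uint64_t align_up_16(uint64_t addr)
{
    return (addr + 15) & (uint64_t)-16;
}

void check_align_mask(void)
{
    /* -16 as a 64-bit two's complement pattern: all ones except the low 4 bits. */
    assert((uint64_t)-16 == UINT64_C(0xFFFFFFFFFFFFFFF0));

    /* An already 16-byte-aligned address comes back unchanged, which is
       why the add-and-mask is redundant when alignment is known. */
    assert(align_up_16(0x1000) == 0x1000);

    /* A misaligned address is bumped to the next 16-byte boundary. */
    assert(align_up_16(0x1001) == 0x1010);
}
```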

Removing alloca and using a plain local array gives even simpler code:

# GCC10.1 -O3 with char buf[600]
        subq    $616, %rsp
...
        movq    %rsp, %rdi
...
        call    snprintf        #
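For reference, a minimal sketch of the source change that comparison assumes: the question's main() body with a fixed-size local array in place of alloca(600). The function name print_greeting is invented here so the snippet can be called and checked; the format string and arguments are the ones from the question.

```c
#include <assert.h>
#include <stdio.h>

/* The question's program body with char buf[600] replacing alloca(600).
   Returns the character count reported by snprintf. Name is hypothetical. */
int print_greeting(void)
{
    char buf[600];   /* size is a compile-time constant, so no runtime rounding is needed */
    int n = snprintf(buf, sizeof buf, "hi!, %d, %d, %d\n", 1, 2, 3);

    /* snprintf returns the length it wanted to write; check nothing was truncated. */
    assert(n >= 0 && (size_t)n < sizeof buf);

    puts(buf);
    return n;
}
```

Calling print_greeting() from main() reproduces the original program's output.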
Peter Cordes
  • Related: [What do the gcc assembly output labels signify?](https://stackoverflow.com/q/9799676) re: label names for branch targets within functions. – Peter Cordes Jun 09 '20 at 22:11