12

I have the following C program:

int main()
{
    int c[10] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2};
    return c[0];
}

and when I compile it with gcc using the -S option I get the following assembly:

    .file   "array.c"
    .text
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $0, -48(%rbp)
    movl    $0, -44(%rbp)
    movl    $0, -40(%rbp)
    movl    $0, -36(%rbp)
    movl    $0, -32(%rbp)
    movl    $0, -28(%rbp)
    movl    $0, -24(%rbp)
    movl    $0, -20(%rbp)
    movl    $1, -16(%rbp)
    movl    $2, -12(%rbp)
    movl    -48(%rbp), %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (GNU) 4.4.5 20110214 (Red Hat 4.4.5-6)"
    .section        .note.GNU-stack,"",@progbits

What I do not understand is why the earlier array elements are further from the base pointer. It almost seems as if the elements of the array are being placed in the opposite order.

Also, why does gcc not use push instead of movl to push the array elements onto the stack?


DIFFERENT VIEW

Moving the array out of main and making it a static variable at file scope, I get:

    .file   "array.c"
    .data
    .align 32
    .type   c, @object
    .size   c, 40
c:
    .long   0
    .long   0
    .long   0
    .long   0
    .long   0
    .long   0
    .long   0
    .long   0
    .long   1
    .long   2
    .text
.globl main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    c(%rip), %eax
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (GNU) 4.4.5 20110214 (Red Hat 4.4.5-6)"
    .section    .note.GNU-stack,"",@progbits

Using the following C program:

static int c[10] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2};

int main() 
{
    return c[0];
}

This doesn't give more insight into the stack, but it is interesting to see the different assembly output produced by slightly different semantics.

Matthew Hoggan
  • That is code generated with the gcc command; oftentimes the compiler generates different code that does the same thing. Basically, there are multiple solutions to a problem. Did you optimize with -o when compiling? It could be because of that (or the lack of it). – swiftcode Dec 12 '11 at 17:50
  • This is actually code generated by the compiler, and it cannot be different from the code generated by the compiler. The optimization flag is `-O` and not `-o`, and yes, Matthew used `-Os` as he clearly stated in his question. – Dec 12 '11 at 17:52
  • He said `-S` (generate assembly as text), not `-Os` (optimize for size). This assembly appears to have been generated *without* optimization; at any level of `-O` I see the array stores vanish entirely. – zwol Dec 12 '11 at 18:04

3 Answers

9

First of all, the x86 stack grows downwards. By convention, rbp stores the original value of rsp. Therefore, the function's arguments reside at positive offsets relative to rbp, and its automatic variables reside at negative offsets. The first element of an automatic array has a lower address than all other elements, and thus is the furthest away from rbp.

Here is a handy diagram that appears on this page:

[stack layout diagram]
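
To see the same thing from C, here is a minimal sketch (my addition, not part of the original answer) that prints the addresses of the first and last elements; on a typical x86-64 build &c[0] comes out lowest, i.e. furthest below the saved rbp:

    #include <stdio.h>

    int main(void)
    {
        int c[10] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2};

        /* &c[0] prints as the lowest address and &c[9] as the highest;
           both sit below the saved rbp inside main's stack frame. */
        printf("&c[0] = %p\n", (void *)&c[0]);
        printf("&c[9] = %p\n", (void *)&c[9]);

        return c[0];
    }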

I see no reason why the compiler couldn't use a series of push instructions to initialize your array. Whether this would be a good idea, I am not sure.

NPE
  • This diagram would normally be flipped upside down no? Since high memory can be thought of as being at the top of the chip? – Matthew Hoggan Dec 12 '11 at 18:10
  • 1
    @MatthewHoggan: Perhaps. In terms of clarity, I personally don't have a strong preference either way. – NPE Dec 12 '11 at 18:12
  • @MatthewHoggan Maps of a region of memory are invariably drawn with higher addresses toward the top of the page. Diagrams of data structures, network packets, and the like, however, are often drawn with larger offsets toward the bottom of the page. – zwol Dec 12 '11 at 18:22
  • @Zack: A long time ago, I read that little-endian ordering favours graphical diagrams as the above, while big-endian would favor left-to-right top-to-bottom representations. – ninjalj Dec 12 '11 at 20:35
  • @ninjalj IME endianness sometimes affects whether people draw high *bits* within a word on the left or the right, but I have never seen it change whether high addresses are on top or on the bottom. – zwol Dec 12 '11 at 20:45
  • @Zack: the idea is that if you have e.g: a DWORD split in two parts in a WORD-sized diagram, the contents read OK. – ninjalj Dec 12 '11 at 20:54
3

Also, why does gcc not use push instead of movl to push the array elements onto the stack?

It is quite rare to have a large initialized array in exactly the right place in the stack frame that you could use a sequence of pushes, so gcc has not been taught to do that. (In more detail: array initialization is handled as a block memory copy, which is emitted as either a sequence of move instructions or a call to memcpy, depending on how big it would be. The code that decides what to emit doesn't know where in memory the block is going, so it doesn't know if it could use push instead.)
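
As a rough illustration of that block-copy path (my sketch, not part of the original answer; the exact output depends on the gcc version, target, and optimization level), a larger initializer like the one below is usually not expanded as one movl per element. Compiling it with gcc -S and comparing against the ten-element array shows the difference:

    /* Hypothetical example: a fully initialized local array that gcc
       will typically fill with a block copy from a read-only template
       in .rodata (or a memcpy call for bigger arrays) rather than with
       a separate store per element. */
    int pick(int i)
    {
        int t[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                     9, 10, 11, 12, 13, 14, 15, 16};
        return t[i & 15];
    }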

Also, movl is faster. Specifically, push does an implicit read-modify-write of %esp, and therefore a sequence of pushes must execute in order. movl to independent addresses, by contrast, can execute in parallel. So by using a sequence of movls rather than pushes, gcc offers the CPU more instruction-level parallelism to take advantage of.

Note that if I compile your code with any level of optimization activated, the array vanishes altogether! Here's -O1 (this is the result of running objdump -dr on an object file, rather than -S output, so you can see the actual machine code):

0000000000000000 <main>:
   0:   b8 00 00 00 00          mov    $0x0,%eax
   5:   c3                      retq   

and -Os:

0000000000000000 <main>:
   0:   31 c0                   xor    %eax,%eax
   2:   c3                      retq   

Doing nothing is always faster than doing something. Clearing a register with xor is two bytes instead of five, but has a formal data dependence on the old contents of the register and modifies the condition codes, so might be slower and is thus only chosen when optimizing for size.
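
For contrast, here is a hedged sketch (mine, not part of the original answer; use_array is a made-up external function) where the array escapes, so the initialization cannot be deleted. Compiling it with -O2 or -O3 and disassembling shows the stores surviving; as noted in the comments below, recent gcc versions emit a couple of 16-byte stores plus one 8-byte store instead of ten movls:

    /* Hypothetical example: use_array is defined in another translation
       unit, so gcc must materialize the array before the call and
       cannot optimize the initialization away. */
    void use_array(const int *p, int n);

    int demo(void)
    {
        int c[10] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 2};
        use_array(c, 10);
        return c[0];
    }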

zwol
  • mmm, I would expect xor'ing a register with itself to be special-cased, but yeah, it _might_ be slower. – ninjalj Dec 12 '11 at 20:37
  • 1
    I don't have any x86 optimization guides to hand, but my recollection is that some models do special case that operation and some don't. If you don't tell it otherwise, GCC tries to generate code that is reasonably well tuned on a wide range of CPUs, although not perfectly tuned for any specific one. – zwol Dec 12 '11 at 20:42
  • I would suppose that it is not taught to do so *because* it would be slower. It would be quite easy to make a difference here, but as it would be inefficient, it is not done. – glglgl Dec 12 '11 at 20:51
  • `push` isn't slower anymore; Intel since Pentium-M (and AMD since a similar time) [have a "stack engine" that breaks the dependency chain through RSP](https://stackoverflow.com/questions/36631576/what-is-the-stack-engine-in-the-sandybridge-microarchitecture) and make push/pop single-uop instructions. It would be a win over mov-zero ([What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?](//stackoverflow.com/q/49485395)), also for code size (push imm8 vs. REX movq imm32), but pxor-zero / movdqa 16-byte stores are even better here. – Peter Cordes Dec 16 '21 at 05:56
  • `xor`-zeroing has been special-cased since before(?) PPro ([What is the best way to set a register to zero in x86 assembly: xor, mov or and?](//stackoverflow.com/q/33666617)), although it actually did still have a false dependency up through PIII. The reason you don't get xor-zeroing with `-O1` is that `-fpeephole2` is only enabled at `-O2` and higher. `-O1` is partial optimization; `-Os` is full optimization for size (and speed); `-O3` is full optimization for speed. (I'm sure you figured out most of these things in the 10 years since you posted, but this answer has some misinformation :/) – Peter Cordes Dec 16 '21 at 05:57
  • Also `push` would push two array elements per instruction, vs. GCC's unoptimized asm naively storing 4 bytes at a time. Modern GCC -O0 at least uses `movq`, while `-O3` uses 2x `movaps` + `movq`-immediate. (If you pass the array to a caller so the zeroing can't optimize away). https://godbolt.org/z/nfs9sfsj6 – Peter Cordes Dec 16 '21 at 06:01
2

Keep in mind that on x86 the stack grows downward. Pushing onto the stack will subtract from the stack pointer.

%rbp <-- Highest memory address
-12   c[9]
-16   c[8]
-20   c[7]
-24   c[6]
-28   c[5]
-32   c[4]
-36   c[3]
-40   c[2]
-44   c[1]
-48   c[0] <-- Address of array
Dark Falcon