
This question is about x86 assembly but I provide an example in C because I tried to check what GCC was doing.

As I was following various assembly guides, I have noticed that people, at least the few whose materials I have been reading, seem to be in a habit of allocating stack variables closer to rsp than rbp.

I then checked what GCC would do and it seems to be the same.

In the disassembly below, 0x10 bytes are reserved first; then the result of calling leaf goes via eax to rbp-0xc, and the constant 2 goes to rbp-0x8, leaving room between rbp-0x8 and rbp for the variable "q".

I could imagine doing it in the other direction, first assigning to an address at rbp and then at rbp-0x4, i.e. doing it in the direction of rbp to rsp, then leaving some space between rbp-0x8 and rsp for "q".

What I am not sure about is whether what I am observing is how things should be, due to some architectural constraint that I had better be aware of and adhere to, or whether it is purely an artifact of this particular implementation and a habit of the people whose code I have been reading, to which I should not assign any significance; e.g. it needs to be done in one direction or the other, and it does not matter which one as long as it is consistent.

Or perhaps I am just reading and writing trivial code for now and this will go both ways as I get to something more substantial in some time?

I would just like to know how I should go about it in my own assembly code.

All of this is on Linux 64-bit, GCC version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04). Thanks.

00000000000005fa <leaf>:
 5fa:   55                      push   rbp
 5fb:   48 89 e5                mov    rbp,rsp
 5fe:   b8 01 00 00 00          mov    eax,0x1
 603:   5d                      pop    rbp
 604:   c3                      ret    

0000000000000605 <myfunc>:
 605:   55                      push   rbp
 606:   48 89 e5                mov    rbp,rsp
 609:   48 83 ec 10             sub    rsp,0x10
 60d:   b8 00 00 00 00          mov    eax,0x0
 612:   e8 e3 ff ff ff          call   5fa <leaf>
 617:   89 45 f4                mov    DWORD PTR [rbp-0xc],eax   ; // <--- This line
 61a:   c7 45 f8 02 00 00 00    mov    DWORD PTR [rbp-0x8],0x2   ; // <--  And this too
 621:   8b 55 f4                mov    edx,DWORD PTR [rbp-0xc]
 624:   8b 45 f8                mov    eax,DWORD PTR [rbp-0x8]
 627:   01 d0                   add    eax,edx
 629:   89 45 fc                mov    DWORD PTR [rbp-0x4],eax
 62c:   8b 45 fc                mov    eax,DWORD PTR [rbp-0x4]
 62f:   c9                      leave  
 630:   c3                      ret 

Here is the C code:

int leaf() {
   return 1;
}

int myfunc() {
   int x = leaf(); // <--- This line
   int y = 2;      // <--  And this too
   int q = x + y;
   return q;
}

int main(int argc, char *argv[]) {
   return myfunc();
}

How I compile it:

gcc -O0 main.c -o main.bin

How I disassemble it:

objdump -d -j .text -M intel main.bin
  • Yes, you can use your locals however you like. If you allocated more space due to alignment, you may put the padding anywhere. PS: you are looking at unoptimized code, which is generally a bad idea. – Jester Sep 09 '20 at 18:05
  • @Jester Thanks but I am not sure what is bad in looking at unoptimised code in this context? On -O2 gcc emits "nop WORD PTR cs:[rax+rax*1+0x0]", not using stack variables at all, which is not really what I was trying to exhibit. I broadly understand the difference between optimisation levels but I am not clear what is the caveat in this example with using -O0 given the fact that I am writing assembly and C was just an additional asset? –  Sep 09 '20 at 18:10
  • You claim that GCC leaves some space below the saved RBP, but actually `dword [rbp-0x4]` is used. (For `q` it looks like.) – Peter Cordes Sep 09 '20 at 18:12
  • That was not a "claim" :-) It did not seem relevant to x and y simply. But you are right that I forgot about q and it may seem that I meant it that unused space was left around. I will edit to make it more clearer that it is for q. Thanks. –  Sep 09 '20 at 18:15
  • `-O0` means compile quickly without trying to optimize (including not trying to optimize stack-frame layout). So if you're hoping to learn anything about how to lay out locals, it's not a good start. (But like Jester said, it doesn't matter how you lay them out, except maybe grouping them so you can init two at once with a single qword store). – Peter Cordes Sep 09 '20 at 18:17
  • `-O0` also mean make sure every C object has a memory address for consistent debugging. As you can see from optimized output, there is no need to store or reload anything to stack memory, especially when inlining `leaf`. That *is* what you should try to do when writing by hand. You can use `volatile` locals if you really want to see stack allocation in an otherwise optimized build. Or design test-cases where the compiler can't optimize it away, like passing the address of a local to a non-inline function. (e.g. only give a prototype, not a definition). – Peter Cordes Sep 09 '20 at 18:18
  • I am really doing it in assembly and learn by reading other people's assembly code. The example in C was only to check what GCC would do, I am not using -O0 to learn how to write optimised code. I used it only because otherwise GCC would emit completely different machine code which would not serve as an illustration to my question. –  Sep 09 '20 at 18:20
  • I.e. this is not about checking the output of what GCC produces on `-O0` that much. I was just looking at other people's assembly code and they were using this convention, this order of allocations. Then GCC did the same and I thought "well, maybe there is a rule about it that I should learn and keep using for such and such reasons", which led me to posting this question. –  Sep 09 '20 at 18:35
  • Don't panic about 1 random drive-by downvote. It doesn't surprise me for a question this long where it's somewhat hard to see exactly what you're asking. Or maybe someone was just in a bad mood and hit the downvote button. – Peter Cordes Sep 09 '20 at 19:27

1 Answer


It makes zero difference; do whichever you want for local variables that have to exist at all (because you can't optimize them into registers).


There is zero significance to what GCC is doing; it doesn't matter where the unused gap is (which exists because of stack alignment). In this case it's the 4 bytes at [rsp], aka [rbp - 0x10].
The 4 bytes at [rbp - 4] are used for q.

Also, you didn't tell GCC to optimize, so there's no reason to expect its choices to even be optimal or a useful guide to learn from. -O3 with volatile int locals would make more sense. (But since there's nothing significant going on, still not actually helpful.)


The things that matter:

  • Local vars should be naturally aligned (dword values at least 4-byte aligned). The C ABI requires this: alignof(int) = 4. RSP before a call will be 16-byte aligned, so on function entry RSP-8 is 16-byte aligned.

  • Code size: As many as possible of your addressing modes can use small (signed 8-bit) displacements (see footnote 1) from RBP (or RSP if you address your locals relative to RSP like gcc -fomit-frame-pointer).

    This is trivially the case when you only have a few scalar locals, nowhere near 128 bytes of them.

  • Any locals you can operate on together are adjacent, and preferably not crossing an alignment boundary, so you can most efficiently init them both / all with one qword or XMM store.

    If you have a lot of locals (or an array), group them for spatial locality if there's one whole cache line that might be "cold" while this function (and its children) are running.

  • Spatial locality: variables you use earlier in your function should be higher in the stack frame (closer to the return address which was stored by the call to this function). The stack is typically hot in cache, but touching a new cache line of stack memory as it grows will be slightly less of an impact if it's done after earlier loads/stores. Out-of-order exec can hopefully get to those later store instructions soon and get that cache-miss store into the pipeline to start an RFO (read for ownership) early, minimizing time spent with earlier loads clogging up the store buffer.

    This only matters across boundaries wider than 16 bytes; you know everything within one 16-byte aligned chunk is in the same cache line.

    A descending access pattern within one cache line might possibly trigger prefetch of the next cache line downward, but I'm not sure if that happens in real CPUs. If so, that might be a reason not to do this, and to favour storing first to the bottom of your stack frame (at RSP, or the lowest red-zone address you'll actually use).

If there's unused space for stack alignment before another call, it's usually only 8 bytes at most. That's much smaller than a cache line and thus doesn't have any significant impact on spatial locality of your local variables. You know the stack pointer's alignment relative to a 16-byte boundary, so the choice of leaving padding at the top or bottom of your stack frame never makes the difference between potentially touching a new cache line or not.

If you're passing pointers to your locals to different threads, beware false sharing: probably separate those locals by at least 64 bytes so they'll be in different cache lines, or even better by 128 bytes (L2 spatial prefetcher can create "destructive interference" between adjacent cache lines).


Footnote 1: x86 sign-extended 8-bit vs. sign-extended 32-bit displacements in addressing modes like [rsp + disp8] are why the x86-64 System V ABI chose a 128-byte red zone below RSP: it gives at most a ~256-byte area that can be accessed with more compact code size, including the red zone plus reserved space above RSP.
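
For a concrete feel of the size difference, here are two encodings of the same load at offsets straddling the disp8 limit (bytes hand-assembled from the standard ModRM encoding; the offsets are arbitrary examples):

```asm
8b 45 88                mov    eax,DWORD PTR [rbp-0x78]   ; disp8 fits in one byte: 3-byte instruction
8b 85 7c ff ff ff       mov    eax,DWORD PTR [rbp-0x84]   ; disp32 needed: 6-byte instruction
```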


PS:

Note that you don't have to use the same memory location for the same high-level "variable" at every point in your function. You could spill/reload something to one location in one part of a function, and another location later in the function. IDK why you would, but if you have wasted space for alignment it's something you could do. Possibly if you expect one cache line to be hot early on (e.g. near the top of the stack frame on function entry), and another cache line to be hot later (near some other vars that were being used heavily).

A "variable" is a high-level concept you can implement however you like. This isn't C, there's no requirement that it have an address, or have the same address. (C compilers in practice will optimize variables into registers if the address isn't taken, or doesn't escape the function after inlining.)

This is kind of off-topic or at least a pedantic diversion; normally you do simply use the same memory location for the same thing consistently, when it can't be in a register.

Peter Cordes
  • I am not expecting GCC on level `-O0` to optimise anything, honestly, this question was not about C or GCC, I just had to illustrate somehow the behaviour whose significance I was not sure about and GCC was just an aside. I also chose a leaf function precisely to avoid the red zone, I am already aware of its existence, this is good. Your answer is, as always, very comprehensive and I am thankful for that. If you could just please state directly that the order does not matter (from rsp to rbp or the other way around), which was the core of my query, I will be most happy to accept it. –  Sep 09 '20 at 18:50
  • In reference to your edit, I understand that this is not a variable in the sense used by higher level languages. What would you recommend I use instead to make myself more understandable by professional assembly programmers here on SO or elsewhere? Just "memory location" or "stack location"? Assembly is not a full-time occupation for me and I just do not know. –  Sep 09 '20 at 18:58
  • @Terry: About 80% of your question is taken up by discussion of GCC and that example, I thought it was reasonable to spend maybe 25% of my answer on that part of your question and what better methodology would have been to create a better example. – Peter Cordes Sep 09 '20 at 19:05
  • @Terry: Re: direct answer: Maybe you missed the first edit within the 5 minute "grace period" that added the first paragraph "It makes zero difference, do whichever you want". The answer is really that simple unless you want to make it complicated for minor performance concerns (but that's why you'd hand-write asm in 2020 in the first place so that's what the entire rest of the answer is about). – Peter Cordes Sep 09 '20 at 19:05
  • Thanks, yes, I missed the first edit. As to the rest, I have many years of experience with high-level languages and I am learning now how things work on the lower levels so all of your elucidating answers are thoroughly read, links are followed and each answer or comment gives me hours and hours of new learning experience. Yourself and other regulars here on `x86` or `assembly` should write a book one day, unless you already have? –  Sep 09 '20 at 19:11
  • Re: GCC, I get it - I just did not realise that it would be construed to be core point of the question. But it is all fine, thanks for pointing out how it looked like. –  Sep 09 '20 at 19:15
  • @Terry: I don't know if I have the patience to collect everything into one coherent book. A few years ago I got an email from someone asking if I'd write or contribute to an asm book but I never replied. I feel like my collection of SO answers is something I'm happier with, since I can edit them to correct CPU-architecture misunderstandings when I later learn CPUs didn't work exactly the way I thought. (Although an index to the more useful ones would be good.) Agner Fog has already written a nearly book-length x86 optimization guide (https://agner.org/optimize/) which is quite good. – Peter Cordes Sep 09 '20 at 19:16
  • Agner Fog's materials are great, I know. But there still is a dearth of books about low-level programming for experienced people. –  Sep 09 '20 at 19:20
  • Alright, @Peter Cordes, I might have overdone it with the edit about "haxors" but here I am, trying meticulously to post something constructive and given a downvote for who knows what, really. But I understand this was too much; I will refrain from doing it in the future. –  Sep 09 '20 at 19:23
  • @Terry: If you re-read your question yourself, I hope you can see how it's easy for someone skimming your question to think you're asking "why is GCC doing it this way?". I had to read carefully to see that wasn't the case, finding the couple places where you were talking about something other than GCC output. Also, the GCC example is the only real detail / code in what you posted. Even for people that did understand your real question, there's little to say about it (Jester answered it in one comment) but lots to say about your GCC example, which is what Jester and I were doing. – Peter Cordes Sep 09 '20 at 19:24
  • I understand it now, yes, looking at it from your perspective, I can see what you mean - I will keep it in mind when posting things in the future. –  Sep 09 '20 at 19:26
  • I re-read your answer a couple of times and there is one part unclear to me. In the third bullet you mention a lot of locals or an array to be grouped for spatial locality "if there's one whole cache line that might be cold while this function or its children are running." Could you perhaps expand it with an example or maybe somehow reword it? I am not sure what it means to have just one whole cache line cold while the function is running. –  Sep 10 '20 at 08:48
  • @Terry: Say you have `char buf[128]` that you only use once at the start of your function. If you put that at the top of your stack frame, it spans 3 or maybe just 2 cache lines, and the top 2 of those will be untouched (along with your return address) until the function returns. But if you have another local above the array and you use that scalar local in a loop, both ends of the array might have to stay hot in cache, not getting evicted from L1d to make room for something more valuable. – Peter Cordes Sep 10 '20 at 09:26