14
int read_val();
long read_and_process(int n) {
    long vals[n];
    for (int i = 0; i < n; i++)
        vals[i] = read_val();
    return vals[n-1];
}

The assembly language code compiled by x86-64 GCC 5.4 is:

read_and_process(int):
        pushq   %rbp
        movslq  %edi, %rax
>>>     leaq    22(,%rax,8), %rax
        movq    %rsp, %rbp
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-16, %rax
        leal    -1(%rdi), %r13d
        subq    %rax, %rsp
        testl   %edi, %edi
        movq    %rsp, %r14
        jle     .L3
        leal    -1(%rdi), %eax
        movq    %rsp, %rbx
        leaq    8(%rsp,%rax,8), %r12
        movq    %rax, %r13
.L4:
        call    read_val()
        cltq
        addq    $8, %rbx
        movq    %rax, -8(%rbx)
        cmpq    %r12, %rbx
        jne     .L4
.L3:
        movslq  %r13d, %r13
        movq    (%r14,%r13,8), %rax
        leaq    -32(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %rbp
        ret

Why is there a need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more natural?

Other assembly language code compiled by x86-64 GCC 11.2 looks almost the same, with the number 22 being replaced by 15. So is the number determined just by random, or because of some reasons?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
lastans7
  • 143
  • 5
  • 2
    You know that [variable-length arrays aren't part of the C++ standard?](https://stackoverflow.com/q/1887097). So the tags `c++` and `variable-length-array` are contradictory. I suggest you retag with `c` language to have better support (C++ programmers hate VLA) – prapin Oct 17 '21 at 15:25
  • 10
    @prapin There's no ban on discussing non-standard extensions. If OP compiles this as C++, then the C++ tag is no less appropriate than C. – HolyBlackCat Oct 17 '21 at 15:27
  • 3
    My guess is, there are 6 bytes of bookkeeping information that need to be in memory before the first element of the array, hence `+6`. Then `+16` and AND with `-16` is a trick to align on 16-byte boundary (`AND -16` clears the 4 lower bits). – Igor Tandetnik Oct 17 '21 at 15:30
  • 3
    Adding 15 makes the most sense, because adding 15 and ANDing with -16 has the effect of rounding up to the next multiple of 16, which would be necessary for alignment. Adding 16 would waste space if `n` is already even. 22 is harder to explain, but one note is 22 = 15 + 7, where 7 is one less than `sizeof(long)`. I would wonder if the compiler tried to align twice, once up to a multiple of 8 (needless) and then again up to a multiple of 16, and naively combined the additions without noticing it was redundant. That could be a minor bug in GCC 5 that was fixed later. – Nate Eldredge Oct 17 '21 at 15:30
  • If you really care you can read the RTL produced by GCC's internal code generation and optimization passes. My guess is that the 15 and 7 might appear separately somewhere in there. – Nate Eldredge Oct 17 '21 at 15:32
  • Does this answer your question? [How does GCC implement variable-length arrays?](https://stackoverflow.com/questions/21182307/how-does-gcc-implement-variable-length-arrays) – Richard Critten Oct 17 '21 at 15:37
  • 4
    In the [unoptimized version](https://godbolt.org/z/6rzxfGrbd) you can see it adding 7, then adding 15, then rounding down to a multiple of 16 (lines 21-28). So the optimized version just merges these operations into one, hence the 22. But adding 7 was unnecessary all along, so maybe that was the bug. – Nate Eldredge Oct 17 '21 at 17:16
  • @NateEldredge That's a feature, not a bug. See my answer to this question. – Soonts Oct 17 '21 at 20:43
  • @NateEldredge Could you explain why adding 15 and anding with -16 has the effect of rounding up to next multiple of 16? I got almost there, but couldn’t fully understood that. ANDing with `0xFF…0` would blow up the lower 4bit so that the bit string would be multiple of 16. But i couldn’t understand why the 15 is added. It would be helpful to write it as an answer, not comment. Thanks. – rosshjb Jan 30 '22 at 05:56
  • 1
    @rosshjb: I plan to come back and write an answer when I have some time. But try some examples. You've already understood why AND with `-16` (which is just another name for `0xfffffffffffffff0`) effectively rounds *down* to a multiple of 16. As to why adding 15 first causes it to round *up*, try for yourself and see what happens when you apply this to inputs like 64, 65, 66, 78, 79 and 80. – Nate Eldredge Jan 30 '22 at 05:59
  • @NateEldredge Great. The addition makes the original bit string’s upper `64-4=60` bits to be changed when it has to be rounded up to next multiple of 16. Otherwise, it retains original multiple of 16 value. By the way, is there a relation in that the `n` is even or odd? You said that “Adding `16` would waste space if `n` is already even” in previous comment. What’s the point the `n` is even or odd number? **Edit)** When the `n` is odd, after ANDing,`8n+15` would be `8n+8`. Otherwise, when the `n` is even, after ANDing, `8n+15` would be `8n`. In either case, the result is multiple of 16. – rosshjb Jan 30 '22 at 07:52
  • 1
    @rosshjb: Suppose the size was set to `(x+16)&-16`. If `n` is even, then `8n` is already a multiple of 16 and so need not be increased at all. But `(8n+16)&-16 = 8n+16` which is unnecessarily large. That's all I meant. For odd `n`, `(8n+16)&-16 == 8n+8 == (8n+15)&-16` so 15 versus 16 doesn't make a difference in that case. – Nate Eldredge Jan 30 '22 at 08:19
  • @NateEldredge Thank you. But… The addition confuses me again. The `x` is already multiple of `8` because it is `8n`. Therefore the `x` can’t be `65`, `66` and etc. So, why the expression can’t be `(8n+8)&-16`? – rosshjb Feb 01 '22 at 03:48
  • 1
    Yes, `8n+8` would work in this case. But 15 has exactly the same effect, has the same cost in execution and code size, and it has the advantage that the compiler can use the same calculation for any array regardless of its element size (e.g. for `char a[n];` then 15 is the only number that would work.) – Nate Eldredge Feb 01 '22 at 05:32
  • @NateEldredge I checked your [answer](https://stackoverflow.com/a/70966058/9304999). *This comment may be somewhat off-topic.* It seems that logic for 16-byte aligning is `ceil(x/16)*16` mathematically. But logic the gcc performs is `floor((x+15)/16)*16`. Is there a special reason to perform the work in the latter way? Yes, both expressions are equivalent — i found the proof in [here](https://math.stackexchange.com/a/3850601/533680). The latter can be performed as clearing low-order bits. Is it possible to perform the work in the former way? – rosshjb Feb 03 '22 at 11:05
  • 1
    @rosshjb: Mathematically, yes, those are equivalent. But the `floor` is effectively built into the very efficient mask operation: `y & -16 == floor(y/16)*16`. This is likewise true for other unsigned integer operations: `y << 4 == floor(y/16)`, and even the (much more expensive) general unsigned integer division truncates toward zero. On the other hand, it is not clear how to directly perform a `ceil` with similar efficiency. By the time you do an integer divide, the `floor` is already done and it's too late to `ceil`. – Nate Eldredge Feb 03 '22 at 14:59
  • 1
    @rosshjb: You could seemingly do it with floating point, but that would not be able to handle all 64-bit integer inputs with full precision. It would also be inefficient, and in general compilers avoid unnecessary floating-point operations to keep support for machines without such hardware. So the short answer to "is it possible to perform the work in a former way" is "not in any way that I know of, except with the +15 trick I mentioned". Indeed the +15 is the standard *way* of doing ceiling division. – Nate Eldredge Feb 03 '22 at 15:02
  • 1
    @rosshjb: Basically, thinking about operations like `floor` and `ceil` is usually unhelpful in integer contexts. You have a fundamentally integer operation, but you're resorting to the real numbers (or at least the rationals) to define it. Computers are great with integers, less good with rationals, and can deal with real numbers only very approximately. – Nate Eldredge Feb 03 '22 at 15:04

2 Answers2

7

Summary: The number is not random, it's part of a calculation that ensures proper stack alignment. The number should be 15, and the 22 is the result of a minor bug in old versions of GCC.


Recall that the x86-64 SysV ABI mandates 16-byte stack alignment; the stack pointer must be a multiple of 16 prior to any call instruction. Hence when we enter read_and_process, the stack pointer is 8 less than a multiple of 16, because the call that got us here pushed 8 bytes. So prior to calling read_val(), the stack pointer must be decremented by 8 more than a multiple of 16, i.e. an odd multiple of 8. The prologue pushes an odd number of registers (five, namely rbp, r14, r13, r12, rbx) at 8 bytes each. So the remaining stack adjustment must be a multiple of 16.

So whatever amount of memory is to be allocated for the array vals, it must be rounded up to a multiple of 16. A standard way to do that is to add 15, then AND with -16: adjusted = (orig + 15) & -16.

Why does that work? -16, thanks to two's complement arithmetic, has the low 4 bits clear and the others set, so AND with -16 results in a multiple of 16 - but since the AND clears low-order bits, the result of x & -16 is less than x; this is rounding down. If we add 15 first (which is, of course, 1 less than 16), the net effect is to round up instead. Adding 15 to orig will cause it to pass a multiple of 16, and then & -16 will round down to that multiple of 16. Unless orig was already a multiple of 16, in which case orig+15 rounds down back to orig itself. So this does the right thing in all cases.

That's what GCC does from 8.1.0 onward. Adding 15 is baked into the same lea that multiplies n by 8, and the AND with -16 comes several lines later.

In this case, since orig = 8*n is already a multiple of 8, there are other values besides 15 that would work just as well; 8, for instance (though not 16, see below). But using 15 is completely equivalent mathematically and in terms of code size and speed, and since 15 works regardless of previous alignment, the compiler authors can just use 15 unconditionally without writing extra code to keep track of what alignment orig may already have.


But adding 22 instead, as older GCC does, is clearly wrong. If orig was already a multiple of 16, say orig = 32, then orig+22 is 54, which rounds down to 48. But 32 bytes was already a perfectly good size so we just wasted 16 bytes for no reason. (Here orig is 8*n so this would occur if the input n is even.) For similar reasons, your suggestion of using 16 instead of 22 would also be wrong.

So the 22 is a bug. It's a pretty minor bug; the resulting code still works just fine and complies with the ABI, and the only ill effect is that sometimes a little bit of stack space is wasted. But it was fixed for GCC 8.1.0 by a commit entitled "Improve alloca alignment". (alloca is an old non-standard function that does dynamic stack allocation, and compiler writers often use the term to refer to any stack allocation.)

Apparently the issue was that some previous pass of the compiler had determined that the size needed to be aligned to (at least) 8 bytes, which would have been accomplished by adding 7 and ANDing with -8 (which might later be optimized away when the compiler realized later that n*8 is already aligned to 8 bytes). Now this constraint should be made redundant when the compiler realizes that 16-byte alignment is actually required, as every multiple of 16 is already a multiple of 8. But the compiler erroneously adds the offsets 7 and 15, when the right thing to do was to take their maximum (which is what the commit implemented). And 7 + 15 is... 22.

If you compile the code using GCC 5.4 with optimizations off, you can see these two operations happening separately:

        lea     rdx, [rax+7]  ; add 7 to rax and write to rdx
        mov     eax, 16
        sub     rax, 1        ; now rax = 15
        add     rax, rdx      ; add 15 to rdx

and with optimizations on, the optimizer combines these into a single add of 22 - without noticing that the add of 7 should not have been there to begin with. In newer versions of GCC with -O0, the lea rdx, [rax+7] is gone.

Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
2

why there need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more naturely.

It does not give the same result. The expression ( ( rax*8 + 22 ) % -16 ) aligns output by 16 bytes.

On 64-bit CPUs, -16 is equivalent to 0xFFFFFFFFFFFFFFF0 When written that way, it’s obvious what the AND instruction is doing: it strips the four least significant bits from the value, and this makes the result aligned by 16 bytes, rounding down. The ( ( rax*8 + 15 ) % -16 ) expression results in the alignment by 16 bytes, rounding up. But the compiler wants eight more bytes of the alignment, because it pushed five values to the stack with five push instructions, and each one is eight bytes.

Your next question is probably going to be “why align by 16 bytes when alignof(long)=8?”

The answer is the preferred-stack-boundary compiler option. The option defaults to 4 in GCC, which means the compiler aligns stack frames by 2^4 = 16 bytes.

Try to compile the same code with -mpreferred-stack-boundary=3 (which, BTW, is the minimum allowed value for AMD64. It requires the alignment to be at least one pointer in size) and see what happens to the assembly.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Soonts
  • 20,079
  • 9
  • 57
  • 130
  • 1
    I disagree. First, the "additional 8 bytes" would only make sense if the additional 8 was added after doing `& -16` (I assume `%` is a typo). As it stands, the resulting value in `rax` is aligned to 16 bytes (even multiple of 8), not 16+8 (odd multiple). Second, having an even multiple of 8 is correct, even though there were 5 pushes; the extra 8 bytes came from the `call` instruction that called this function. Thus when we make a `call`, the stack will be aligned to 16 bytes, just as it was aligned to 16 bytes *before* the `call` that called us. – Nate Eldredge Oct 17 '21 at 20:50
  • @NateEldredge Look at the GCC output when stacks frames are aligned by 8 bytes instead of 16: https://godbolt.org/z/W4fKWbTxY No magic numbers, and no rounding of the pointers. You’re correct about the call instruction, though. To align things correctly on AMD64, compilers are aligning the stack by `16n+8` bytes before calling functions: https://devblogs.microsoft.com/oldnewthing/20040114-00/?p=41053 – Soonts Oct 17 '21 at 21:01
  • 2
    I must be missing the point. We're trying to determine why GCC 5.x uses the funny number 22. The 16-byte alignment would be attained correctly by replacing that number with 15 or anything larger, so anything more is just wasting stack, and that's why I think it's a bug. No other value for that number would achieve 16 bytes + 8 alignment. I'm not sure where `-mpreferred-stack-boundary` comes in, except that it obviously changes the code because it doesn't need 16-byte alignment anymore. – Nate Eldredge Oct 17 '21 at 21:09
  • 1
    And I don't understand your point about `-mpreferred-stack-boundary=3`. The machine doesn't require any particular alignment for the stack pointer in general. But the ABI requires 16 byte alignment, not 8, in order to make it easier to use aligned SSE instructions on the stack, so `-mpreferred-stack-boundary=3` won't be ABI-compliant. – Nate Eldredge Oct 17 '21 at 21:10