Why is gcc allowed to speculatively load from a struct?

Question

Example Showing the gcc Optimization and User Code that May Fault

The function 'foo' in the snippet below will load only one of the struct members A or B; well at least that is the intention of the unoptimized code.

typedef struct {
  int A;
  int B;
} Pair;

int foo(const Pair *P, int c) {
  int x;
  if (c)
    x = P->A;
  else
    x = P->B;
  return c/102 + x;
}

Here is what gcc -O3 gives:

mov eax, esi
mov edx, -1600085855
test esi, esi
mov ecx, DWORD PTR [rdi+4]   <-- ***load P->B**
cmovne ecx, DWORD PTR [rdi]  <-- ***load P->A***
imul edx
lea eax, [rdx+rsi]
sar esi, 31
sar eax, 6
sub eax, esi
add eax, ecx
ret

So it appears that gcc is allowed to speculatively load both struct members in order to eliminate branching. But then, is the following code considered undefined behavior or is the gcc optimization above illegal?

#include <stdlib.h>  

int naughty_caller(int c) {
  Pair *P = (Pair*)malloc(sizeof(Pair)-1); // *** Allocation is enough for A but not for B ***
  if (!P) return -1;

  P->A = 0x42; // *** Initializing allocation only where it is guaranteed to be allocated ***

  int res = foo(P, 1); // *** Passing c=1 to foo should ensure only P->A is accessed? ***

  free(P);
  return res;
}

If the load speculation will happen in the above scenario there is a chance that loading P->B will cause an exception because the last byte of P->B may lie in unallocated memory. This exception will not happen if the optimization is turned off.

The Question

Is the gcc optimization shown above of load speculation legal? Where does the spec say or imply that it's ok? If the optimization is legal, how is the code in 'naughtly_caller' turn out to be undefined behavior?

Why would you even consider allocating less memory than the structure actually need? What is the actual problem you have? Or is this just plain curiosity? — Some programmer dude, Oct 02 '17 at 09:03
By allocating a `Pair` that is smaller than `sizeof(Pair)` you're in undefined behavior territory. You have no reasonable expectation that your code will work correctly after this. — Sean, Oct 02 '17 at 09:03
Of course the optimisation is legal. The function expects a `struct Pair *p`, but you give it a pointer to a different (and smaller) object. — joop, Oct 02 '17 at 09:07
@joop that's of course nitpicking, because the result (UB) is the same, but still, it's not a "different object" (at least not if you mean the *type*) but more like Sean put it: "a `Pair` that is smaller than `sizeof(Pair)`" (sounds like a contradiction, of course makes no sense). Not just the function expects a `Pair`, the compiler "knows" it's a `Pair` because *it is* ... it's broken because of the size mismatch. — , Oct 02 '17 at 09:34
Not directly an answer to the question about loads, but since 6.7.2.1/15 says a pointer to a structure may always be converted to a pointer to its first member, there *is* a standard-compliant way to do this, if for some reason it's required: access `A` with `x = *(int *)P` instead. (GCC will then emit the branch.) Doesn't really generalize though. — Alex Celeste, Oct 02 '17 at 10:58
@Leushenko: This code is essentially doing the reverse: it's converting a pointer to the first `N-1` bytes into a pointer to `N` bytes. No wonder that fails. — MSalters, Oct 02 '17 at 13:23
[Don't cast the result of `malloc` in C](https://stackoverflow.com/q/605845/995714) — phuclv, Oct 02 '17 at 15:03
The key element is that you read `P->A` (spelled this way), which is a promise that P has type Pair. If you write instead `int const*p=&P->A;x=*p;`, you are reading an `int const*`, and the compiler cannot assume anything (and the generated asm is different). — Marc Glisse, Oct 02 '17 at 19:20
If you genuinely need an object that holds either of two objects that have different sizes and possibly alignment restrictions, or a pointer thereto, that’s what a `union` is for. — Davislor, Oct 03 '17 at 03:52
It looks like gcc only does this optimization when the two members are consecutive in the `struct`. Code like your `naughty_caller` does exist in the real world, so this might be a compromise to keep it functioning in at least some cases, by reducing the chance of this optimization occurring on members that cross a page boundary. — Nate Eldredge, Feb 21 '21 at 18:21

score 53 · Answer 1 · answered Oct 02 '17 at 09:16

53

Reading a variable (that was not declared as volatile) is not considered to be a "side effect" as specified by the C standard. So the program is free to read a location and then discard the result, as far as the C standard is concerned.

This is very common. Suppose you request 1 byte of data from a 4 byte integer. The compiler may then read the whole 32 bits if that's faster (aligned read), and then discard everything but the requested byte. Your example is similar to this but the compiler decided to read the whole struct.

Formally this is found in the behavior of "the abstract machine", C11 chapter 5.1.2.3. Given that the compiler follows the rules specified there, it is free to do as it pleases. And the only rules listed are regarding volatile objects and sequencing of instructions. Reading a different struct member in a volatile struct would not be ok.

As for the case of allocating too little memory for the whole struct, that's undefined behavior. Because the memory layout of the struct is usually not for the programmer to decide - for example the compiler is allowed to add padding at the end. If there's not enough memory allocated, you might end up accessing forbidden memory even though your code only works with the first member of the struct.

answered Oct 02 '17 at 09:16

Lundin

195,001
40
254
396

9

BTW: the compiler would even be allowed to load the {A,B} members into a single (wide enough) register, and extract the A or B values using shifting+masking. – joop Oct 02 '17 at 09:49
3

Note: the early 8086/8088 (and even the early compilers for it) used this register-slicing trick for the ah+al registers. – joop Oct 02 '17 at 16:24
This fails to address the fact that C pointers can point to whatever unrelated type, it is only when you dereference them then read/write in it that you get some guarantees. – Marc Glisse Oct 02 '17 at 19:30
2

@MarcGlisse: Dereferencing is enough to give undefined behavior: `The operand of unary * has an invalid value`. Similarly, the documentation of `->` says that its use refers to a *member of a structure object*, so if `P->A` is well-defined, `P` must point to an instance of `struct pair`. – Oct 03 '17 at 04:15
2

@MarcGlisse What does that have to do with the question? – Lundin Oct 03 '17 at 06:31
@Lundin If you replace `x=P->A` with `x=*(int const*)P`, the "surprising" code noticed by the OP disappears, why? And your paragraph 1 seems to give the compiler freedom to dereference `(int*)0` gratuitously. – Marc Glisse Oct 03 '17 at 07:28
@MarcGlisse That would be a different question. Your scenario will have something to do with pointer aliasing. I'm only answering the question posted by the OP here. – Lundin Oct 03 '17 at 08:31
@Lundin the question is why the compiler can load speculatively. Your answer does not make a difference between `x=P->A` where there is a speculative load and `x=*(int const*)P` where there isn't, so I fail to see how you answered the OP... – Marc Glisse Oct 03 '17 at 10:01
@MarcGlisse No, you can find the question(s) below the incredibly large headline called "The Question". Also, feel free to post an answer of your own. – Lundin Oct 03 '17 at 10:42

Jens Gustedt · Answer 2 · 2017-10-03T16:14:20.227

13

No, if *P is allocated correctly P->B will never be in unallocated memory. It might not be initialized, that is all.

The compiler has every right to do what they do. The only thing that is not allowed is to oops about the access of P->B with the excuse that it is not initialized. But what and how they do all of this is under the discretion of the implementation and not your concern.

If you cast a pointer to a block returned by malloc to Pair* that is not guaranteed to be wide enough to hold a Pair the behavior of your program is undefined.

edited Oct 03 '17 at 16:14

answered Oct 02 '17 at 09:08

Jens Gustedt

76,821
6
102
177

I wonder, if one uses an implementation where `malloc` always allocates sizes that are a multiple of four bytes (or returns `NULL`), and `sizeof(int) == alignof(int) == 4`, so that the _implementation_ guarantees that `malloc(sizeof(Pair) - 1)` allocates enough to accommodate a `Pair` if successful, would that make the behaviour defined? – Daniel Fischer Oct 02 '17 at 12:13
@DanielFischer starting such a thought with "*if [...] an implementation [...]*" should always suggest the answer. As long as you don't mean *implementation defined*... – Oct 02 '17 at 12:15
1

@DanielFischer, no as long as by "defined behavior" you mean the behavior that is defined by the C standard. Any platform is free to define any extensions they want, but that is is outside the standard and thus not portable. – Jens Gustedt Oct 02 '17 at 12:22
@DanielFischer: malloc's alignment is always guaranteed to be a multiple of the system word size to prevent such explosions as this one. Basically, the standard says this won't crash so the compiler and malloc() must cooperate so that it won't. Watch it though; that unallocated byte is allowed to be allocated by a malloc(1) call or contain a guard value so don't ever write to it. – Joshua Oct 03 '17 at 04:38

score 8 · Answer 3 · 2017-10-02T09:17:37.747

8

This is perfectly legal because reading some memory location isn't considered an observable behavior in the general case (volatile would change this).

Your example code is indeed undefined behavior, but I can't find any passage in the standard docs that explicitly states this. But I think it's enough to have a look at the rules for effective types ... from N1570, §6.5 p6:

If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value.

So, your write access to *P actually gives that object the type Pair -- therefore it just extends into memory you didn't allocate, the result is an out of bounds access.

edited Oct 02 '17 at 09:17

answered Oct 02 '17 at 09:12

2

I think the reasoning should be the same as for the *struct hack*. (though in that case the declared size would be *smaller* than the actual size) – joop Oct 02 '17 at 09:16
@joop I don't think so, the *struct hack* accesses an array out of bounds ... related, but not the same thing. – Oct 02 '17 at 09:18
@joop there's no passage talking *explicitly* about the size of *allocated objects*, but accessing memory you didn't allocate is of course UB and the changing of the effective type makes clear the problem is not (like commented on the question) passing something that isn't a pointer to `Pair`, but the simple fact that this `Pair` extends to unallocated memory (e.g. because a *complete type* includes size information). – Oct 02 '17 at 09:22
"So, your write access to `*P`" -- There is no write access to `*P`. There is a write to `P->A`, but that is an lvalue of type `int`, not of type `Pair`. This is a severely underspecified area in the standard, IMO. – Oct 02 '17 at 12:00
1

@hvd this is a problem with the wording of the standard indeed. It's clear what is *meant* (IMHO) as you can't have just part of an object and the object is of type `Pair`, but speaking about the *lvalue* is imprecise at best here. – Oct 02 '17 at 12:12
The first sentence gives the impression that the compiler is free to dereference (int*)0 gratuitously at any time... – Marc Glisse Oct 02 '17 at 19:35
2

@hvd the way compilers interpret it is that the spelling of the lvalue is important. `x=P->A` accesses an object of type Pair, while `const int*q=&P->A;x=*q` accesses an object of type int (supposedly even `x=*(int const*)&P->A` does that, and it does for clang, but gcc has a bug, and I forget what `x=*&P->A` is supposed to do). – Marc Glisse Oct 02 '17 at 19:39
1

@MarcGlisse Yes, that'll be what's intended since it matches the example in the discussion on DR #236 for union member access, but the standard doesn't say that and as you point out, there are corner cases where the intent is unclear. Another corner case might simply be a parenthesised structure member access: C++ has a general rule that a "parenthesized expression can be used in exactly the same contexts as those where the enclosed expression can be used, and with the same meaning, except as otherwise indicated" but C doesn't, and GCC *does* treat parenthesised expressions differently in C. – Oct 02 '17 at 21:57

score 6 · Answer 4 · answered Oct 03 '17 at 04:25

6

A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points

If invoking the expression P->A is well-defined, then P must actually point to an object of type struct Pair, and consequently P->B is well-defined as well.

answered Oct 03 '17 at 04:25

1

The way several compilers interpret it, "designates a member of a structure or union object" means that `->` does pointer arithmetic (add the offset), but it is only when you try to **access the value** (either read or write) that the second sentence comes into play and you get some guarantees about an underlying object (so `x=P->A` is good, bug `void*z=&P->A` would not guarantee anything). The standard is not clear about that, but it seems relevant to know what compilers do. – Marc Glisse Oct 03 '17 at 07:37

Peter Cordes · Answer 5 · 2017-10-03T04:52:36.493

A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)

x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on target microarchitecture they're tuning for in any given situation.

Note that on x86 memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. Is it safe to read past the end of a buffer within the same page on x86 and x64?. libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't, instead using scalar loops for the leftover elements at the end of an auto-vectorized loop even where it already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)

To get the behaviour you were expecting, use a const int * arg.

An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if it's struct code is a win, it's a missed optimization not to do it on arrays when it's also safe.).

In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.

Source + gcc and clang output for this and other variations on the Godbolt compiler explorer

// exactly equivalent to  const int p[2]
int load_pointer(const int *p, int c) {
  int x;
  if (c)
    x = p[0];
  else
    x = p[1];  // gcc missed optimization: still does an add with c known to be zero
  return c + x;
}

load_pointer:    # gcc7.2 -O3
    test    esi, esi
    jne     .L9
    mov     eax, DWORD PTR [rdi+4]
    add     eax, esi         # missed optimization: esi=0 here so this is a no-op
    ret
.L9:
    mov     eax, DWORD PTR [rdi]
    add     eax, esi
    ret

In C, you can pass sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all the memory even if the C abstract machine doesn't. The syntax is int p[static 2]

int load_array(const int p[static 2], int c) {
  ... // same body
}

But gcc doesn't take advantage, and emits identical code to load_pointer.

Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.

    lea     rax, [rdi + 4]
    test    esi, esi
    cmovne  rax, rdi
    add     esi, dword ptr [rax]
    mov     eax, esi            # missed optimization: mov on the critical path
    ret

This isn't necessarily good: it has higher latency than gcc's struct code, because the load address is dependent on a couple extra ALU uops. It is pretty good if both addresses aren't safe to read and a branch would predict poorly.

We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.

int load_pointer_v3(const int *p, int c) {
  int offset = (c==0);
  int x = p[offset];
  return c + x;
}

    xor     eax, eax
    test    esi, esi
    sete    al
    add     esi, dword ptr [rdi + 4*rax]
    mov     eax, esi
    ret

gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what it does in the branching version:

    xor     eax, eax
    test    esi, esi
    sete    al
    mov     eax, dword ptr [rdi + 4*rax]
    add     eax, esi
    ret

Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.

After inlining, that final mov probably wouldn't exist, and it could just add into esi.

score 3 · Answer 6 · answered Oct 02 '17 at 19:20

3

This is always allowed under the "as-if" rule if no conforming program can tell the difference. For example, an implementation could guarantee that after each block allocated with malloc, there are at least eight bytes that can be accessed without side effects. In that situation, the compiler can generate code that would be undefined behaviour if you wrote it in your code. So it would be legal for the compiler to read P[1] whenever P[0] is correctly allocated, even if that would be undefined behaviour in your own code.

But in your case, if you don't allocate enough memory for a struct, then reading any member is undefined behaviour. So here the compiler is allowed to do this, even if reading P->B crashes.

answered Oct 02 '17 at 19:20

gnasher729

51,477
5
75
98

The implementation would also have to pad the space where it puts static storage (.bss, .data, and .rodata) to make sure no static objects are at the end of a page. But yes, this could work as long as the implementation *doesn't* provide access to any OS system calls that allocate a page, because that could let a conforming program get a pointer to the last object in a page (or whatever; the memory-protection doesn't have to be page-based.) – Peter Cordes Oct 02 '17 at 19:28
Let's look at the following example: `int *x=malloc(...); foo(x+31, y);` In this case `P->B` may be in an unmapped memory page. – Martin Rosenau Oct 03 '17 at 05:19

Why is gcc allowed to speculatively load from a struct?

Example Showing the gcc Optimization and User Code that May Fault

The Question

6 Answers6

Linked

Related