This is a missed optimization; the branchless code is safe for both C++ source versions.
In *Why is gcc allowed to speculatively load from a struct?*, GCC actually is speculatively loading both struct members through a pointer, even though the C source only references one member or the other. So at least GCC's developers have decided that this optimization is 100% safe, in their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which `int` to load, so clang is still just as reluctant as in your case to invent a load. (C vs. C++: same asm with or without `-xc`, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
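For reference, a minimal sketch of the two source versions being compared (the exact code is on the Godbolt links; I'm assuming a plain two-`int` `Point` here, and the name `nonzero_ptr` is mine):

```cpp
struct Point { int x, y; };

// Pointer version: clang keeps the short-circuit branch, so a->y is
// only loaded on the path where a->x == 0.
bool nonzero_ptr(Point const* a) { return a->x || a->y; }

// Reference version: compiles branchlessly (both members loaded and OR'ed).
bool nonzero_ref(Point const& a) { return a.x || a.y; }
```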
The obvious difference in your asm is that the pointer version avoids access to `a->y` if `a->x != 0`, and that this only matters for correctness¹ if `a->y` was in an unmapped page; you're right about that being the relevant corner case.

But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is, I'm pretty sure, undefined behaviour. In a path of execution that reads `a->x`, the compiler can assume it's safe to also read `a->y`.
This would of course not be the case for `int *p;` and `p[0] || p[1]`, because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
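A hedged sketch of that contrast (the function name is made up for this illustration):

```cpp
// Here the compiler must NOT invent a load of p[1]: p may legally point
// at a 1-element array occupying the last 4 bytes of a mapped page, so
// p[1] can only be touched on the path where p[0] == 0.
bool any_nonzero2(const int* p) { return p[0] || p[1]; }
```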
As @Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it internally transforms the struct accesses into something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers at all.
It can always invent the load for reference args, because references are guaranteed non-null. It would be "even more" UB for the caller to do `nonzero_ref(*ppt)`, like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.
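For instance (a hypothetical caller, mirroring the question's partial-object setup):

```cpp
// If *ppt is not a complete Point (e.g. only its first int sits in a
// mapped page), the UB happens here at the dereference, in the caller,
// before nonzero_ref's branchless body ever runs.
bool call_via_ref(Point const* ppt) { return nonzero_ref(*ppt); }
```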
An experiment: deref the pointer to get a full tmp object:

```cpp
bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;       // copy the whole object: both members are now
                         // accessed unconditionally in the C++ source
    return a.x || a.y;
}
```
https://godbolt.org/z/ejrn9h - compiles branchlessly, same as `nonzero_ref`. Not sure what / how much this tells us; it's what I expected, given that it makes access to `a->y` effectively unconditional in the C++ source.
Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary, since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.
Making asm like this doesn't "introduce data-race UB", because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from `[rdi+4]`, so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe (unlike inventing writes), and is allowed because the object isn't `volatile`, so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid `Point` object.
Part of data races (on non-`atomic` objects) being Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another part is to allow compilers to assume it's safe to reload something they've already accessed once, and to expect the same value unless there's an acquire or seq_cst load between the two points, even making code that would crash if the 2nd load differed from the 1st. That's irrelevant in this case, because we're not talking about turning 1 access into 2 (rather 0 into 1, whose value may not matter), but it is why roll-your-own atomics (e.g. in the Linux kernel) need to use `volatile*` casts for `ACCESS_ONCE` (https://lwn.net/Articles/793253/#Invented%20Loads).
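A minimal sketch of that idiom: the kernel's old C macro was essentially `#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))`; a C++ rendering of the same trick might look like this (`read_once` is my name, not a real API):

```cpp
#include <type_traits>

// Force exactly one real load of x: the volatile-qualified access is a
// visible side-effect, so the compiler can't duplicate it, invent extra
// reads, or assume a second read would see the same value.
template <typename T>
T read_once(const T& x) {
    static_assert(std::is_scalar_v<T>, "sketch covers scalar types only");
    return *static_cast<const volatile T*>(&x);
}
```

In modern C++ you'd of course use `std::atomic` (or C++20 `std::atomic_ref`) with `memory_order_relaxed` instead of rolling your own.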