This is a missed optimization; the branchless code is safe for both C++ source versions.
In *Why is gcc allowed to speculatively load from a struct?*, GCC actually is speculatively loading both struct members through a pointer, even though the C source only references one member or the other. So at least GCC's developers have decided that this optimization is 100% safe, in their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which `int` to load, so clang is still just as reluctant as in your case to invent a load. (C vs. C++: same asm with or without `-xc`, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
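For reference, a minimal sketch of the two source versions being compared (the exact code is on the Godbolt links; I'm assuming a plain two-`int` `Point` here, and the name `nonzero_ptr` is mine):

```cpp
struct Point { int x, y; };

// Pointer version: clang keeps the short-circuit branch, so a->y is
// only loaded on the path where a->x == 0.
bool nonzero_ptr(Point const* a) { return a->x || a->y; }

// Reference version: compiles branchlessly (both members loaded and OR'ed).
bool nonzero_ref(Point const& a) { return a.x || a.y; }
```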
The obvious difference in your asm is that the pointer version avoids access to `a->y` if `a->x != 0`, and that this only matters for correctness¹ if `a->y` was in an unmapped page; you're right about that being the relevant corner case.

But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is, I'm pretty sure, undefined behaviour. In a path of execution that reads `a->x`, the compiler can assume it's safe to also read `a->y`.
This would of course not be the case for `int *p;` and `p[0] || p[1]`, because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
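A hedged sketch of that contrast (the function name is made up for this illustration):

```cpp
// Here the compiler must NOT invent a load of p[1]: p may legally point
// at a 1-element array occupying the last 4 bytes of a mapped page, so
// p[1] can only be touched on the path where p[0] == 0.
bool any_nonzero2(const int* p) { return p[0] || p[1]; }
```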
As @Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it internally transforms the struct accesses into something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers at all.
It can always invent the load for reference args, because references are guaranteed non-null. It would be "even more" UB for the caller to do `nonzero_ref(*ppt)`, like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.
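For instance (a hypothetical caller, mirroring the question's partial-object setup):

```cpp
// If *ppt is not a complete Point (e.g. only its first int sits in a
// mapped page), the UB happens here at the dereference, in the caller,
// before nonzero_ref's branchless body ever runs.
bool call_via_ref(Point const* ppt) { return nonzero_ref(*ppt); }
```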
An experiment: deref the pointer to get a full tmp object:

```cpp
bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;       // copy the whole object: both members are now
                         // accessed unconditionally in the C++ source
    return a.x || a.y;
}
```
https://godbolt.org/z/ejrn9h - compiles branchlessly, same as `nonzero_ref`. Not sure what / how much this tells us; it's what I expected, given that it makes access to `a->y` effectively unconditional in the C++ source.
Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary, since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.
Making asm like this doesn't "introduce data-race UB", because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from `[rdi+4]`, so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe (unlike inventing writes), and is allowed because the object isn't `volatile`, so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid `Point` object.
Part of data races (on non-`atomic` objects) being Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another part is to allow compilers to assume it's safe to reload something they've already accessed once, and to expect the same value unless there's an acquire or seq_cst load between the two points, even making code that would crash if the 2nd load differed from the 1st. That's irrelevant in this case, because we're not talking about turning 1 access into 2 (rather 0 into 1, whose value may not matter), but it is why roll-your-own atomics (e.g. in the Linux kernel) need to use `volatile*` casts for `ACCESS_ONCE` (https://lwn.net/Articles/793253/#Invented%20Loads).
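A minimal sketch of that idiom: the kernel's old C macro was essentially `#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))`; a C++ rendering of the same trick might look like this (`read_once` is my name, not a real API):

```cpp
#include <type_traits>

// Force exactly one real load of x: the volatile-qualified access is a
// visible side-effect, so the compiler can't duplicate it, invent extra
// reads, or assume a second read would see the same value.
template <typename T>
T read_once(const T& x) {
    static_assert(std::is_scalar_v<T>, "sketch covers scalar types only");
    return *static_cast<const volatile T*>(&x);
}
```

In modern C++ you'd of course use `std::atomic` (or C++20 `std::atomic_ref`) with `memory_order_relaxed` instead of rolling your own.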