
A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers to see all four of these attempts at equality tests as equivalent:

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}


[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}

But to my surprise, only the ones with memcpy or memcmp get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)

Isn't it obvious to the optimizer that checking equality on two contiguous four-byte pairs is the same as comparing all eight bytes?

An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but it still performs two separate 32-bit operations.

bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}

GCC and Clang both have the same missed optimizations when passing the structs by value (so a is in RDI and b is in RSI because that's how x86-64 System V's calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to cmp rdi, rsi / sete al, but the others do separate 32-bit operations.
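For concreteness, a sketch of the by-value variant (the function name is hypothetical; the body matches naiveEqual):

// Reusing the Point defined above: by value, x86-64 System V packs each
// Point into one integer register, so a arrives in RDI and b in RSI.
[[nodiscard]]
bool naiveEqualByValue(Point a, Point b) {
    return a.x == b.x && a.y == b.y;  // still two 32-bit compares on GCC and Clang
}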

`struct alignas(uint64_t) Point` surprisingly still helps in the by-value case where arguments are in registers, optimizing both naiveEqual versions for GCC, but not the bithack XOR/OR (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Clang isn't helped by alignment.
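A sketch of that alignment experiment (the type and function names are hypothetical):

// Hypothetical over-aligned variant: 8-byte alignment is now guaranteed.
struct alignas(std::uint64_t) AlignedPoint {
    std::int32_t x, y;
};

[[nodiscard]]
bool alignedNaiveEqual(AlignedPoint a, AlignedPoint b) {
    // With the alignment promise, GCC emits cmp rdi, rsi / sete al here,
    // but still not for the XOR/OR bithack version.
    return a.x == b.x && a.y == b.y;
}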

Ben
  • There might be an alignment issue... – Jarod42 Feb 18 '21 at 15:48
  • At the cost of 2 memcpys? – Mansoor Feb 18 '21 at 15:48
  • @M.A No. See the assembly output in the supplied link. – Drew Dormann Feb 18 '21 at 15:49
  • I don't have any experience with clang's optimization output, but it suspiciously says "<2 x i32> is unsupported by vectorizer" – dyp Feb 18 '21 at 15:58
  • How about `return std::memcmp(&a, &b, sizeof(a)) == 0;`? It generates the same assembly as the optimized version and is more expressive. – Aykhan Hagverdili Feb 18 '21 at 16:00
  • Huh clang uses SSE registers and a `vpcmpeqd` for `naiveEqual` given a recent enough architecture... – dyp Feb 18 '21 at 16:02
  • @dyp: Wow, yeah, and pointlessly expands the compare result to two 64-bit elements with a `vpmovsxdq` / `vmovmskpd` instead of just using `vmovmskps` / `cmp al, 0xf` (the top 2 bits will always be set because the high zeros in the `pcmpeqd` input will compare equal). Or even `vpmovmskb`; the low 8 bits are all we need. Of course pure scalar is clearly better here, but if it was looking for something like `a.x==b.x && a.y != b.y`, you could do *that* with clang's SIMD strategy just using a different compare value, like `0x1` in the low 2 bits instead of `0x3`. – Peter Cordes Feb 19 '21 at 01:27
  • I think you can take a look at the file gcc/fold-const.c (the optimization happens very early when you add alignas), function fold_truth_andor_1. I would guess the issue is where it calls get_best_mode, but I didn't check. – Marc Glisse Feb 19 '21 at 01:48
  • For C++20, `return std::bit_cast<std::int64_t>(a) == std::bit_cast<std::int64_t>(b);` is the type-safe version of `memcpy` / `memcmp`, and it generates the same optimized assembly. – bolov Feb 19 '21 at 01:56
  • I *would* be surprised if the compiler merged this into a single `cmpq` - a logical `AND` uses a [short-circuit evaluation](https://en.wikipedia.org/wiki/Short-circuit_evaluation). It does the minimum required for the first expression. I'm not even convinced that what you *expect* is optimal; given `(2^32)` values for `(x)`, what's the probability that `(a.x == b.x)` without special compiler hints? What are the costs of loading, comparing 32-bit values vs. 64-bit values? – Brett Hale Feb 19 '21 at 08:54
  • @bolov: True, that would give you the equivalent of `static_assert(sizeof(a) == sizeof(b))`, but no more type checking than that. Any two different types of the same size as `int64_t` can be compared that way, including a `Point*` and a `Point` in this case on 64-bit systems. – Peter Cordes Feb 19 '21 at 08:59
  • @BrettHale: That reasoning is very faulty. For example, `x < 10 && x > 1` optimizes into a sub / cmp / setbe (unsigned below or equal) range-check: https://godbolt.org/z/G8h3eM. GCC is certainly willing to consider doing work the C abstract machine wouldn't, especially if it can get it all done without any more instructions. (Including if-conversion from branchy source to branchless asm). One answer even points out that GCC actually does do the desired optimization if you promise it alignment of `Point`. – Peter Cordes Feb 19 '21 at 09:03
  • @PeterCordes - I'm not arguing with the logic. I'm arguing about the compiler heuristics for a particular case. The alternative implementation (hints) may tip the compiler into another 'state'. You've a better understanding of low-level stuff than I do - could a 32-bit load / cmp be considered 'faster', even in optimized code? – Brett Hale Feb 19 '21 at 09:15
  • @BrettHale: Your first sentence seems to be arguing that compilers won't try to optimize in ways that unconditionally do work that's short-circuited in the source. That's what I was showing wasn't true in general. Yes, it comes down to the details of the case. A 64-bit *aligned* load is basically the same cost as one 32-bit load. But without alignment, yes, it's certainly possible for it to be slower from a cache-line split, even ~100 cycles slower if split across a 4k page on a CPU before Skylake. But that's probably unlikely. And that excuse falls apart for the by-val reg-arg case at the bottom. – Peter Cordes Feb 19 '21 at 09:18
  • @PeterCordes - your example would be more instructive if it didn't use constants; but I see your point. It's a missed optimization in any realistic case. – Brett Hale Feb 19 '21 at 09:32
  • @BrettHale: Another thing to keep in mind re: compiler heuristics: most `int` values aren't uniformly randomly distributed (law of small numbers), and if `==` is "interesting" enough for code to check for it, there's a significant chance it happens often. Agreed that overall this optimization is very likely worth doing even for possibly-unaligned memory in most use-cases. (Especially since most such 64-bit objects probably will be aligned, e.g. in a normally-allocated array of Point[].) – Peter Cordes Feb 19 '21 at 09:38
  • @PeterCordes - I think I was trying to guess why it 'missed' - which was presumptuous considering I've not looked at the compiler internals for this. I don't know if the compiler makes any assumptions wrt bit-distribution patterns; e.g., float vs int distribution. – Brett Hale Feb 19 '21 at 09:50
  • There is a chance that the standard forbids the optimization. The compiler is not allowed to cause a data race. For example, an optimization changing `if (con) x++;` to `x++; if (!con) x--;` (let's just assume the second is faster for some reason) is not allowed, because accessing `x` in another thread while `con` is false is not a data race, but if the optimization happens it is. Similarly for the 2 `int`s: `naiveEqual` does not touch `y` if `x` is already not equal. Doing a 64-bit compare would add a read of `y` that wasn't there before, possibly causing a data race. – nwp Feb 19 '21 at 11:39
  • Unfortunately I don't know if the standard says that the default struct comparison must behave like `naiveEqual` or if `naiveEqual1` is allowed. Also the target platform might not care for that data race, so it might be allowed under "as if". Since I don't actually know if this is the answer I'll just leave it as a comment. Maybe someone does the hard work of checking the standard and posts it as an answer. – nwp Feb 19 '21 at 11:41
  • @PeterCordes As I say in my answer, beware the store-load forwarding cliff. – EOF Feb 19 '21 at 14:38
  • @PeterCordes it also checks that the types are TriviallyCopyable. It is completely type safe. A `bit_cast` is always legal. – bolov Feb 19 '21 at 18:34
  • "if I check equality on contiguous pairs of four bytes that that's the same as comparing on all eight bytes". Citation needed. Are those eight bytes aligned? See https://blog.quarkslab.com/unaligned-accesses-in-cc-what-why-and-solutions-to-do-it-properly.html . – Eric Towers Feb 19 '21 at 22:18
  • In case you are not aware, you could ask on https://gcc.gnu.org/bugzilla/ . Then the compiler might be improved, or you might get an authoritative answer as to why this shouldn't happen. – Marc Glisse Feb 20 '21 at 10:36
  • @nwp: The C++ source isn't allowed to contain a data race, but the compiler *can* introduce *reads* in asm on ISAs where it doesn't fault (no HW race detection). It's not at all the same as inventing writes; as you say that's not allowed. But reads are, as long as it just gives an unpredictable value without disturbing the behaviour of anything *else* (unlike C++ UB), it's fine. See my answer on [the followup Q&A about pointer instead of reference](//stackoverflow.com/a/66299499), especially the footnote. That's why GCC and clang do legally do optimizations like this in related cases. – Peter Cordes Feb 21 '21 at 11:36
  • @MarcGlisse: Ah I see. Deleting my previous comment. – Peter Cordes Feb 21 '21 at 12:27

3 Answers


If you "fix" the alignment, all give the same assembly language output (with GCC):

struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};

Demo

As a note, one of the correct/legal ways to do certain things (such as type punning) is to use memcpy, so it seems logical for the compiler to have specific (or more aggressive) optimizations around that function.
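For illustration, a minimal sketch of that memcpy-based type punning (the helper names are hypothetical; this mirrors optimizedEqual from the question):

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Hypothetical helper: view a Point's object representation as one 64-bit
// integer. memcpy is the sanctioned way to type-pun before C++20, and
// compilers optimize the fixed-size copy away entirely.
std::uint64_t asBits(const Point &p) {
    std::uint64_t bits;
    static_assert(sizeof(Point) == sizeof(bits));
    std::memcpy(&bits, &p, sizeof(bits));
    return bits;
}

bool equalViaBits(const Point &a, const Point &b) {
    return asBits(a) == asBits(b);  // compiles to a single 64-bit cmp
}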

Jarod42
  • But memcpy doesn't assume alignment... so the optimizedEqual doesn't assume that Point is overaligned – dyp Feb 18 '21 at 15:53
  • So... why does the memcpy version not need alignment? The compiler sees through the memcpy in that it copies the unaligned structs to registers... is this a missing compiler optimization that the alignment somehow nudges? – Ben Feb 18 '21 at 15:54
  • This is an interesting observation, but I don't feel that it answers the "Why?" _Why are these valid, trivial, and equivalent functions producing different assembly?_ – Drew Dormann Feb 18 '21 at 15:54
  • Maybe the optimizers think that the two comparisons will be faster in case of misalignment, and _this_ optimization is broken by memcpy?? – dyp Feb 18 '21 at 15:55
  • FWIW, if you have an aligned and unaligned `Point` and implement comparison of unaligned as a copy to aligned and compare, that gets optimized: https://godbolt.org/z/b38xox – Ben Feb 18 '21 at 15:58
  • @Ben `Point` is by default aligned to 4 bytes, but those locals `ai`, `bi` are aligned to 8 bytes because they are `std::uint64_t`, so a 64-bit `cmp` instruction can be used. – Quimby Feb 18 '21 at 16:01
  • So is the idea that in isolation like this, I get what I get, but when these functions get inlined, things might change? Is loading a 4-byte-aligned block of 8 bytes into an 8-byte register more expensive than if it's 8-byte aligned? – Ben Feb 18 '21 at 16:08
  • So, why does the alignment matter here? Why can't the compiler do the optimization OP did manually? – Aykhan Hagverdili Feb 18 '21 at 16:10
  • FWIW, the big three compilers all seem to do the same thing. – Ben Feb 18 '21 at 16:14
  • You can just write `alignas(std::uint64_t)`. – eerorika Feb 18 '21 at 16:32
  • @AyxanHaqverdili: guaranteed alignment means the optimization is even more profitable: no chance of cache-line splits when using single 64-bit loads. This might make the optimizer try harder, or bump a heuristic past some threshold of profitability. But without knowing which, this answer is just a useful observation and a workaround, not a real answer. – Peter Cordes Feb 19 '21 at 01:14
  • @Ben Whether it is more expensive to load a 4-byte-aligned or an 8-byte-aligned block of 8 bytes is processor dependent, but the general answer is yes. On x86/x64, unaligned reads are allowed to be slower (basically turning them into a pair of reads). On some architectures (like the 68000, if I recall), unaligned reads are illegal, generating a "bus exception." – Cort Ammon Feb 19 '21 at 03:38
  • A lengthy back-and-forth discussion between Krazy Glew and supercat was [migrated to chat](https://chat.stackoverflow.com/rooms/228977/discussion-between-krazy-glew-and-supercat). Please note that comments are not for extended discussions. They should be reserved for suggesting improvements to and/or asking for clarification on the associated post. – Cody Gray - on strike Feb 19 '21 at 23:40

There's a performance cliff you risk falling off when implementing this as a single 64-bit comparison:

You break store-to-load forwarding.

If the 32-bit numbers in the structs are written to memory by separate store instructions and then quickly loaded back with a 64-bit load instruction (before the stores hit L1$), your execution will stall until the stores commit to the globally visible, cache-coherent L1$. If the loads are instead 32-bit loads that match the previous 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches cache. This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but it is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively: because no other CPU can have seen the store yet, both the store and the code that used the loaded value on this CPU can safely be rolled back if the speculation fails.

If you want this to use 64-bit operations and you don't want this perf cliff, you may want to ensure the struct is also always written as a single 64-bit number.
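For illustration, a minimal sketch of that (the helper name is hypothetical; assumes a little-endian target):

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Hypothetical helper: write both fields with a single 64-bit store, so a
// later 64-bit load of the whole Point (as in optimizedEqual) can be
// store-forwarded instead of stalling. Assumes little-endian layout, where
// x occupies the low 32 bits of the struct's representation.
void writePoint(Point &dst, std::int32_t x, std::int32_t y) {
    std::uint64_t bits = static_cast<std::uint32_t>(x)
                       | (static_cast<std::uint64_t>(static_cast<std::uint32_t>(y)) << 32);
    static_assert(sizeof(Point) == sizeof(bits));
    std::memcpy(&dst, &bits, sizeof(bits));  // typically lowers to one 8-byte mov
}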

EOF
  • Why does that change with alignment? – dyp Feb 19 '21 at 15:30
  • @dyp Alignment isn't the (main) problem, atomicity is. It's reasonably easy (wild handwaving) to make a store buffer (required to get any reasonable performance in a cache-coherent multiprocessor system) that can snoop later load instructions (required to avoid breaking single-thread semantics). It's also relatively easy (it's been done on x86 for a long time, because of the scarcity of registers, especially pre-x64) to *forward* the value if it came from a single store instruction. But to take bits from *multiple* store instructions, combine them and forward them, would be *difficult* indeed. – EOF Feb 19 '21 at 15:34
  • What I meant was: why is the optimization performed if additional alignment is given? Does that somehow change your argument? I mean, it could cross a cache line w/o the alignment, but does it influence store->load fwd? – dyp Feb 19 '21 at 15:45
  • @dyp It doesn't, except that some (especially older) machines may not do store-load forwarding if the store was misaligned. In that case, forcing 64-bit stores will not avoid the performance degradation. – EOF Feb 19 '21 at 15:49
  • *your execution will stall until the stores commit to globally visible cache coherent L1$* - Not quite. There's evidence that a Store-forwarding stall on modern x86 CPUs doesn't have to wait for commit, it just has to do a slower more complete scan of the store buffer, possibly also merging with data from L1d. [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/a/46145326) has some more detail on that evidence. It's also not a pipeline stall, OoO exec may be able to hide the latency. But yes, good point, usually something to avoid. – Peter Cordes Feb 19 '21 at 21:45
  • But IIRC, I've been told by GCC devs that GCC doesn't know anything about store-forwarding stalls and doesn't actively try to avoid them. (Devs do, so that doesn't rule out tuning some heuristics for cost/benefit of doing wider loads, though.) – Peter Cordes Feb 19 '21 at 21:48
  • @PeterCordes re:"non-fast-path forwarding". Do you mean store-forward stall or are there fast/slow paths for a successful store forward? [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/questions/46135766/can-modern-x86-implementations-store-forward-from-more-than-one-prior-store#:~:text=%C2%A7%2010.12%20Store%20forwarding%20stalls,same%20address%2C%20regardless%20of%20alignment.) seems certain that two 2-byte stores + 4 byte load should be a store-forward stall. (Tested and saw ```ld_blocks.store_forward``` incrementing). – Noah Feb 22 '21 at 08:36
  • @PeterCordes what are the 15c and 16c referring to? 15c for L1d store, store, load vs 16c for non-fast path forwarding? – Noah Feb 22 '21 at 08:38
  • @Noah: Read the comments in my Godbolt link. 2 stores dependent on the load that both have to be reloaded (instead of the reload reading 1 store + merging data from L1d cache) is slower because of the resource conflict: 2 stores that have to write data to the store buffer. – Peter Cordes Feb 22 '21 at 08:42
  • @Noah: Re: "non-fast-path" - yes, I mean a store-forwarding stall. But it's still forwarding without necessarily waiting for data to be written to L1d$. Possibility #2 in my answer on [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/a/46145326) which I still think is how it works, but I haven't tested by putting cache-miss stores before the store-forwarding. – Peter Cordes Feb 22 '21 at 08:43

Why can't the compiler generate [same assembly as memcpy version]?

The compiler "could" in the sense that it would be allowed to.

The compiler simply doesn't. Why it doesn't is beyond my knowledge, as that requires deep knowledge of how the optimiser has been implemented. But the answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume one output is faster than the other on all target CPUs".

If you use Clang instead of GCC, you'll notice that it produces the same output for `naiveEqual` and `naiveEqual1`, and that the assembly has no jump. It is the same as the "optimised" version, except that it uses two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of `Point` as shown in Jarod42's answer has no effect on the optimiser.

MSVC behaves like Clang in the sense that it is unaffected by the alignment, but differently in the sense that it doesn't get rid of the jump in `naiveEqual`.

For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for `naiveEqual`. For whatever reason, GCC opted to use `jne` instead of `je` for the jump.
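For reference, a minimal sketch of that C++20 defaulted comparison (mirroring the Point from the question):

#include <cstdint>

struct Point {
    std::int32_t x, y;
    // C++20: defaulted, member-wise equality; equivalent to naiveEqual.
    bool operator==(const Point &) const = default;
};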

is this a missing compiler optimization

With the assumption that one is always faster than the other on the target CPUs, that would be a fair conclusion.

eerorika
  • clang with `-march=tigerlake` uses SSE. – dyp Feb 18 '21 at 17:20
  • Also interesting: When I replace my `Point` with `std::tuple` or `std::pair` I get the same behavior... but `std::array` is a single compare even though all three are (usually, I expect!) the same bits in memory with the same alignment. – Ben Feb 18 '21 at 18:58
  • @Ben gcc does that array optimization, but clang doesn't... – dyp Feb 18 '21 at 19:25
  • @dyp. Interesting. So clearly someone/something thinks it's a good idea, but somehow it fails to realize it in the tuple, pair, and struct case. – Ben Feb 18 '21 at 20:18
  • As noted by Krazy Glew, the substitution is *not* valid, since it might affect program behavior if the leading portions of both structures could always be accessed, and the later portions could always be accessed if the earlier portions matched, but the later portions might not be accessible in some cases where the leading portions don't match. – supercat Feb 19 '21 at 18:16
  • @supercat: As I [commented](https://stackoverflow.com/questions/66263263/why-cant-gcc-generate-an-optimal-operator-for-a-struct-of-two-int32s/66279126?noredirect=1#comment117192100_66263393) in that thread, that's incorrect. C structs are all-or-nothing, unlike separate indexes relative to a pointer. Accessing `a.x` guarantees that `a.y` is accessible. – Peter Cordes Feb 20 '21 at 02:22
  • @PeterCordes: Threading semantics are defined if one thread reads `a.x` while another accesses `a.y`. For the first thread to read `a.y`, however, would invoke Undefined Behavior even if the first thread ignores the value read. In a non-sanitizing build, eagerly performing the read of the second word ahead of the comparison would be valid *if it were done in a way that precluded any incompatible downstream optimizations*, but from what I can tell different parts of clang and gcc optimization are handled by different people who don't have a consensus understanding about... – supercat Feb 20 '21 at 13:18
  • ...what intermediate-language constructs either need to be treated as defined by downstream passes (even though they're not defined in C) or else avoided by upstream passes (even though they would have defined behavior at the machine level). If an intermediate language wouldn't have a sound way of representing an optimized version of the code whose behavior would be defined in *all* cases where the original was, a correct optimizer must refrain from making that optimization, even if it would be useful in *almost* all cases. – supercat Feb 20 '21 at 13:25
  • @supercat: How is there any problem here? If the first 32 bits don't match, the `==` compare will be false no matter what garbage you read in the 2nd 32 bits. x86 doesn't have hardware race detection so it won't fault. Or are you talking about hypothetical badness on other ISAs, from GCC's target-independent optimizations doing this without properly checking that the target can't do race detection? Does GCC support any targets with HW race detection? – Peter Cordes Feb 20 '21 at 22:07
  • @PeterCordes: My concern is with optimizers or sanitizers. Neither gcc nor clang respects the principle that if an ISA would define a corner case behavior, implementations intended for low-level programming on that platform should do likewise, nor do the developers of different optimization stages seem to have a consistent idea of what operations are and are not defined at intermediate stages. – supercat Feb 21 '21 at 03:03
  • @supercat: I'm aware of that argument / position you take on stuff like strict aliasing or signed-overflow. But I don't see how it applies here. You think it should be possible for a `struct Point &` reference to be to an object that's only half present, i.e. extending into an unmapped page? (Or are you still talking about data races?) That unmapped page argument makes some sense for a `struct Point *` pointer (although it's not in practice how C works), but the caller would have to have a pointer pointing at a partially-present struct, and do `foo(*p)` which looks like the whole thing. – Peter Cordes Feb 21 '21 at 03:24
  • Related: [Why does Clang generate different code for reference and non-null pointer arguments?](https://stackoverflow.com/q/66298438) is a followup question to this (and my answer there points out that the extra read is only a possible perf problem). Also, [Why is gcc allowed to speculatively load from a struct?](https://stackoverflow.com/q/46522451) shows that GCC in fact does read both members in one case where the C source only reads one or the other, with answers and comments debating the clarity of the standard on the point. (Marc Glisse points out it's not as clear as we'd like.) – Peter Cordes Feb 21 '21 at 11:46