In *C++ and Beyond 2012: Herb Sutter - atomic<> Weapons, 2 of 2*, Herb Sutter argues (around 0:38:20) that one should use `xchg`, not `mov`/`mfence`, to implement `atomic_store` on x86. He also seems to suggest that this particular instruction sequence is what everyone agreed on. However, GCC uses the latter. Why does GCC use this particular implementation?
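For concreteness, here is roughly what the two choices look like for a simple sequentially consistent store (the assembly in the comments is approximate and depends on compiler version, target, and flags):

```cpp
#include <atomic>

std::atomic<int> x;

void store_seq_cst(int v) {
    // Sequentially consistent store (also the default memory order).
    x.store(v, std::memory_order_seq_cst);
    // GCC's output at the time:   mov   DWORD PTR x[rip], edi
    //                             mfence
    // Sutter's suggestion:        xchg  DWORD PTR x[rip], edi   ; implicitly locked
}
```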
- This answer in a previous question pretty much covers it: http://stackoverflow.com/a/22283062/3826372 – Ross Ridge Jul 26 '14 at 05:42
1 Answer
Quite simply, the `mov` and `mfence` method is faster, as it does not trigger the redundant memory read that `xchg` does, which takes time. The x86 CPU guarantees strict ordering of writes between threads anyway, so it is enough.

Note that some very old CPUs have a bug in the `mov` instruction which makes `xchg` necessary, but this is from a very long time ago, and working around it is not worth the overhead for most users.

Credit to @amdn for the information on the bug in old Pentium CPUs causing `xchg` to be needed in the past.
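As a rough illustration of why the barrier matters only for the fully sequentially consistent case: x86 stores already have release semantics, so a release store is a plain `mov`, and the extra barrier (whether `mfence` or the implicit lock of `xchg`) exists only to stop the store being reordered with later loads. A sketch (assembly described in the comments is approximate):

```cpp
#include <atomic>

std::atomic<int> flag;

void store_release(int v) {
    // Ordinary x86 stores already provide release ordering,
    // so this compiles to a plain mov with no barrier.
    flag.store(v, std::memory_order_release);
}

void store_seq_cst(int v) {
    // Sequential consistency additionally forbids reordering this
    // store with *later* loads (StoreLoad), so a full barrier is
    // required: either mov followed by mfence, or a locked xchg.
    flag.store(v, std::memory_order_seq_cst);
}
```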
- I'm curious as to why Herb Sutter argues (in 2012) for using `xchg` then, since he seems knowledgeable on the subject. – tibbe Aug 12 '14 at 20:17
- @tibbe My personal guess would be that `xchg` was better at some point in the past; I have found various bits of dated documentation saying that `mfence` + `mov` was slow on some Athlons and broken on some early Pentiums. Possibly the advice was sensible once but was simply a little dated; he may have first done the research some time before the talk, and it may have been on dated CPUs even then, so I can see how he could have come to that conclusion. However, pretty much all documentation says that the `mov` + `mfence` path is faster on modern CPUs, if not on very old ones. – Vality Aug 12 '14 at 22:00
- `mov`+`mfence` is slower in practice; GCC recently switched away from it for seq_cst stores, now using `xchg` like other compilers. It already needs to get a copy of the cache line into Exclusive ownership state (except for full-line writes, committing a store requires an up-to-date copy of the line), so the extra read is not at all significant. See also [Does lock xchg have the same behavior as mfence?](https://stackoverflow.com/q/40409297) and note the Skylake microcode update that gave `mfence` an `lfence`-like barrier to OoO exec, unlike `xchg`. – Peter Cordes Apr 08 '22 at 13:51
- @PeterCordes This may well be true now. It's been about a decade since I last investigated this (look at the answer age). I don't have time right now to update this answer, but if you can write a more up-to-date one (saying more or less what you said in your comment) I am sure the OP would appreciate it and I would happily upvote. – Vality Apr 10 '22 at 00:51
- There are existing Q&As that cover it, probably we can close this as a duplicate. I had another look for some, and found [What is the difference in logic and performance between LOCK XCHG and MOV+MFENCE?](https://stackoverflow.com/q/19096112) which itself has some duplicate links. My answer on [Which is a better write barrier on x86: lock+addl or xchgl?](https://stackoverflow.com/a/52910647) at least mentions it. Ah, finally found a good answer about the perf diff: [Why does a std::atomic store with sequential consistency use XCHG?](https://stackoverflow.com/q/49107683) – Peter Cordes Apr 10 '22 at 00:59