
In a codebase I reviewed, I found the following idiom.

void notify(struct actor_t act) {
    write(act.pipe, "M", 1);
}

// thread A sending data to thread B
void send(byte *data) {
    global.data = data;
    notify(threadB);
}

// in thread B's event loop
read(this.sock, &cmd, 1);
switch (cmd) {
    case 'M': use_data(global.data); break;
    ...
}

"Hold it," I said to the author, a senior member of my team, "there's no memory barrier here! You don't guarantee that global.data will be flushed from the cache to main memory. If thread A and thread B run on two different processors, this scheme might fail."

The senior programmer grinned, and explained slowly, as if teaching his five-year-old boy how to tie his shoelaces: "Listen, young boy, we've seen many thread-related bugs here, in high-load testing and at real clients," he paused to scratch his longish beard, "but we've never had a bug with this idiom."

"But, it says in the book..."

"Quiet!", he hushed me promptly. "Maybe theoretically it's not guaranteed, but in practice, the fact that you used a function call is effectively a memory barrier. The compiler will not reorder the instruction global.data = data, since it can't know whether anyone uses it in the function call, and the x86 architecture will ensure that the other CPUs see this piece of global data by the time thread B reads the command from the pipe. Rest assured, we have ample real-world problems to worry about. We don't need to invest extra effort in bogus theoretical problems.

"Rest assured, my boy, in time you'll learn to separate the real problems from the I-need-to-get-a-PhD non-problems."

Is he correct? Is that really a non-issue in practice (say x86, x64 and ARM)?

It's against everything I learned, but he does have a long beard and a really smart look!

Extra points if you can show me a piece of code proving him wrong!

Craig McQueen
mikebloch
    Of course he's right, he knows from experience. He could have mentioned that writing or reading a pipe or socket always involves taking a lock in the kernel, which implies a barrier, but proving that to a young whipper-snapper takes a lot of time. – Hans Passant May 22 '12 at 08:33
  • @HansPassant but even those syscalls can, in pathological cases, end up running on a different core than the one that called them, issuing the memory barrier on the wrong core, can't they? – mikebloch May 22 '12 at 08:44
  • Code does not pathologically jump from one thread to another without a synchronization primitive. – Hans Passant May 22 '12 at 08:48
  • @HansPassant, can't the kernel decide to move a syscall to another thread for performance from time to time? Can't this happen exactly before the syscall? – mikebloch May 22 '12 at 09:02
  • You are not given sufficient information for a complete answer to this problem. It is system- and C-standard-version dependent. Also, it doesn't have much to do with the processor architecture of your platform. And contrary to what was suggested, it has nothing to do with whether your function call is a system call or not. – Jens Gustedt May 22 '12 at 10:06
  • @JensGustedt Show me a modern C compiler and modern processor architecture where it doesn't apply - and you earn the jackpot. – mikebloch May 22 '12 at 12:03
  • @mikebloch, I don't think that this is the point. The guarantees that these platforms give and the reasoning why it would be possible to rely on such a behavior would be different. In my answer I have given you the POV of C11 which is quite recent and basically not yet implemented. Where I think that POSIX basically has the same requirements, I wouldn't know how to argue for other OS. – Jens Gustedt May 22 '12 at 14:01
    Question 1: Does a memory barrier "flush cache to main memory" or just guarantee "all writes have been made - to the cache"? Don't the cache coherency mechanisms kick in to handle cache races between cores? Question 2: How long does a processor delay a write? Are we talking 10 machine instructions or 1000? Is this growing and going to continue to grow? I ask because there are many hundreds or even thousands of machine instructions about to execute in that call to notify(). – johnnycrash May 25 '12 at 16:23
  • Although this is very late to the discussion: I think the write syscall injects the memory barrier on behalf of the user-space code. If notification were done by some memory-barrier-unaware user-space mechanism, you would have seen errors such as unprocessed data, a crash because global.data is NULL, or re-processing of the previous global.data... – Malkocoglu Jul 31 '15 at 09:28
    `the fact you used a function call is effectively a memory barrier, the compiler will not reorder the instruction global.data = data` Barriers aren't for the compiler, they're for the hardware. – brettwhiteman Nov 04 '15 at 01:23
    I promised you an upvote three and a half years ago, back when Stackoverflow used to have a limit on the number of votes per day (or something). I have come back, 1261 days later, to fulfill this promise. Have an upvote on me. +1. – Mahmoud Al-Qudsi Nov 04 '15 at 21:33

4 Answers


Memory barriers aren't just there to prevent instruction reordering. Even if instructions aren't reordered, you can still run into problems with cache coherence. As for the reordering: it depends on your compiler and settings. ICC is particularly aggressive with reordering. MSVC with whole-program optimization can be, too.

If your shared data variable is declared volatile, most compilers will generate a memory barrier around reads and writes of the variable and prevent reordering, even though that's not in the spec. This is not the correct use of volatile, nor what it was meant for.

(If I had any votes left, I'd +1 your question for the narration.)

Mahmoud Al-Qudsi
  • But is that really an issue on `x86`/`x64`? Can I write a short program that demonstrates it failing? (and thanks for the kind words, technical discussion should be fun). – mikebloch May 22 '12 at 08:24
    x86 makes some guarantees with regards to cache coherence. x64 does not, but in reality Intel realizes devs wrote crappy, unsafe code for x86 and as such, even though they are not obligated to and it isn't in the spec, perform many operations atomically and also perform cache synchronization. With ARM though, all bets are off. See this post (though it's not x86-specific) for a lot more info and some more tongue-in-cheek narration: http://ridiculousfish.com/blog/posts/barrier.html – Mahmoud Al-Qudsi May 22 '12 at 08:27
  • did you just say that Intel *rewards* developers who write bad code and ignore documentation, by investing R&D resources to solve the problem they wrecked upon themselves? Probably at the expense of my and your CPU efficiency? Man, some days I wonder why I'm even trying that hard. – mikebloch May 22 '12 at 08:32
    @mike they tried *not* to with the Itanic, and we all know how successful they were there. Then AMD came along and said "here's a 64-bit platform that will run your x86 binaries *and* run your crappy code just recompiled for x64 without fixing your mistakes" and thus x86_64 was born. – Mahmoud Al-Qudsi May 22 '12 at 08:33
  • While the volatile keyword will not guarantee memory barriers nor thread-safety, it will however protect a multi-threaded app from bugs related to the compiler performing incorrect optimizations since it notices that your thread callback functions aren't called anywhere in your code. This is probably unlikely to happen on modern compilers for x86, but far more likely on low-level embedded compilers. – Lundin May 22 '12 at 11:00
  • *All* ISAs guarantee that all CPUs share a coherent view of memory (at least for the cores that we want to run multiple threads across). x86 and x86-64 have the same asm/hardware memory-ordering model: program order + a store-buffer with store forwarding. https://preshing.com/20120930/weak-vs-strong-memory-models/. Intel's manuals don't even make a distinction: "x86" (32-bit mode) is just a sub-mode of x86-64 (called "legacy mode"). Current x86-64 CPUs have just as much runtime memory reordering as the last 32-bit-only CPUs like Pentium III and Pentium-M. – Peter Cordes Apr 08 '20 at 10:45

In practice, a function call is a compiler barrier, meaning that the compiler will not move global memory accesses past the call. A caveat to this is functions which the compiler knows something about, e.g. builtins, inlined functions (keep in mind IPO!) etc.

So a processor memory barrier (in addition to a compiler barrier) is in theory needed to make this work. However, since you're calling read and write which are syscalls that change the global state, I'm quite sure that the kernel issues memory barriers somewhere in the implementation of those. There is no such guarantee though, so in theory you need the barriers.

janneb
  • Depends on your compiler. ICC will wreak havoc with incorrect/unsafe code as it'll tend to optimize the heck out of everything, analyzing code from all files and modules, rewriting and reordering across functions, and more. – Mahmoud Al-Qudsi May 22 '12 at 08:25
  • So in practice, kernel-mode code == memory barrier? Sounds reasonable, and it sounds like the old man was right after all. It doesn't seem that ICC would be able to reorder code around a system call, since it doesn't know what the kernel will do. – mikebloch May 22 '12 at 08:27
  • @mikebloch: I wouldn't state that as strongly, but if you think about it syscalls which allow information to be passed between threads potentially on different CPU's need barriers for correct operation. – janneb May 22 '12 at 08:32
  • Just do not name it "function call". It's "system call". Functions can be contained in your program and optimized, system calls are external. – blaze May 22 '12 at 08:34
    @janneb yeah, but even those syscalls can, in pathological cases, end up running on a different core than the one who called them, issuing the memory barrier on the wrong thread. – mikebloch May 22 '12 at 08:34
    @blaze: As I tried to explain in the second sentence, it's function calls that the compiler cannot peek into somehow, that the compiler must assume can touch global state. From the compiler perspective, a syscall is no different than, say, a function in a shared library (with no information available beyond the function prototype). – janneb May 22 '12 at 08:37
  • @janneb: I just want to err on the safe side: if this is a plain function, it's hard to predict when the compiler can see its definition and when it can't. Even a shared library doesn't promise much, since this specific function can be a header-only inline. Syscalls are guaranteed to be unknown to the compiler. – blaze May 22 '12 at 09:43
  • As far as I know, the compiler is not allowed to change the order of instruction execution past the call, since there is a C language sequence point after the function parameter evaluation but before the call. So before the function is called, I believe the compiler must be done fiddling around with instruction re-ordering. – Lundin May 22 '12 at 11:08
    @Lundin: Well, the compiler can do any transformation which retains the original program semantics (externally visible side-effects, if you will). So if it can prove that an expression is pure (no side effects) or that the side-effects don't matter, it can ignore the sequence point and reorder operations. In practice I'm not sure that reordering anything past an external function call (with no information about the function internals) buys you much, so it wouldn't surprise me if compilers wouldn't bother with it. – janneb May 23 '12 at 06:02
    This is actually the only direct answer to the original questions. The answer is: No. A function call is always a compiler barrier. But a function call is not guaranteed to be a memory barrier. It only is if the code in the called function contains a memory barrier. – Johannes Overmann Nov 04 '15 at 20:41
  • I propose we assert that syscalls include a barrier, or at least syscalls that copyout (like read) include an acquire barrier (aka "later reads can't go earlier") and syscalls that copyin (like write) include a release barrier (aka "earlier writes can't be issued after"), because I really don't want to have to maintain my own asm barrier library, with processor-specific implementations. If we're wrong, kernel/syscall shim maintainers can always tell us by editing/commenting, and we'll get notified. – Pierre Lebeaupin Nov 22 '19 at 10:40

The basic rule is: the compiler must make the global state appear to be exactly as you coded it, but if it can prove that a given function doesn't use global variables then it can implement the algorithm any way it chooses.

The upshot is that traditional compilers always treated functions in another compilation unit as a memory barrier because they couldn't see inside those functions. Increasingly, modern compilers are growing "whole program" or "link time" optimization strategies which break down these barriers and will cause poorly written code to fail, even though it's been working fine for years.

If the function in question is in a shared library then the compiler won't be able to see inside it, but if the function is one defined by the C standard then it doesn't need to (it already knows what the function does), so you have to be careful of those as well. Note that a compiler will not recognise a kernel call for what it is, but the very act of inserting something the compiler can't see through (inline assembler, or a call to an assembler file) will create a memory barrier in itself.

In your case, notify will either be a black box the compiler can't see inside (a library function) or else it will contain a recognisable memory barrier, so you are most likely safe.

In practice, you have to write very bad code to fall over this.

ams

In practice, he's correct and a memory barrier is implied in this specific case.

But the point is that if its presence is "debatable", the code is already too complex and unclear.

Really, guys, use a mutex or other proper constructs. It's the only safe way to deal with threads and to write maintainable code.

And maybe you'll then see other errors, such as the code being unpredictable if send() is called more than once.

amadvance