10

According to CERT coding rule POS49-C, it is possible that different threads accessing different fields of the same structure may conflict.

Instead of bit-fields, I use regular unsigned ints:

struct multi_threaded_flags {
  unsigned int flag1;
  unsigned int flag2;
};

struct multi_threaded_flags flags;

void thread1(void) {
  flags.flag1 = 1;
}

void thread2(void) {
  flags.flag2 = 2;
}

I can see that even with unsigned int, there can still be a race condition IF the compiler decides to load/store 8 bytes instead of 4 bytes. I think the compiler will never do that and the race condition will never happen here, but that's completely just my guess.
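
To illustrate, this is the transformation I'm afraid of, written out as C (purely hypothetical codegen, assuming 4-byte unsigned int, no padding, and little-endian layout; the function name is mine):

#include <stdint.h>
#include <string.h>

/* If the compiler implemented `flags.flag1 = 1;` like this, a concurrent
   store to flags.flag2 by thread2 could be lost: */
void thread1_feared_codegen(void) {
  uint64_t tmp;
  memcpy(&tmp, &flags, 8);            /* load BOTH flags as one 8-byte word */
  tmp = (tmp & ~0xFFFFFFFFull) | 1u;  /* change only flag1's 4 bytes */
  memcpy(&flags, &tmp, 8);            /* store 8 bytes back, clobbering flag2 */
}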

Is there any well-defined assembly/compiler documentation regarding this case? I hope locking, which is costly, is the last resort if this situation turns out to be undefined.

FYI, I use gcc.

Peter Cordes
Bankde
  • See @Mystical's comment to this question: https://stackoverflow.com/questions/46916696/on-a-64-bit-machine-can-i-safely-operate-on-individual-bytes-of-a-64-bit-quadwo. Basically, as long as you're not racing over the same byte, you should be fine. – MFisherKDX Oct 30 '17 at 04:36
  • No the compiler should not generate code to write 8 bytes, as that also requires a read-operation and bit masking (to get the value of the "flag" you're not assigning to) which takes up valuable instructions. As long as no other thread is accessing these fields then you're safe. – Some programmer dude Oct 30 '17 at 04:38
  • http://port70.net/~nsz/c/c11/n1570.html#5.1.2.4p27 in C11... not guaranteed necessarily before. – Antti Haapala -- Слава Україні Oct 30 '17 at 04:46
  • Related: https://stackoverflow.com/questions/46522451/why-is-gcc-allowed-to-speculatively-load-from-a-struct. – Lundin Oct 30 '17 at 07:35
  • @Lundin: only slightly related. *loading* is always fine, but turning a 4-byte store into an 8-byte non-atomic read-modify-write is not fine. e.g. code that always stores one member, but only stores the second member inside an `if()` can't legally be optimized to asm that loads both at once and stores back with the same value if the `if()` condition is false. The question you linked is only about speculative loads of whole objects. – Peter Cordes Oct 30 '17 at 07:45
  • @PeterCordes Yes I know, but it shows that the compiler _may_ read the whole struct. – Lundin Oct 30 '17 at 07:51
  • Also, [this footnote for speculative reads](http://port70.net/~nsz/c/c11/n1570.html#5.1.2.4p28) – Antti Haapala -- Слава Україні Oct 30 '17 at 08:24
  • @AnttiHaapala Imho, they are so hard to read. I can't find a definition of "potentially shared memory". If I'm not mistaken, it means that these cases are defined-safe in the C11 standard but undefined in any standard before that, right? – Bankde Oct 30 '17 at 08:53
  • @Bankde Earlier versions of the standard didn't have a memory model for concurrent accesses and therefore no need to address this situation. POSIX didn't address it either and as far as I know, at least early Alpha machines didn't actually have instructions to read/write individual bytes, leaving no way out of this kind of problem. – fuz Oct 30 '17 at 08:54
  • While it looks as if it is defined in C++11 as separate, on a real platform you may hit that platform's own constraints. For example, on modern x86 these two writes from two different threads will collide in the same 64B memory block, so the two cores involved will have to sync the write to that memory area, and in the end you will get roughly the same performance as a single thread. So unless these flags are set once per thousands of units of work, don't do that (for example, two threads working over a big array with interleaved ints are 2x worse than two threads working on halves of a big array). – Ped7g Oct 30 '17 at 09:16
  • In practice, most multi-threaded code did assume this before C11, and C11 mostly just standardized what compilers were already doing, and what code already depended on. See p2 of Herb Sutter's talk on C++11 (https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/); he mentioned that some compilers did have bugs that introduced non-atomic RMW where the source didn't have one, and that was already a problem before C++11 but the main difference is that when you report it you can point out that it's a standards violation. (C11 is the same as C++11 for this). – Peter Cordes Oct 30 '17 at 09:24
  • @Ped7g: Two threads ping-ponging a cache line is *far* worse than a single thread. (except when the store buffer saves you from some of the badness: https://stackoverflow.com/questions/46919032/why-does-using-the-same-cache-line-from-multiple-threads-not-cause-serious-slowd) – Peter Cordes Oct 30 '17 at 09:26

2 Answers

10

The C11 memory model guarantees that accesses to distinct structure members (which aren't part of a bit-field) are independent, so you'll run into no problems modifying the two flags from different threads (i.e., the "load 8 bytes, modify 4, and write back 8" scenario is not allowed).

This guarantee does not extend in general to bitfields, so you have to be careful there.

Of course, if you are concurrently modifying the same flag from more than one thread, you'll likely trigger the prohibition against data races, so don't do that.
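
(If you did need to write the same flag from more than one thread, a minimal C11 sketch would use _Atomic members rather than a mutex; the names here are mine, assuming <stdatomic.h> is available:)

#include <stdatomic.h>

struct multi_threaded_flags_c11 {
  atomic_uint flag1;
  atomic_uint flag2;
};

struct multi_threaded_flags_c11 aflags;

void set_flag1(void) {
  /* relaxed suffices when the flag itself doesn't order other data */
  atomic_store_explicit(&aflags.flag1, 1, memory_order_relaxed);
}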

BeeOnRope
  • Exactly what I need. Thank you. Out of curiosity, can concurrently modifying the same flag by setting it to a constant value (exactly as in the example) also cause data races? – Bankde Oct 30 '17 at 09:15
  • Based on the standard, probably yes. Admittedly, I'm going by my knowledge of the C++ standard, which definitely forbids it, but I'm not 100% sure of the corresponding language in C11. _In practice_ it would be fine on most hardware. If you really need to do it, ask another question. In C++ the answer is to use `memory_order_relaxed` writes to a `std::atomic` object if you want to do that, and the performance ends up the same. – BeeOnRope Oct 30 '17 at 09:19
  • *"so you'll run into no problems"* - no problem with coherence of data, but performance-wise he will very likely run into problems on most platforms. As he is trying to avoid mutex_lock, he may easily end up with a solution which has surprisingly similar (bad) performance to the lock variant (if the lock variant was written in an optimal way). Designing data structures and algorithms for the particular target platform is still essential for *full* performance; the abstract C++ machine can't cover that, as it's abstract and general. – Ped7g Oct 30 '17 at 09:24
  • @Ped7g maybe, who knows? Since it's a flag setting, perhaps they are set infrequently and performance is fine for a mostly read-only load. Perhaps performance isn't a concern on the scale of nanoseconds here. If the OP is curious about this, he is free to ask another question I suppose. I agree with the rest of your comment, but it doesn't seem particularly relevant here (if every question about what is legal in a language also had to include all aspects of performance, the answers and comments would never end...). – BeeOnRope Oct 30 '17 at 09:25
  • Thanks for both comments, very insightful. – Bankde Oct 30 '17 at 10:03
  • *"I hope locking, which is costly, is the last resort when this situation happens to be undefined."* - he wants to avoid the performance cost, so a warning that it doesn't work that simply is probably in order. Then again, the OP doesn't specify platform and compiler, so it's quite meaningless to reason about performance implications. In a hypothetical general sense there may be a platform+compiler for which the lock solution is actually fastest, so the OP's assumption in the question is already wrong. (I don't have a real problem with your answer, I'm just trying to warn the OP by extending it.) – Ped7g Oct 30 '17 at 10:03
  • To be clear, I think it _does work_ for scenarios which are "read mostly" which is exactly what flag setting often is. A flag may be set a few times and read billions of times. Is that the scenario? I don't know - but it is very common for flags (many are set only once). @Ped7g – BeeOnRope Oct 30 '17 at 10:20
  • Apparently my assumption regarding performance is not quite correct. Please don't argue here; I will create a new question on this topic later. The original topic is just to make sure about the race condition. Many thanks to both of you. Is there anything I should read first? E.g. related Stack Overflow questions, hidden platform locking, cpu/mem cache, etc.? – Bankde Oct 30 '17 at 11:44
  • All of the above, I guess? It's a complex topic; I only recommend asking small, specific questions to prevent off-topic comments :). – BeeOnRope Oct 30 '17 at 11:56
  • @BeeOnRope Starting from probably-related questions on Stack Overflow, so I don't create a duplicate one? :) – Bankde Oct 30 '17 at 15:59
1

Before C11, ISO C had nothing to say about threads; writing multi-threaded code relied on other standards (e.g. POSIX, which defines a memory model for pthreads) and, in practice, on the way real compilers worked.

Note that this CERT coding-standard rule is in the POSIX section, and appears to be about pthreads without C11. (There's a CON32-C, "Prevent data races when accessing bit-fields from multiple threads", rule for C11, where they solve the bit-field concurrency problem by simply promoting the bit-fields to unsigned char, which C11 defines as "separate memory locations". This rule appears to be an insufficiently-edited copy of that, because many of its suggestions suck.)
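
For reference, CON32-C's compliant solution boils down to something like this (my paraphrase, not a verbatim copy of the rule's code):

struct multi_threaded_flags {
  unsigned char flag1;   /* each plain member is its own "memory location"  */
  unsigned char flag2;   /* per C11 3.14, so concurrent writes can't clash  */
};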

But unfortunately POSIX pthreads doesn't clearly define what a "memory location" is, and this is all they have to say on the subject:

Single UNIX® Specification, Version 4, 2016 Edition (online HTML copy, requires free registration)

4.12 Memory Synchronization

Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads.

This is why C11 defines it more clearly, where only bitfields are dangerous to write from different threads (barring compiler bugs).

However, I think everyone (including all compilers) agreed that separate int variables / struct members / array elements were separate "memory locations". Most real-world software doesn't take any special precautions for int or char variables that may be written by separate threads (especially outside of structs).

A compiler that gets int wrong will cause problems all over the place unless the bug is limited to very specific circumstances. Most bugs like this are very hard to detect with testing, because usually the other data that's non-atomically loaded and stored back isn't written by another thread very often / ever. But if a compiler always did that for every int, problems would show up in some software pretty quickly.

Normally, separate char members would also be considered separate "memory locations", but some pre-C11 implementations might have exceptions to that rule. (Especially on early Alpha AXP which famously has no byte store instruction (so a C11 implementation would have to use 32-bit char), but optimizations that invent writes when updating multiple members can happen anywhere, either by accident or because the compiler developers define "memory location" as a 32 or 64-bit word.)

There's also the issue of compiler bugs. This can affect even compilers that intend to conform to C11. One example is gcc bug 52080, which affected some non-x86 architectures. (Discovered in gcc4.7 in 2012, fixed in gcc4.8 a couple of months later.) Using a bit-field "tricked" the compiler into doing a non-atomic read-modify-write of the containing 64-bit word, even though that included a non-bitfield member. (Bit-fields are bait for compiler bugs. Any defensive / safe-coding standard should recommend avoiding them in structs where different members can be modified from different threads. And definitely don't put them next to the actual lock.)
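
A sketch of the kind of layout that tickled that bug (not the exact code from the bug report):

struct s {
  unsigned int flag : 1;  /* writing a bit-field is a RMW of its storage unit */
  unsigned int counter;   /* separate memory location per C11, but buggy gcc  */
};                        /* did a 64-bit RMW covering both members           */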

Herb Sutter's talk atomic<> Weapons: The C++ Memory Model and Modern Hardware part 2 goes into some detail about the kinds of compiler bugs that have affected multi-threaded code. Most of these should be shaken out by now (2017) if you're using a modern compiler. Most things like inventing writes (or non-atomic read and write-back of the same value) were usually still considered bugs before C11; C11 mostly just firmed up the rules compilers were already trying to follow. It also made it easier to report such bugs, because you could say unequivocally that it violates the standard instead of just "it breaks my code".


That coding-rule article is poorly written. Its examples with adjacent bit-fields are unsafe, but it claims that all variables are at risk. This is not true in general, especially not with C11. Many users of pthreads can or already do compile with C11 compilers.

(The phrasing I'm referring to is "Bit-fields are especially prone to this behavior", which incorrectly implies that this is allowed to happen with ordinary members of structs, or variables that happen to be adjacent outside of structs)

It's part of a defensive coding standard, but it should definitely make the distinction between what the standards require and what is just belt-and-suspenders defense against compiler bugs.


Also, putting variables that will usually be accessed by different threads into one struct is generally terrible. False sharing of a cache line (typically 64 bytes) is really bad for performance, causing cache misses and (on out-of-order x86 CPUs) memory-ordering mis-speculation (like a branch mispredict requiring a roll-back.) Putting separately-used shared variables into the same byte with bit-fields is even worse, because it prevents efficient stores (any store has to be a RMW of the containing byte).
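
If the members really are mostly touched by different threads, the usual mitigation is to give each one its own cache line; a sketch assuming 64-byte lines and a C11 compiler:

#include <stdalign.h>

struct padded_flags {
  alignas(64) unsigned int flag1;   /* thread 1's cache line */
  alignas(64) unsigned int flag2;   /* thread 2's cache line */
};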

Solving the bit-field problem by promoting the two bit-fields to unsigned char makes much more sense than using a mutex if they need to be independently writeable from separate threads. Or even unsigned long if you're paranoid.

If the two members are often used together, it makes sense to put them nearby. But if you're going to pad to a whole long to contain both members (like that article does), you might as well make them at least unsigned char or bool instead of 1-byte bitfields.

Although honestly, having two threads modify separate members of a struct at the same time seems like poor design unless one of the members is the lock, and the modification is part of an attempt to take the lock. Using a bit-field as a lock is a bad idea unless you're writing for a specific ISA and building your own lock primitive using something like x86's lock bts instruction to atomically test-and-set a bit. Even then it's a bad idea unless you need to pack it with other bitfields for space savings; the Linux code that exposed the gcc bug with an int lock:1 member was a horrible idea.
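
For illustration only, building such a lock-bit primitive yourself might look like this with GCC's __atomic builtins (a sketch with names of my choosing; on x86, gcc can emit a lock-prefixed RMW, potentially lock bts, for the fetch-or):

static unsigned int word;   /* bit 0 is the lock; other bits stay usable */

static void lock_bit(void) {
  /* spin until we're the thread that flipped the bit from 0 to 1 */
  while (__atomic_fetch_or(&word, 1u, __ATOMIC_ACQUIRE) & 1u)
    ;
}

static void unlock_bit(void) {
  __atomic_fetch_and(&word, ~1u, __ATOMIC_RELEASE);
}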

In addition, the flags are declared volatile to ensure that the compiler will not attempt to move operations on them outside the mutex.

If your compiler needs this, your compiler is seriously broken, and will create broken code for most multi-threaded programs. (Unless the compiler bug only happens with bit-fields, because shared bit-fields are rare).

Most code doesn't make shared variables volatile, and relies on the guarantee that mutex lock/unlock stops operations from reordering at compile or run time out of the critical section.
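
i.e. the usual pattern, with no volatile in sight (a pthreads sketch):

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static unsigned int shared_count;   /* plain, not volatile */

void increment(void) {
  pthread_mutex_lock(&m);    /* acquire: the access below can't hoist above this */
  shared_count++;
  pthread_mutex_unlock(&m);  /* release: and it can't sink below this */
}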


Back in 2012, and possibly still today, gcc -pthread might affect code-gen choices in C89/C99 mode (-std=gnu99). In discussion on an LWN article about that gcc bug, one user claimed that -pthread would prohibit the compiler from doing a 64-bit load/store when modifying a 32-bit variable, but that without -pthread it could do so (although on most architectures, IDK why it would). But it turns out that gcc bug manifested even with -pthread, so it was really a bug rather than an aggressive optimization choice.


ISO C11 standard:

N1570, section 3.14 definitions:

  1. memory location: either an object of scalar type, or a maximal sequence of adjacent bit-fields all having nonzero width

  2. NOTE 1 Two threads of execution can update and access separate memory locations without interfering with each other.

  3. A bit-field and an adjacent non-bit-field member are in separate memory locations. ... It is not safe to concurrently update two non-atomic bit-fields in the same structure if all members declared between them are also (non-zero-length) bit-fields, no matter what the sizes of those intervening bit-fields happen to be.
  4. (...gives an example of a struct with bit-fields...)

So in C11, you can't assume anything about the compiler munging other bit-fields when writing one bit-field, but otherwise you're safe. Unless you use a separator :0 field to force the compiler to pad enough (or use atomic bit-ops) so that it can update your bit-field without concurrency problems for other fields. But if you want to be safe, it's probably not a good idea to use bit-fields at all in structs that are written by multiple threads at once.
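
A sketch of the :0 separator trick, if you must keep bit-fields:

struct bf {
  unsigned int f1 : 1;
  unsigned int    : 0;   /* zero-width bit-field closes the memory location, */
  unsigned int f2 : 1;   /* so f1 and f2 can be written by different threads */
};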

See also other Notes in the C11 standard, e.g. the one linked by @Fuz in Is it well-defined behavior to modify one element of an array while another thread modifies another element of the same array? that explicitly says that compiler transformations that would make this dangerous are disallowed.

Peter Cordes
  • I started this when the question was posted, but it was getting too long and messy. It's still not great but I think good enough to post now. – Peter Cordes Nov 15 '17 at 00:11