Section 8.2.3.4 of Intel's processor manual states that loads may be reordered with earlier stores to different locations, but not with earlier stores to the same location.

So I understand that the following two operations can be reordered:

x = 1;
y = z;

And that the following two operations cannot be reordered:

x = 1;
y = x;

But what happens when the store and the load are to different locations, but the load completely encompasses the store, e.g.:

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

my_union_t var;
var.shared_var = 0;

var.individual_var[1] = 1;
uint64_t y = var.shared_var;

So can 'y' in this case be 0?

EDIT (@Hans Passant): To explain the situation further, I'm trying to see whether I can use this technique to devise a sort of quasi-synchronisation between threads without using locked instructions.

So a more specific question is, given a global variable:

my_union_t var;
var.shared_var = 0;

And two threads executing the following code:

Thread 1:

var.individual_var[0] = 1;
int y = __builtin_popcountl(var.shared_var);

Thread 2:

var.individual_var[1] = 1;
int y = __builtin_popcountl(var.shared_var);

Can 'y' be 1 for both threads?

Note: __builtin_popcountl is the GCC builtin that counts the number of bits set in an unsigned long.

Peter Cordes
syljak
  • Ordering only plays a role when multiple cores access memory locations, which isn't visible from your snippet; the posted code can never fail. Having multiple threads read and write the same memory location without synchronization has no practical use. – Hans Passant Aug 05 '12 at 17:27
  • Thinking about coherency and ordering issues from C complicates matters because of the reordering and optimizations the compiler itself may perform. For example, your two-thread example will be broken since you haven't told the compiler the union is volatile. – srking Aug 06 '12 at 15:21

2 Answers

Your final and most important question¹:

And two threads executing the following code:

Thread 1:

var.individual_var[0] = 1;
int y = __builtin_popcountl(var.shared_var);

Thread 2:

var.individual_var[1] = 1;
int y = __builtin_popcountl(var.shared_var);

Can 'y' be 1 for both threads?

Yes, it can, but without testing it isn't obvious that chips will actually do this, since loads that overlap an earlier, narrower store aren't covered in the SDM (Intel's Software Developer's Manual).

This case is basically a combination of the 8.2.3.4 (store buffering) and 8.2.3.5 (store forwarding) cases. Part of the result may come from the current local store, and the rest of the result has to come from globally visible stores (i.e., from "memory").

Can a CPU give you the result 1 for both threads? Yes: some current Intel CPUs will satisfy part of the load from the store buffer and the rest of the load from L1. Since neither store has yet become globally visible (both are still sitting in their cores' store buffers), you can get var.individual_var[0] == 1 && var.individual_var[1] == 0 on thread 1 and var.individual_var[0] == 0 && var.individual_var[1] == 1 on thread 2.

User Alex has actually written test code for this and demonstrated the effect in this very relevant answer. So no, there is no magic trick here: you can't build your own lock-free synchronization like this on all CPUs.

By the way: this may work on some CPUs! When a load could only be partially store-forwarded, some models may take the easy way out and just wait until the store commits to L1, and then read the whole value from L1. In this case, your trick would work... but it ends up not buying you much. You have to wait for the whole store buffer to drain, which is the main cost of a memory fence anyway! So you get the memory fence effect, at the cost of a memory-fence-sized stall.
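
For comparison, the conventional way to get the ordering the question is after is an explicit full fence between the narrow store and the wide load. Below is a rough sketch, assuming C11's atomic_thread_fence and GCC's __builtin_popcountll; set_flag_and_count is a hypothetical helper name, and the volatile qualifier is only there to keep the compiler from eliding or merging the two accesses. Strictly speaking, concurrent mixed-size non-atomic accesses like these are a data race in ISO C, so treat this as an illustration of the hardware-level idea rather than portable code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

static_assert(sizeof(my_union_t) == sizeof(uint64_t), "union should be 8 bytes");

/* Hypothetical helper (not from the question): each thread passes its own
 * idx (0 or 1). The seq_cst fence sits between the narrow store and the
 * wide load, so the load is not satisfied before the store has left the
 * store buffer. */
int set_flag_and_count(volatile my_union_t *var, int idx) {
  var->individual_var[idx] = 1;               /* narrow 32-bit store */
  atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
  return __builtin_popcountll(var->shared_var); /* wide 64-bit load */
}
```

With the fence in place, each thread's store should be globally visible before its own load reads anything, so on x86 at least one of the two threads is expected to observe both bits set. The price is exactly the store-buffer drain described above.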


¹ The answer to the earlier single-threaded "So can 'y' in this case be 0?" case is obviously "no": the CPU maintains the illusion of in-order execution, so if you write something and immediately read it back you'll always see the write (absent other threads writing the same location), regardless of how the write and read overlap.
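
The single-threaded case is easy to check directly. Here is a minimal sketch, assuming a C11 compiler; the helper name store_half_then_load_all is made up for illustration and does not appear in the question. It returns the full 64-bit value rather than an int, because on little-endian x86 the store to individual_var[1] lands in the upper 32 bits, which a conversion to int would discard.

```c
#include <assert.h>
#include <stdint.h>

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

static_assert(sizeof(my_union_t) == sizeof(uint64_t), "union should be 8 bytes");

/* Hypothetical helper: a narrow store followed by a wider load that covers
 * it, all on one thread. volatile forces the compiler to emit both
 * accesses instead of folding the result at compile time. */
uint64_t store_half_then_load_all(void) {
  volatile my_union_t var;
  var.shared_var = 0;
  var.individual_var[1] = 1;  /* 32-bit store into one half */
  return var.shared_var;      /* 64-bit load covering that store */
}
```

On little-endian x86 this should return 0x100000000; the point is only that the same thread never reads back 0.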

BeeOnRope

The CPU doesn't know or care that you've aliased the memory location. As such, the answer to your first question is "No."

The writes in your second example are not synchronized, so, yes, it's possible for the threads to have their own copies of the data.

The answer to the question you didn't ask ("Should I implement and use a custom synchronization primitive?") is "No."

geppy