26

Two different threads within a single process can share a common memory location by reading and/or writing to it.

Usually, such (intentional) sharing is implemented using atomic operations with the `lock` prefix on x86, which has fairly well-known costs both for the `lock` prefix itself (i.e., the uncontended cost) and also additional coherence costs when the cache line is actually shared (true or false sharing).

Here I'm interested in producer-consumer costs where a single thread `P` writes to a memory location, and another thread `C` reads from the memory location, both using plain reads and writes.

What is the latency and throughput of such an operation when performed on separate cores on the same socket, and, in comparison, when performed on sibling hyperthreads on the same physical core, on recent x86 cores?

In the title I'm using the term "hyper-siblings" to refer to two threads running on the two logical threads of the same core, and inter-core siblings to refer to the more usual case of two threads running on different physical cores.
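
Concretely, the pattern I have in mind is nothing more than a relaxed store on one thread and a relaxed load on the other; here is a minimal sketch (names are just for illustration) to pin down what "plain reads and writes" means here: both compile to plain `mov` instructions on x86, with no `lock`ed RMW and no fence.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> box{0};   // the shared location (illustrative name)

void producer_step(uint64_t v)
{
    // plain `mov` store on x86, no lock prefix, no fence
    box.store(v, std::memory_order_relaxed);
}

uint64_t consumer_step()
{
    // plain `mov` load on x86
    return box.load(std::memory_order_relaxed);
}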

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • Am I missing something? I believe having the P-C in different cores will make their cache lines switch back and forth between S-M and S-I states respectively. That seems very expensive (especially if no L3 is present) and I think the latency cannot be hidden in the P if it uses a `lock` prefix and in the C if there is only one dep. chain. I think you are very knowledgeable on this and you can surely measure the latency/throughput yourself, so I must miss something to fully understand this question. What is it? :) – Margaret Bloom Aug 10 '17 at 08:44
  • @MargaretBloom - indeed, my plan was to measure it myself if no one jumped on it (looks like we got one taker so far!), but I thought it was interesting enough that it could do with a question. You are right that I expect this sharing to be expensive in the inter-core case (although few x86 chips lack L3 these days), but the crux of the question is especially about whether this is really cheap on hyper-siblings, where everything is local. Intuitively, thinking about the hardware, I _think_ the answer is yes (at least for throughput), but I'm not totally sure. – BeeOnRope Aug 10 '17 at 21:06
  • For example, I am pretty sure that hyper-siblings can't snoop each others store buffer (even though this would be natural from a hardware perspective, it would break a subtle IRIW rule in x86 memory ordering), so the latency is probably bounded by below by how long the store lives in the store buffer. This question originated out of a discussion [over here](https://stackoverflow.com/questions/45556708/creating-a-friendly-timed-busy-loop-for-a-hyperthread?noredirect=1&lq=1#comment78078348_45556708). – BeeOnRope Aug 10 '17 at 21:08
  • @MargaretBloom and Bee: x86 CPUs without a large inclusive L3 are mostly AMD using [MOESI](https://en.wikipedia.org/wiki/MOESI_protocol), so they can forward dirty data between caches instead of syncing through a large inclusive L3. I think I've read that the best-case for sharing between threads on AMD Bulldozer-family can be better than on Intel. I forget what Ryzen is like, but it's different, too. (And of course supports actual SMT). – Peter Cordes Aug 11 '17 at 00:24

2 Answers

14

Okay, I couldn't find any authoritative source, so I figured I'd give it a go myself.

#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <cstdint>
#include <iostream>


alignas(128) static uint64_t data[SIZE];              // array the consumer sums over
alignas(128) static std::atomic<unsigned> shared;     // the shared producer/consumer location
#ifdef EMPTY_PRODUCER
alignas(128) std::atomic<unsigned> unshared;          // private location the "empty" producer writes instead
#endif
alignas(128) static std::atomic<bool> stop_producer;  // tells the producer to exit
alignas(128) static std::atomic<uint64_t> elapsed;    // cycles measured by the consumer

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ (
        "rdtsc"
        : "=a" (l), "=d" (h)
    );
    return ((uint64_t)h << 32) | l;
}

static void * consume(void *)
{
    uint64_t    value = 0;
    uint64_t    start = rdtsc();

    for (unsigned n = 0; n < LOOPS; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
        }
    }

    elapsed = rdtsc() - start;
    return reinterpret_cast<void*>(value);
}

static void * produce(void *)
{
    do {
#ifdef EMPTY_PRODUCER
        unshared.store(0, std::memory_order_relaxed);
#else
        shared.store(0, std::memory_order_relaxed);
#endif
    } while (!stop_producer);
    return nullptr;
}



int main()
{
    pthread_t consumerId, producerId;
    pthread_attr_t consumerAttrs, producerAttrs;
    cpu_set_t cpuset;

    for (unsigned idx = 0; idx < SIZE; ++idx) { data[idx] = 1; }
    shared = 0;
    stop_producer = false;

    pthread_attr_init(&consumerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(CONSUMER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&consumerAttrs, sizeof(cpuset), &cpuset);

    pthread_attr_init(&producerAttrs);
    CPU_ZERO(&cpuset);
    CPU_SET(PRODUCER_CPU, &cpuset);
    pthread_attr_setaffinity_np(&producerAttrs, sizeof(cpuset), &cpuset);

    pthread_create(&consumerId, &consumerAttrs, consume, NULL);
    pthread_create(&producerId, &producerAttrs, produce, NULL);

    pthread_attr_destroy(&consumerAttrs);
    pthread_attr_destroy(&producerAttrs);

    pthread_join(consumerId, NULL);
    stop_producer = true;
    pthread_join(producerId, NULL);

    std::cout << "Elapsed cycles: " << elapsed << std::endl;
    return 0;
}

Compile with the following command, replacing defines:

gcc -std=c++11 -DCONSUMER_CPU=3 -DPRODUCER_CPU=0 -DSIZE=131072 -DLOOPS=8000 timing.cxx -lstdc++ -lpthread -O2 -o timing

Where:

  • CONSUMER_CPU is the number of the cpu to run the consumer thread on.
  • PRODUCER_CPU is the number of the cpu to run the producer thread on.
  • SIZE is the size of the inner loop (matters for cache).
  • LOOPS is, well...

Here are the generated loops:

Consumer thread

  400cc8:       ba 80 24 60 00          mov    $0x602480,%edx
  400ccd:       0f 1f 00                nopl   (%rax)
  400cd0:       8b 05 2a 17 20 00       mov    0x20172a(%rip),%eax        # 602400 <shared>
  400cd6:       48 83 c2 08             add    $0x8,%rdx
  400cda:       48 03 42 f8             add    -0x8(%rdx),%rax
  400cde:       48 01 c1                add    %rax,%rcx
  400ce1:       48 81 fa 80 24 70 00    cmp    $0x702480,%rdx
  400ce8:       75 e6                   jne    400cd0 <_ZL7consumePv+0x20>
  400cea:       83 ee 01                sub    $0x1,%esi
  400ced:       75 d9                   jne    400cc8 <_ZL7consumePv+0x18>

Producer thread, with empty loop (no writing to shared):

  400c90:       c7 05 e6 16 20 00 00    movl   $0x0,0x2016e6(%rip)        # 602380 <unshared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

Producer thread, writing to shared:

  400c90:       c7 05 66 17 20 00 00    movl   $0x0,0x201766(%rip)        # 602400 <shared>
  400c97:       00 00 00 
  400c9a:       0f b6 05 5f 16 20 00    movzbl 0x20165f(%rip),%eax        # 602300 <stop_producer>
  400ca1:       84 c0                   test   %al,%al
  400ca3:       74 eb                   je     400c90 <_ZL7producePv>

The program counts the number of CPU cycles consumed, on the consumer's core, to complete the whole loop. We compare the first producer, which does nothing but burn CPU cycles, to the second producer, which disrupts the consumer by repeatedly writing to shared.

My system has an i5-4210U. That is, 2 cores, 2 threads per core. They are exposed by the kernel as Core#1 → cpu0, cpu2; Core#2 → cpu1, cpu3.

Result without starting the producer at all:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3          n/a           2.11G              1.80G

Results with empty producer. For 1G operations (either 1000*1M or 8000*128k).

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            3.20G              3.26G       # mono
    3           2            2.10G              1.80G       # other core
    3           1            4.18G              3.24G       # same core, HT

As expected, since both threads are cpu hogs and both get a fair share, the producer burning cycles slows down the consumer by about half. That's just cpu contention.

With the producer on cpu#2, as there is no interaction, the consumer runs with no impact from the producer running on another cpu.

With the producer on cpu#1, we see hyperthreading at work.

Results with disruptive producer:

CONSUMER    PRODUCER     cycles for 1M      cycles for 128k
    3           3            4.26G              3.24G       # mono
    3           2           22.1 G             19.2 G       # other core
    3           1           36.9 G             37.1 G       # same core, HT
  • When we schedule both threads on the same logical thread of the same core, there is no impact. Expected again, as the producer's writes remain local, incurring no synchronization cost.

  • I cannot really explain why I get much worse performance for hyperthreading than for two cores. Advice welcome.

spectras
  • 13,105
  • 2
  • 31
  • 53
  • This measures some throughput, but I'm not convinced it's accurate for a thread-to-thread throughput. There is no guarantee that all reads from the shared variable represent unique messages sent from the producer, and counting the same one again unfairly increases the throughput. In particular `mfence` slows down the producer so much that it definitely won't be writing every cycle (more like every 40th on HSW) but the consumer *can* read every cycle, meaning it might be reading the same thing like 40 times (which is fine but the dupes should not count). – harold Aug 10 '17 at 14:46
  • As @harold points out, I think the `mfence` is the main problem affecting the results here. It comes from a typo: `shared.store(std::memory_order_relaxed);`. The `store()` function takes _two_ arguments, the first is the value to store, the second is the memory order. You are passing the `std::memory_order_relaxed` as the _first_ argument, so it in fact stores the value of that constant with the default ordering, which is `seq_cst`! If you remove that, the `mfence` disappears. – BeeOnRope Aug 10 '17 at 21:28
  • @BeeOnRope> oops sorry I made a mistake while copy pasting, let me check again and correct before I update ^^ – spectras Aug 10 '17 at 22:18
  • @harold> actually the point is not for the consumer to see all values. I created this small program to mimic the behavior BeeOnRope wants to implement in his other, [related question](https://stackoverflow.com/posts/comments/78164692). That is, a thread does some work, and another one updates a value it uses, very often (but no guarantee whatsoever, updates can be missed). – spectras Aug 10 '17 at 22:21
  • 1
    I know, but this is the opposite problem: the consumer sees the same value too many times. If the consumer just sits there and reads the same thing a dozen times, that doesn't represent thread-to-thread throughput, because it's not coming from the other thread most of the time. – harold Aug 10 '17 at 22:24
  • @harold> I'm not measuring thread-to-thread throughput, I'm measuring how much the working thread is disrupted by having another thread update one of its working value. Have a look at the related question :) OP wants to have a secondary thread update a timestamp value a few millions time per second while the first thread uses it, and wants to know the impact of doing that instead of just letting the main thread do the computation. – spectras Aug 10 '17 at 22:27
  • Yeah, that's more or less it (for the other question) - for example, the secondary thread might do a `rdtsc` and save that value to the shared location, or may loop on an instruction like `fsqrt` that is slow but doesn't interfere with the main thread much (as @harold suggested over there). Perhaps this question is a bit too ill-defined then - the more pointless work the producer does here, the more it will compete with the consuming thread (which is the primary thread in the other question), but still I find these initial results interesting in any case. – BeeOnRope Aug 10 '17 at 22:29
  • @BeeOnRope> I fixed the code. Performance degradation is much less in the multi-core case. However, it seems as if it was inverted: I get much, much worse performance for the one-core, two-threads case. I cannot explain it as is, perhaps you can try it yourself or spot another mistake? I checked assembly, mfence is gone in producer and consumer didn't change at all. – spectras Aug 10 '17 at 22:35
  • @spectras - well the "empty producer" result make intuitive sense to me. The `mfence` instructions are slow but don't use many resources (basically they wait for the store buffer to drain). So when the producer code had an `mfence` it had minimal impact on the hyper-sibling, since it was mostly idling. Without `mfence` it is hot and probably competes a lot with the other thread, slowing down each one by about half. The "destructive" producer results are very surprising to me though - more than a 10x regression! – BeeOnRope Aug 10 '17 at 22:47
  • Maybe something weird happens when a load hits in the store buffer of the other logical thread? Perhaps there are memory ordering violations with the shared location and the rest of the memory being iterated over? I'm not sure... – BeeOnRope Aug 10 '17 at 22:47
  • Could the disruptive thread invalidate the cache for its sibling maybe? – spectras Aug 10 '17 at 23:00
  • 2
    You could look at uops_executed vs uops_retired – harold Aug 10 '17 at 23:08
  • 1
    Good idea. Perhaps cache hits and misses would be helpful too. Let's see… – spectras Aug 10 '17 at 23:18
  • `stop_producer` is in the same cache line as `shared`, and probably the end of `data[]`. Is that intentional? I guess the line will stay valid in the producer, and just flip between Modified and Shared, but memory-ordering rules mean there is some interaction between those loads and the stores to `shared`. IDK if this could be throttling the rate at which the producer can invalidate the cache line. – Peter Cordes Aug 11 '17 at 00:39
  • 2
    @harold: Probably also look at `machine_clears.memory_ordering`. Since the consumer doesn't use `pause`, the CPU running the consumer thread probably speculates that it can load `shared` early, and has to roll-back when it discovers that `shared` has a different value by the time its `data[idx]` load completes. (And those loads have to appear to happen in order). One of the reasons it can happen is: `3. cross SMT-HW-thread snoop (stores) hitting load buffer.` according to `ocperf.py list`'s output. (erratum SKL089: it may undercount for gather loads, which doesn't affect this test). – Peter Cordes Aug 11 '17 at 00:42
  • 2
    @PeterCordes> you could be on a good lead. `machine_clears.memory_ordering` is 40M for the 2-core case, and 360M for the 1-core, 2-thread case. To get a better picture I guess at some point the disrupting thread will have to be rewritten with a fixed number of writes per second. – spectras Aug 11 '17 at 00:55
  • 1
    *By the way, for previous comment, I added 64 bytes of padding in between each global (moving them to a static struct with dummy fields and checking ordering), and get the same results.* – spectras Aug 11 '17 at 01:03
  • 1
    You could have the producer aim for a write every N `rdtsc` counts, and spin on `rdtsc` when it's ahead of schedule. (Then set the target low enough that it won't get behind and do two stores without spinning?) A sketch of such a throttled producer is shown after these comments. re: alignment: One easy way is `alignas(128) static atomic ...`. 128 so the L1D adjacent-cache-line prefetcher can't cause a problem. (64 is probably fine, though.) – Peter Cordes Aug 11 '17 at 01:06
  • 1
    I was looking at your original `mfence` version in the edit history. Instead of comparing against an empty producer, try comparing against a producer that writes to a non-shared location. An empty producer might have slightly lower HT impact from not running the store-address and store-data uops. And from the less efficient loop structure (not-taken test/jcc and then a `jmp`.) Maybe change to a `do{}while(!stop_producer)` structure to help the compiler make a simpler asm loop. – Peter Cordes Aug 11 '17 at 01:28
  • 1
    Don't forget to `return nullptr;` in `producer`. clang4.0 is not at all forgiving of missing `return` in non-void functions, and compiles it to an infinite loop! Tweaked version: https://godbolt.org/g/z1heLt – Peter Cordes Aug 11 '17 at 01:40
  • @PeterCordes> updated. I normally use `-Wall -Wextra` but I forgot it there; would have caught that. So, I updated the whole post with your suggestions, except for producer throttling. Dual-core scenario got much lower performance, probably due to tighter producer loop. I have to stop there for now though, but feel free to take what you want from my post and continue in an answer of yours, I'll be glad to see that riddle solved :) – spectras Aug 11 '17 at 10:59
  • @spectras: I have several other things I want to finish experimenting with and write up. Maybe once I finish posting those Q&As and edits to some of my existing answers I'll let myself dive down this rabbit-hole. :P – Peter Cordes Aug 11 '17 at 11:10
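
To make the fixes discussed in these comments concrete, here is a rough sketch (not the code benchmarked above) of a producer that passes the value as the *first* argument to `store()`, so no `mfence` is emitted, and that throttles itself to roughly one store per `PERIOD` `rdtsc` ticks along the lines Peter Cordes suggests. `PERIOD` is a made-up tuning knob, and the globals here just stand in for the ones in the answer above.

#include <atomic>
#include <cstdint>

// Stand-ins for the globals in the answer above.
static std::atomic<unsigned> shared;
static std::atomic<bool> stop_producer;

static inline uint64_t rdtsc()
{
    unsigned int l, h;
    __asm__ __volatile__ ("rdtsc" : "=a" (l), "=d" (h));
    return ((uint64_t)h << 32) | l;
}

// Aim for at most one store to `shared` every PERIOD reference-clock ticks.
// PERIOD is arbitrary; pick it so the producer never falls behind schedule.
static const uint64_t PERIOD = 1000;

static void * produce_throttled(void *)
{
    uint64_t next = rdtsc();
    unsigned value = 0;
    do {
        while (rdtsc() < next) {
            // spin until the next slot: burns the logical core but issues no stores
        }
        // value first, memory order second: a relaxed store, plain mov, no mfence
        shared.store(value++, std::memory_order_relaxed);
        next += PERIOD;
    } while (!stop_producer.load(std::memory_order_relaxed));
    return nullptr;
}
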
11

The killer problem is that the core makes speculative reads, which means that each time a write to the speculatively read address (or more correctly, to the same cache line) occurs before the read is "fulfilled", the CPU must undo the read (at least if you're on x86), which effectively means it cancels all speculative instructions from that instruction onwards.

At some point before the read is retired it gets "fulfilled", i.e. no earlier instruction can fail, there is no longer any reason to reissue it, and the CPU can act as if it had executed all preceding instructions.

Other core example

The two cores are playing cache ping-pong in addition to cancelling instructions, so this should be worse than the HT version.

Let's start at some point in the process where the cache line with the shared data has just been marked shared because the Consumer has asked to read it.

  1. The producer now wants to write to the shared data and sends out a request for exclusive ownership of the cache line.
  2. The Consumer receives its cache line still in the shared state and happily reads the value.
  3. The consumer continues to read the shared value until the exclusive request arrives.
  4. At which point the Consumer sends a shared request for the cache line.
  5. At this point the Consumer clears its instructions from the first unfulfilled load instruction of the shared value.
  6. While the Consumer waits for the data it runs ahead speculatively.

So the Consumer can advance during the period between getting its shared cache line and having it invalidated again. It is unclear how many reads can be fulfilled at the same time, most likely 2 as the CPU has 2 read ports. And it probably doesn't need to rerun them once the internal state of the CPU is satisfied that they can't fail in between.

Same core HT

Here the two HTs share the core and must share its resources.

The cache line should stay in the exclusive state all the time, as they share the cache and therefore don't need the cache coherence protocol.

Now why does it take so many cycles on the HT core? Let's start with the Consumer just having read the shared value.

  1. The next cycle, a write from the Producer occurs.
  2. The Consumer thread detects the write and cancels all its instructions from the first unfulfilled read.
  3. The Consumer re-issues its instructions taking ~5-14 cycles to run again.
  4. Finally the first instruction, which is a read, is issued and executed, as it did not read a speculative value but a correct one, since it is at the front of the queue.

So for every read of the shared value the Consumer is reset.
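
One way to check how often this reset actually happens is to count memory-ordering machine clears around the consumer loop, which is what the `machine_clears.memory_ordering` numbers in the comments on the other answer measure. Below is a minimal sketch using Linux `perf_event_open`; the raw encoding `0x02c3` (event 0xC3, umask 0x02, i.e. MACHINE_CLEARS.MEMORY_ORDERING) is the Haswell/Skylake one and should be verified against your own CPU's event list.

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// There is no glibc wrapper for perf_event_open, so call it directly.
static int perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                           int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

// Count MACHINE_CLEARS.MEMORY_ORDERING for the calling thread.
// Raw encoding is (umask << 8) | event = 0x02c3 -- check your microarchitecture.
static int open_mo_clears_counter()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x02c3;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return perf_event_open(&attr, 0 /* this thread */, -1 /* any cpu */, -1, 0);
}

// Usage inside the consumer:
//   int fd = open_mo_clears_counter();
//   ioctl(fd, PERF_EVENT_IOC_RESET, 0);
//   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
//   ... the timed loop ...
//   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
//   uint64_t clears; read(fd, &clears, sizeof(clears));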

Conclusion

The different-core version apparently advances so much between each cache ping-pong that it performs better than the HT one.

What would have happened if the CPU waited to see if the value had actually changed?

For the test code, the HT version would have run much faster, maybe even as fast as the private-write version. The different-core version would not have run faster, as the cache miss was already covering the reissue latency.

But if the data had been different, the same problem would arise, except it would be worse for the different-core version, as it would then also have to wait for the cache line and then reissue.

So if the OP can swap some of the roles, letting the timestamp producer read from the shared location and take the performance hit, it would be better.

Read more here

Surt
  • 15,501
  • 3
  • 23
  • 39
  • Thanks. The analysis is very plausible (indeed, the presence of a large number of "machine clear" events caused by memory ordering pretty much confirms the broad strokes). What about the store buffer though? In the same core examples, the stores go into the store buffer, probably for "some time", which changes the analysis somewhat. See also Peter's comment above about "cross SMT-HW-thread snoop (stores) hitting load buffer." It seems that the stores in the same core case need to snoop the load buffer (a mini coherence protocol within the core), but it isn't clear what happens when this hits. – BeeOnRope Aug 12 '17 at 22:52
  • About checking if the value changed, in the real world the producer won't just be writing zero but an incrementing value that is likely to be different each time it is written, so optimizing for this fake case of always-zero isn't too interesting. On the other hand, the producer is likely to be writing less frequently, perhaps only every 100 cycles or so. Still, I don't see an easy way to avoid the memory order related machine clears even in that case (although they will be less frequent). – BeeOnRope Aug 12 '17 at 22:59
  • @BeeOnRope: The `pause` instruction is supposed to reduce / avoid memory-order mis-speculation when leaving a spin-loop. Perhaps you could use it before reading a shared flag even when you weren't going to spin on it. Pre-Skylake, it only pauses for ~5 cycles, so you might actually come out ahead for very frequent producer updates in this synthetic case. Maybe `pause` before every 2 loads, or something, since the consumer can probably satisfy at least 2 loads in the same cycle. The load buffer has many more entries than that, but IDK if each entry always needs its own cache-read cycle. – Peter Cordes Aug 13 '17 at 01:53
  • There's probably a sweet spot for how many loads per `pause` before you start getting memory-ordering machine clears, but it probably won't be the same as the number of loads of a single address that can be satisfied in the same cycle as the first one after a mis-speculation. (I was going to say after the cache line arrives, but in the HT-sibling case it stays valid in the core's L1D.) There's probably a higher number of loads-per-`pause` that gives you the optimal throughput, with some but not many machine clears. (Based on my totally-made-up guess about how `pause` works.) A rough sketch of such a pause-throttled consumer appears after these comments. – Peter Cordes Aug 13 '17 at 01:55
  • Yes, it had crossed my mind, but it's too bad it has to be on the consumer side, which is the side I don't want to slow down. I can't think of anything that would help on the producer side though, except some kind of "unordered store", but the ones we have (NT stores) imply flushing the line out of L1 (and beyond). Probably I should spend effort on making a more realistic case with both less dense reads and less dense writes, but it seems like the machine clears won't go away really, only reduce in frequency. – BeeOnRope Aug 13 '17 at 02:26
  • 1
    @BeeOnRope> So in the end, I guess what would be really interesting for your issue is to compare the performance hit of having another core disrupt the main one versus simply letting the main one do the computation. If it's simple enough, most probably you will get best performance from just letting the one thread do everything. I guess just try both ways and bench them, there is no way my synthetic test can come close to be as accurate as benching your actual code :) – spectras Aug 17 '17 at 13:26
  • @spectras - yes, exactly. The impact of the intra-core load buffer snoop was unexpected to me, and very big for your test, but I don't know how it will play out for writes that are more spaced out. – BeeOnRope Aug 21 '17 at 01:23
  • @BeeOnRope> I was thinking the fact that it's memory ordering issue will probably introduce some jitter. In the worst case, main thread reads the value right when the other writes it, incurring full performance impact as in my test. On the other hand, if the read of the value is not in-flight at the time the writer thread changes it, there should be no performance impact at all. Funnily, this means the potential impact will get larger with newer processors capable of more speculative executions. – spectras Aug 21 '17 at 10:49
  • I wonder if there are enough use cases for making a load instruction that just reads the current value and doesn't look back. The current situation makes all inter-thread communication extremely expensive. – Surt Aug 21 '17 at 10:57
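
Following up on the `pause` idea from the comments above: a rough, untested sketch of a consumer that inserts `_mm_pause()` every few loads. `LOADS_PER_PAUSE` is a made-up knob, and whether this actually reduces `machine_clears.memory_ordering` without costing too much consumer throughput would need to be measured.

#include <immintrin.h>   // _mm_pause
#include <atomic>
#include <cstdint>

static const unsigned SIZE = 131072;     // stand-ins for the benchmark's -D defines
static uint64_t data[SIZE];
static std::atomic<unsigned> shared;

// Arbitrary placeholder; the comments above suggest there is a sweet spot
// somewhere between pausing on every load and never pausing at all.
static const unsigned LOADS_PER_PAUSE = 8;

static uint64_t consume_with_pause(unsigned loops)
{
    uint64_t value = 0;
    for (unsigned n = 0; n < loops; ++n) {
        for (unsigned idx = 0; idx < SIZE; ++idx) {
            value += data[idx] + shared.load(std::memory_order_relaxed);
            if (idx % LOADS_PER_PAUSE == 0)
                _mm_pause();   // spin-wait hint, ~5 cycles pre-Skylake; intended to reduce memory-ordering mis-speculation
        }
    }
    return value;
}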