
I'm trying to test the performance impact of false sharing. The test code is as follows:

#include <windows.h>   // SetThreadAffinityMask
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>
using namespace std;

constexpr uint64_t loop = 1000000000;

struct no_padding_struct {
    no_padding_struct() :x(0), y(0) {}
    uint64_t x;
    uint64_t y;
};

struct padding_struct {
    padding_struct() :x(0), y(0) {}
    uint64_t x;
    char padding[64];
    uint64_t y;
};

alignas(64) volatile no_padding_struct n;
alignas(64) volatile padding_struct p;

constexpr uint64_t core_a = 0;
constexpr uint64_t core_b = 1;

void func(volatile uint64_t* addr, uint64_t b, uint64_t mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
    for (uint64_t i = 0; i < loop; ++i) {
        *addr += b;
    }
}

void test1(uint64_t a, uint64_t b) {
    thread t1{ func, &n.x, a, 1<<core_a };
    thread t2{ func, &n.y, b, 1<<core_b };

    t1.join();
    t2.join();
}

void test2(uint64_t a, uint64_t b) {
    thread t1{ func, &p.x, a, 1<<core_a  };
    thread t2{ func, &p.y, b, 1<<core_b  };

    t1.join();
    t2.join();
}

int main() {
    uint64_t a, b;
    cin >> a >> b;


    auto start = std::chrono::system_clock::now();
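    // uncomment exactly one test per measurement run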
    //test1(a, b);
    //test2(a, b);
    auto end = std::chrono::system_clock::now();
    cout << (end - start).count();
}
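
For reference only, and not something I measured: the manual 64-byte padding can also be expressed with C++17's std::hardware_destructive_interference_size from <new> (typically 64 on x86). A minimal sketch:

#include <cstdint>
#include <new>      // std::hardware_destructive_interference_size

struct padded_pair {
    // each member gets its own cache line, so writes to x and y never share one
    alignas(std::hardware_destructive_interference_size) uint64_t x = 0;
    alignas(std::hardware_destructive_interference_size) uint64_t y = 0;
};

static_assert(sizeof(padded_pair) >= 2 * std::hardware_destructive_interference_size,
              "x and y end up in separate cache lines");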

The result was mostly as follows:

x86 (32-bit)

cores   test1 debug   test1 release   test2 debug   test2 release
0-0     4.0s          2.8s            4.0s          2.8s
0-1     5.6s          6.1s            3.0s          1.5s
0-2     6.2s          1.8s            2.0s          1.4s
0-3     6.2s          1.8s            2.0s          1.4s
0-5     6.5s          1.8s            2.0s          1.4s

x64 (64-bit)

cores   test1 debug   test1 release   test2 debug   test2 release
0-0     2.8s          2.8s            2.8s          2.8s
0-1     4.2s          7.8s            2.1s          1.5s
0-2     3.5s          2.0s            1.4s          1.4s
0-3     3.5s          2.0s            1.4s          1.4s
0-5     3.5s          2.0s            1.4s          1.4s


My CPU is an Intel Core i7-9750H. 'core0' and 'core1' are the two logical processors of one physical core, as are 'core2' and 'core3', and so on. MSVC 14.24 was used as the compiler.

The time recorded is an approximate best score over several runs, since there were plenty of background tasks running. I think this is fair enough, because the results fall into clearly separated groups and an error of 0.1s~0.3s does not affect that grouping.
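
For reference, the "best of several runs" measurement can be automated with a small helper (a sketch with a hypothetical best_of wrapper, not the code I actually used):

#include <algorithm>
#include <chrono>

// run f() `runs` times and return the fastest wall-clock time
template <class F>
std::chrono::steady_clock::duration best_of(int runs, F f) {
    auto best = std::chrono::steady_clock::duration::max();
    for (int r = 0; r < runs; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, t1 - t0);
    }
    return best;
}

// usage: auto t = best_of(5, [&] { test1(a, b); });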

Test2 is quite easy to explain. Since x and y are in different cache lines, running on 2 physical cores gives roughly a 2x performance boost (e.g. 2.8s on a single core vs 1.4s on two physical cores in release; the cost of context switching when running 2 threads on one core is negligible here). Running the two threads on one core with SMT is less efficient than on 2 physical cores, limited by the throughput of Coffee Lake (I believe Ryzen would do slightly better), but still more efficient than temporal multithreading. It also seems that 64-bit mode is more efficient here.

But the result of test1 is confusing to me. First, in debug mode, 0-2, 0-3, and 0-5 are slower than 0-0, which makes sense: I explained this as the data being moved from L1 to L3 and back repeatedly, since the caches must stay coherent between the 2 cores, whereas the data would always stay in L1 when running on a single core. But this theory conflicts with the fact that the 0-1 pair is always the slowest: technically, those two threads share the same L1 cache, so 0-1 should run about twice as fast as 0-0.

Second, in release mode, 0-2, 0-3, and 0-5 were faster than 0-0, which disproved the theory above.

Last, 0-1 runs slower in release than in debug, in both 64-bit and 32-bit mode. That is the part I understand least. I read the generated assembly code but did not find anything helpful.

Yuki N
  • Check performance counters for `machine_clear.memory_ordering` (e.g. using VTune or whatever else is usable on Windows). [What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?](https://stackoverflow.com/q/45602699) is similar. Separate cores might be letting more stores build up in the store buffer, less frequent machine clears. Related: [Why does false sharing still affect non atomics, but much less than atomics?](//stackoverflow.com/q/61672049) (non-atomic RMW can store-forward without having the line) – Peter Cordes May 20 '20 at 20:20
  • Also semi-related: [C++ latency increases when memory ordering is relaxed](https://stackoverflow.com/q/61649951) – Peter Cordes May 20 '20 at 20:21
  • Your `volatile uint64_t* addr` increments are non-atomic, just forcing store/reload that creates a store-forwarding bottleneck like in [Why does false sharing still affect non atomics, but much less than atomics?](https://stackoverflow.com/q/61672049). So yeah, I expect that Hyperthreading actually hurts because it makes memory ordering machine clears much more frequent, instead of only happening when the cache line actually ping pongs between phys cores. – Peter Cordes May 20 '20 at 20:25

2 Answers


@PeterCordes Thank you for your analysis and advice. I finally profiled the program using VTune, and it turns out your expectation was correct.

When running on SMT threads of the same core, machine_clear consumes a lot of time, and it is more severe in Release than in Debug. This happens in both 32-bit and 64-bit mode.

When running on different physical cores, the bottleneck is memory (store latency and false sharing), and Release is always faster since it contains significantly fewer memory accesses than Debug in the critical part, as shown in the Debug assembly (godbolt) and the Release assembly (godbolt). The total number of instructions retired is also lower in Release, which strengthens this point. It seems the assembly I found in Visual Studio yesterday was not correct.
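
To make "fewer memory accesses in the critical part" concrete, here is a rough C++-level sketch of the two loop shapes (an illustration of how an unoptimized build keeps locals in memory, not the actual MSVC output):

#include <cstdint>

constexpr uint64_t loop = 1000000000;   // same iteration count as in the question

// Release-like: i and b stay in registers, so the only memory traffic per
// iteration is the load + store of *addr that volatile forces.
void release_like(volatile uint64_t* addr, uint64_t b) {
    for (uint64_t i = 0; i < loop; ++i)
        *addr += b;
}

// Debug-like: the loop counter (and arguments) also live in memory, so every
// iteration performs several extra loads and stores besides the access to *addr.
void debug_like(volatile uint64_t* addr, uint64_t b) {
    volatile uint64_t i = 0;            // stand-in for "i is spilled to the stack"
    while (i < loop) {
        *addr += b;
        i = i + 1;
    }
}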

Yuki N

This might be explained by hyper-threading. Two hyperthreads sharing one core do not get double the throughput the way 2 entirely separate cores might. Instead you might get something like 1.7 times the performance.

Indeed, your processor has 6 cores and 12 threads, and core0/core1 are 2 threads on the same underlying core, if I am reading all this correctly.

In fact, if you picture in your mind how hyper-threading works, with the work of 2 separate threads interleaved on one core, it is not surprising.

Sean F
  • Yes, that's true in general, and the OP's data shows that. e.g. comparing release mode between the `0-1` case (HT contention) vs the `0-2` case (separate phys cores) shows that there is a speedup. But that's not what the question is about; it's about release vs. debug mode when two threads *are* sharing a physical core. The rest of the data is just useful testing of other conditions. – Peter Cordes May 20 '20 at 19:50
  • Yes, @PeterCordes in the title the person emphasized debug vs release, but in the discussion compared pairings of some cores to other cores. – Sean F May 20 '20 at 19:54
  • You're ignoring the fact that `test1` has both threads modifying the same cache line. `test2` doesn't, so the results are as expected, and the question already says "*running on one core with SMT is less efficient than 2 physical cores*". You're only answering the part already answered in the question itself. Also notice that in release-mode, there's more than 2x difference between the `0-1` and `0-2` cases, so simple contention for front-end throughput / back-end execution ports is not a sufficient explanation. (Instead presumably memory ordering machine clears) – Peter Cordes May 20 '20 at 20:02