
Environment: x86-64, Linux (CentOS), 8 CPU cores.

To test false-sharing performance I wrote C++ code like this:

#include <chrono>
#include <cstdint>
#include <iostream>
using namespace std;

volatile int32_t a;  // a and b sit next to each other in one 64-byte cache line
volatile int32_t b;
int64_t p1[7];       // 56 bytes of padding
volatile int64_t c;
int64_t p2[7];       // 56 bytes of padding
volatile int64_t d;

void thread1(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        a = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 1 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread2(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        b = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 2 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread3(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        c = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 3 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}

void thread4(int param) {
    auto start = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 1000000000; ++i) {
        d = i % 512;
    }
    auto end = chrono::high_resolution_clock::now();
    cout << " 4 cost:" << chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() << endl;
}
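(Aside: the four thread functions differ only in which variable they store to, so they could be collapsed into one parameterized helper. This is a sketch of such a refactor, not the code that produced the numbers below; `store_loop` and its `iters` parameter are my own names, with `iters` replacing the hard-coded 1000000000.)

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Sketch: one helper replaces thread1..thread4; only the store target differs.
// Templated on T so it works for both the int32_t and int64_t variables.
template <class T>
void store_loop(volatile T& target, int id, size_t iters) {
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iters; ++i) {
        target = i % 512;  // volatile store: must be emitted even when optimizing
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << ' ' << id << " cost:"
              << std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count()
              << std::endl;
}
```

Each thread would then be started as, e.g., `std::thread t1(store_loop<int32_t>, std::ref(a), 1, N);` (`std::ref` because `std::thread` copies its arguments by value).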

Here is my compile command: `g++ xxx.cpp --std=c++11 -O0 -lpthread -g`, so there is no optimization (-O0).

I printed the virtual addresses of a, b, c, and d:

a addr 0x406200
b addr 0x406204
c addr 0x406258
d addr 0x406298
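As a sanity check on the layout (my own arithmetic, not part of the program): two addresses share a 64-byte cache line iff `addr / 64` is equal, and share a 128-byte aligned line pair (the granularity an adjacent-line L2 prefetcher may operate on) iff `addr / 128` is equal.

```cpp
#include <cstdint>

// Sanity-check arithmetic on the printed addresses, verified at compile time.
constexpr uintptr_t line64(uintptr_t p)  { return p / 64;  }  // 64-byte line index
constexpr uintptr_t pair128(uintptr_t p) { return p / 128; }  // 128-byte pair index

constexpr uintptr_t A = 0x406200, B = 0x406204, C = 0x406258, D = 0x406298;

static_assert(line64(A) == line64(B),   "a and b truly share one cache line");
static_assert(line64(C) != line64(A),   "c has a line of its own...");
static_assert(pair128(C) == pair128(A), "...but shares a 128-byte pair with a/b");
static_assert(pair128(D) != pair128(C), "d is isolated even at 128-byte granularity");
```

So at 64-byte granularity only a and b interfere, but at 128-byte granularity c's line is adjacent to the contended a/b line while d's is not.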

Here is the execution result:

 4 cost:2186474910
 3 cost:6114449628
 1 cost:7464439728
 2 cost:7469428696

As I understand it, there is no cache-line bouncing or false sharing between thread3 and the other threads (no other written variable shares c's cache line), so why is it slower than thread4?

Addendum: if I change `int32_t a, b` to `int64_t a, b`, the result changes to:

a addr 0x4061e0
b addr 0x4061e8
c addr 0x406238
d addr 0x406278
3 cost:2188341526
4 cost:2193782423
2 cost:6479324727
1 cost:6645607256

which is what I predicted.
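(To make the layout deterministic rather than depending on how the linker places the globals, the variables could be put in a struct with explicit alignment. A sketch, not the code used for the timings above; `Padded` is a hypothetical name.)

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: give each hot variable its own 64-byte cache line via alignas,
// instead of relying on p1/p2 padding arrays and linker ordering.
struct alignas(64) Padded {
    alignas(64) volatile int64_t a;
    alignas(64) volatile int64_t b;
    alignas(64) volatile int64_t c;
    alignas(64) volatile int64_t d;
};

static_assert(offsetof(Padded, b) - offsetof(Padded, a) == 64,
              "a and b land on different cache lines");
static_assert(sizeof(Padded) == 256, "one 64-byte line per variable");
```

To also guard against an adjacent-line (128-byte) L2 prefetcher, `alignas(128)` would be needed; C++17's `std::hardware_destructive_interference_size` exists for exactly this purpose.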

  • I calculated it: in the first case the addresses 0x406200 0x406204 0x406258 0x406298 are 4219392 4219396 4219480 4219544 in decimal. 4219392 is a multiple of 64, so it – Ryan Gao Nov 08 '21 at 07:34
  • Why would you use `-O0` and limit it to only 1 store per ~6 clock cycles, bottlenecked on store-forwarding latency of the loop counter? You're using `volatile` on the actual stores you care about. Are you intentionally benchmarking code that has some dependent loads to trigger possible memory-order mis-speculation? – Peter Cordes Nov 08 '21 at 07:36
  • L2 spatial prefetch might be causing some interference for thread C; it's in the same 128-byte aligned pair of cache lines as A and B. (Unlike D). What specific CPU model do you have? Intel has a spatial prefetcher that tries to complete adjacent lines (so for current CPUs an appropriate value for `std::hardware_destructive_interference_size` would be 128, but only 64 for `std::hardware_constructive_interference_size`); IDK about AMD's prefetchers. – Peter Cordes Nov 08 '21 at 07:44
  • This is standalone, complete code for testing the cache false-sharing problem. I use -O0 because otherwise the loop might be optimized out by g++. There is no other code besides main(), which starts `thread t1(thread1, 1);` through `thread t4(thread4, 4);`, joins them, and prints a through d. – Ryan Gao Nov 08 '21 at 07:48
  • `a = i % 512;` can't be optimized out because `a` is `volatile`. That's the whole point of using `volatile` here: every assignment to it is a visible side effect that the optimizer must respect. (With `-O0`, [*everything* is treated sort of like `volatile`.](https://stackoverflow.com/questions/53366394/why-does-clang-produce-inefficient-asm-with-o0-for-this-simple-floating-point)) – Peter Cordes Nov 08 '21 at 07:52
  • Yes, you're right about -O0. I changed to -O1 and disassembled the code; the loop is still there, and the cost changes to `a addr 0x40429c b addr 0x404298 c addr 0x404258 d addr 0x404200 4 cost:528467443 3 cost:532451691 1 cost:652654952 2 cost:654210170`. Then I will check hardware_destructive_interference_size. – Ryan Gao Nov 08 '21 at 08:14
  • Oh, the variables changed address, I think in reverse order. Put them in a struct and align the struct by 128 (or 4096) if you want to control for that. – Peter Cordes Nov 08 '21 at 08:16
  • I am not sure the standard specifies how variables are ordered in memory with regard to the order of their declarations (with the exception of non-static member variables with the same access level, which does not apply here). In such a case, you cannot make any assumptions about the "distances" between your variables in memory. The correct solution would be to enforce stricter alignment, such as with the `alignas` specifier. – Daniel Langr Nov 08 '21 at 08:35

0 Answers