I'm trying to test the performance impact of false sharing. The test code is below:
#include <cstdint>
#include <chrono>
#include <iostream>
#include <thread>
#include <windows.h>

using namespace std;

constexpr uint64_t loop = 1000000000;

struct no_padding_struct {
    no_padding_struct() : x(0), y(0) {}
    uint64_t x;
    uint64_t y;
};

struct padding_struct {
    padding_struct() : x(0), y(0) {}
    uint64_t x;
    char padding[64];   // keeps x and y on different cache lines
    uint64_t y;
};

alignas(64) volatile no_padding_struct n;
alignas(64) volatile padding_struct p;

constexpr uint64_t core_a = 0;
constexpr uint64_t core_b = 1;

void func(volatile uint64_t* addr, uint64_t b, uint64_t mask) {
    SetThreadAffinityMask(GetCurrentThread(), mask);
    for (uint64_t i = 0; i < loop; ++i) {
        *addr += b;
    }
}

void test1(uint64_t a, uint64_t b) {
    thread t1{ func, &n.x, a, 1 << core_a };
    thread t2{ func, &n.y, b, 1 << core_b };
    t1.join();
    t2.join();
}

void test2(uint64_t a, uint64_t b) {
    thread t1{ func, &p.x, a, 1 << core_a };
    thread t2{ func, &p.y, b, 1 << core_b };
    t1.join();
    t2.join();
}

int main() {
    uint64_t a, b;
    cin >> a >> b;                       // read the increments at run time so they are not compile-time constants
    auto start = std::chrono::system_clock::now();
    //test1(a, b);                       // uncomment one of these per run
    //test2(a, b);
    auto end = std::chrono::system_clock::now();
    cout << (end - start).count();       // elapsed time in system_clock ticks
}
The results were mostly as follows:
         |              x86              |              x64
         |     test1     |     test2     |     test1     |     test2
 cores   | debug release | debug release | debug release | debug release
 0-0     | 4.0s  2.8s    | 4.0s  2.8s    | 2.8s  2.8s    | 2.8s  2.8s
 0-1     | 5.6s  6.1s    | 3.0s  1.5s    | 4.2s  7.8s    | 2.1s  1.5s
 0-2     | 6.2s  1.8s    | 2.0s  1.4s    | 3.5s  2.0s    | 1.4s  1.4s
 0-3     | 6.2s  1.8s    | 2.0s  1.4s    | 3.5s  2.0s    | 1.4s  1.4s
 0-5     | 6.5s  1.8s    | 2.0s  1.4s    | 3.5s  2.0s    | 1.4s  1.4s
My CPU is an Intel Core i7-9750H. 'core0' and 'core1' are logical cores of the same physical core, as are 'core2' and 'core3', and so on. The compiler was MSVC 14.24.
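To double-check that logical-to-physical mapping, something along these lines can enumerate which logical processors share a physical core via GetLogicalProcessorInformationEx (a sketch of my own, not part of the measurements):

#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    DWORD len = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &len);
    std::vector<char> buf(len);
    auto* base = buf.data();
    if (!GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(base), &len))
        return 1;

    int core = 0;
    for (DWORD off = 0; off < len; ) {
        auto* e = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(base + off);
        // One entry per physical core; the mask bits are its logical processors.
        std::printf("physical core %d: logical CPU mask = 0x%llx%s\n",
                    core++,
                    static_cast<unsigned long long>(e->Processor.GroupMask[0].Mask),
                    (e->Processor.Flags & LTP_PC_SMT) ? " (SMT)" : "");
        off += e->Size;
    }
}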
The times recorded are approximately the best scores over several runs, since there were tons of background tasks. I think this is fair enough, because the results fall into clearly separated groups and a 0.1s~0.3s error does not affect that grouping.
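A harness along these lines (my own sketch; best_of and the run count are illustrative, not what was actually used) would automate that best-of-several-runs measurement:

#include <algorithm>
#include <chrono>
#include <cstdio>

// Run a test several times and keep the lowest wall time, to reduce the
// influence of background tasks.
template <class Fn>
double best_of(int runs, Fn fn) {
    double best = 1e300;
    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto end = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(end - start).count());
    }
    return best;
}

// Hypothetical usage: std::printf("%.2f s\n", best_of(5, [&] { test1(a, b); }));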
Test2 is easy to explain. Since x and y are in different cache lines, running on two physical cores gives roughly a 2x speedup (the cost of context switching when running both threads on a single core is negligible here). Running on one core with SMT is less efficient than two physical cores, since it is limited by the throughput of Coffee Lake (I believe Ryzen would do slightly better), but it is still more efficient than temporal multithreading. It also seems that 64-bit mode is more efficient here.
But the results of test1 confuse me. First, in debug mode, 0-2, 0-3, and 0-5 are slower than 0-0, which makes sense: I explained it by the data being moved back and forth between L1 and L3 repeatedly, because the caches must stay coherent between the two cores, whereas it always stays in L1 when running on a single core. But this theory conflicts with the fact that the 0-1 pair is always the slowest: technically, those two threads share the same L1 cache, so 0-1 should run twice as fast as 0-0.
Second, in release mode, 0-2, 0-3, and 0-5 were faster than 0-0, which disproves the theory above.
Last, 0-1 runs slower in release mode than in debug mode in both the 64-bit and the 32-bit builds. That is what I understand least. I read the generated assembly code and did not find anything helpful.
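If it helps, the inner loop reduces to a standalone kernel like the one below (my reduction; `kernel` is just an illustrative name). Compiling it with cl /Od /FA and cl /O2 /FA gives assembly listings that can be compared directly:

#include <cstdint>

// Same work as func() without the affinity call: `volatile` forces a load,
// an add, and a store of *addr on every iteration in both /Od and /O2 builds.
void kernel(volatile uint64_t* addr, uint64_t b, uint64_t n) {
    for (uint64_t i = 0; i < n; ++i) {
        *addr += b;
    }
}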