What's the problem?
I am benchmarking the following code, where T is int:

for (T& x : v) x = x + x;

When compiling with -mavx2, performance fluctuates by roughly 2x depending on some conditions. This does not reproduce with -msse4.2. I would like to understand what is happening.
How does the benchmark work?
I am using Google Benchmark; it keeps spinning the loop until it is confident about the measured time. The main benchmarking code:
#include <benchmark/benchmark.h>

#include <cstddef>
#include <vector>

// Assumed definitions (the post does not show these macros):
#define NOINLINE __attribute__((noinline))
#define INLINE inline __attribute__((always_inline))

using T = int;
constexpr std::size_t size = 10'000 / sizeof(T);

NOINLINE std::vector<T> const& data()
{
    static std::vector<T> res(size, T{2});
    return res;
}

INLINE void double_elements_bench(benchmark::State& state)
{
    auto v = data();  // copies the vector: each instance works on its own heap buffer
    for (auto _ : state) {
        for (T& x : v) x = x + x;
        benchmark::DoNotOptimize(v.data());
    }
}
I then call double_elements_bench from multiple instances of a benchmark driver; roughly, it registers N identical copies of the benchmark (see the sketch below).
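The exact driver is not shown here; a minimal sketch of what "duplicated N times" means (illustrative only, assuming the code above):

static void double_elements_0(benchmark::State& state) { double_elements_bench(state); }
static void double_elements_1(benchmark::State& state) { double_elements_bench(state); }
static void double_elements_2(benchmark::State& state) { double_elements_bench(state); }

// Each wrapper ends up with its own inlined copy of the benchmark body,
// so the hot loop is duplicated N times in the binary.
BENCHMARK(double_elements_0);
BENCHMARK(double_elements_1);
BENCHMARK(double_elements_2);

BENCHMARK_MAIN();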
Machine, Compiler, Options
- processor: Intel i7-9700K
- compiler: clang ~14, built from trunk.
- options:
-mavx2 --std=c++20 --stdlib=libc++ -DNDEBUG -g -Werror -Wall -Wextra -Wpedantic -Wno-deprecated-copy -O3
I also tried aligning all functions to 128 bytes; it had no effect.
Results
With the benchmark duplicated 2 times I get:
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 105 ns 105 ns 6617708
double_elements_1 105 ns 105 ns 6664185
Versus duplicated 3 times:
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 64.6 ns 64.6 ns 10867663
double_elements_1 64.5 ns 64.5 ns 10855206
double_elements_2 64.5 ns 64.5 ns 10868602
This reproduces on bigger data sizes too.
Perf stats
I looked at counters that I know can be relevant to code alignment: the LSD (loop stream detector, which is off on my machine due to a security issue a few years back), the DSB (decoded uop cache) and the branch predictor:
LSD.UOPS,idq.dsb_uops,UOPS_ISSUED.ANY,branches,branch-misses
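For reference, the counters were collected with an invocation along these lines (reconstructed from the event list and binary name, not copied from my shell history):

perf stat -e LSD.UOPS,idq.dsb_uops,UOPS_ISSUED.ANY,branches,branch-misses ./transform_alignment_issue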
Slow case
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 105 ns 105 ns 6663885
double_elements_1 105 ns 105 ns 6632218
Performance counter stats for './transform_alignment_issue':
0 LSD.UOPS
13,830,353,682 idq.dsb_uops
16,273,127,618 UOPS_ISSUED.ANY
761,742,872 branches
34,107 branch-misses # 0.00% of all branches
1.652348280 seconds time elapsed
1.633691000 seconds user
0.000000000 seconds sys
Fast case
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 64.5 ns 64.5 ns 10861602
double_elements_1 64.5 ns 64.5 ns 10855668
double_elements_2 64.4 ns 64.4 ns 10867987
Performance counter stats for './transform_alignment_issue':
0 LSD.UOPS
32,007,061,910 idq.dsb_uops
37,653,791,549 UOPS_ISSUED.ANY
1,761,491,679 branches
37,165 branch-misses # 0.00% of all branches
2.335982395 seconds time elapsed
2.317019000 seconds user
0.000000000 seconds sys
Both look about the same to me.
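A rough back-of-the-envelope normalization (using the reported iteration counts and 2,500 elements per pass, so only approximate): the slow runs issue ~16.3e9 uops over ~13.3M passes, about 1,220 uops per pass, versus ~37.7e9 uops over ~32.6M passes, about 1,160 per pass, in the fast runs; branches come to ~57 vs ~54 per pass. Per iteration the counter profiles really are nearly identical.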
UPD
I think this might be the alignment of the data returned from malloc: the buffer is at 0x4f2720 in the fast case and 0x8e9310 in the slow case, i.e. 32-byte aligned when fast but only 16-byte aligned when slow.
So, since clang does not align the accesses, we get unaligned 32-byte reads/writes. I tested a transform that does align, and it does not seem to have this variation.
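One way I could check (a hypothetical helper, not part of the original benchmark): log the buffer address modulo 32 from the benchmark and correlate it with the fast/slow runs.

#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper: report whether the vector's buffer is 32-byte aligned,
// i.e. aligned for a full 256-bit AVX2 load/store.
void report_alignment(std::vector<int> const& v)
{
    auto addr = reinterpret_cast<std::uintptr_t>(v.data());
    std::printf("data at %p, addr %% 32 = %u\n",
                static_cast<void const*>(v.data()),
                static_cast<unsigned>(addr % 32));
}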
Is there a way to confirm it?