I've written a naive and an "optimized" transpose function for order-3 tensors containing double-precision complex numbers, and I would like to analyze their performance.
Approximate code for the naive transpose function:
#pragma omp for schedule(static)
for (auto i2 = std::size_t{}; i2 < n2; ++i2)
{
    for (auto i1 = std::size_t{}; i1 < n1; ++i1)
    {
        for (auto i3 = std::size_t{}; i3 < n3; ++i3)
        {
            tens_tr(i3, i2, i1) = tens(i1, i2, i3);
        }
    }
}
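For context, I assume the usual row-major layout in which the last index is the contiguous one; a minimal sketch of such an accessor (element here is a hypothetical helper, not the actual tensor class):
#include <complex>
#include <cstddef>

// Sketch only: row-major order-3 indexing where i3 is the fastest-varying index.
// Under this layout the naive loop reads tens contiguously but writes tens_tr with
// a stride of n2 * n1 complex elements in the innermost loop.
inline std::complex<double>& element(std::complex<double>* data,
                                     std::size_t n2, std::size_t n3,
                                     std::size_t i1, std::size_t i2, std::size_t i3)
{
    return data[(i1 * n2 + i2) * n3 + i3];
}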
Approximate code for the optimized transpose function (remainder loops not shown; assume the dimensions are divisible by the block size):
#pragma omp for schedule(static)
for (auto i2 = std::size_t{}; i2 < n2; ++i2)
{
    // blocked loops over the first and third indices
    for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
    {
        for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
        {
            // read a block_size x block_size tile of tens into the scratch buffer
            // (the tile is stored transposed in the buffer)
            for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
            {
                for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                {
                    cache_buffer[i3 * block_size + i1] = tens(bi1 + i1, i2, bi3 + i3);
                }
            }
            // write the tile from the scratch buffer into tens_tr,
            // with the inner loop running over its last index
            for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
            {
                for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
                {
                    tens_tr(bi3 + i1, i2, bi1 + i3) = cache_buffer[i1 * block_size + i3];
                }
            }
        }
    }
}
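The per-thread scratch buffer is not shown above; here is a minimal sketch of a setup that matches the loops (the block_size value and the use of std::vector are illustrative assumptions, not the tuned values from the benchmark):
#include <complex>
#include <cstddef>
#include <vector>

// Sketch only: block_size = 16 is an illustrative choice, giving a
// 16 * 16 * 16 B = 4 KiB tile of complex doubles per thread.
constexpr auto block_size = std::size_t{16};

void transpose_blocked(std::size_t n2 /* , tens, tens_tr, n1, n3 as above */)
{
    #pragma omp parallel
    {
        // one scratch tile per thread, so the blocked copies above never share a buffer
        auto cache_buffer = std::vector<std::complex<double>>(block_size * block_size);

        #pragma omp for schedule(static)
        for (auto i2 = std::size_t{}; i2 < n2; ++i2)
        {
            // blocked loops over bi1 / bi3 from the snippet above go here
            (void)cache_buffer; // placeholder so this sketch compiles
        }
    }
}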
Assumption: I decided to use a streaming function as a reference because I reasoned that a perfect implementation of the transpose would closely resemble a bandwidth-saturating streaming kernel. For this purpose, I chose a DAXPY loop as the first reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
    auto* slice_a = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
    auto* slice_b = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
    const auto slice_size = 2 * tens_a_->get_slice_size(); // 2 doubles for a complex
    #pragma omp simd safelen(8)
    for (auto index = std::size_t{}; index < slice_size; ++index)
    {
        slice_b[index] += lambda_ * slice_a[index]; // fp_count: 2, traffic: 2+1
    }
}
Also, I used a simple copy kernel as a second reference.
#pragma omp parallel for schedule(static)
for (auto i1 = std::size_t{}; i1 < tens_a_->get_n1(); ++i1)
{
    const auto* op1_begin = reinterpret_cast<double*>(tens_a_->get_slice_data(i1));
    const auto* op1_end = op1_begin + 2 * tens_a_->get_slice_size(); // 2 doubles in a complex
    auto* op2_iter = reinterpret_cast<double*>(tens_b_->get_slice_data(i1));
    #pragma omp simd safelen(8)
    for (const auto* iter = op1_begin; iter != op1_end; ++iter, ++op2_iter)
    {
        *op2_iter = *iter;
    }
}
Hardware:
- Intel(R) Xeon(R) Platinum 8168 (Skylake) with 24 cores @ 2.70 GHz; L1d, L2, and L3 caches of 32 KiB (per core), 1 MiB (per core), and 33 MiB (shared), respectively.
- 48 GiB of memory @ 2666 MHz. Intel Advisor's Roofline view reports a memory bandwidth of 115 GB/s.
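- Back-of-the-envelope check (my assumption: a single socket with all six DDR4-2666 channels populated): 6 channels × 2.666 GT/s × 8 B ≈ 128 GB/s theoretical peak, so Advisor's 115 GB/s is roughly 90 % of that.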
Benchmarking: 20 warm-up runs followed by 100 timed experiments, each on newly allocated data that is "touched" beforehand so that page faults are not measured.
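The reported bandwidth figures are derived from the element count and the wall time; a minimal sketch of that accounting, assuming one read and one write per complex element as in the transpose and copy kernels (an illustration, not the exact harness code):
#include <complex>
#include <cstddef>

// Sketch only: GB/s from problem size and wall time, assuming each complex element
// is read once and written once (2 * 16 B of traffic per element).
double bandwidth_gb_per_s(std::size_t n1, std::size_t n2, std::size_t n3, double seconds)
{
    const auto bytes_moved = 2.0 * static_cast<double>(n1 * n2 * n3) * sizeof(std::complex<double>);
    return bytes_moved / (seconds * 1.0e9);
}
For the DAXPY kernel the per-element factor would instead be 3 × 8 B per double (two reads and one write), matching the traffic comment in the code above.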
Compiler and flags:
Intel compiler from oneAPI 2022.1.0, optimization flags: -O3;-ffast-math;-march=native;-qopt-zmm-usage=high
Results (sizes assumed to be adequately large):

Using 24 threads pinned on 24 cores (total size of both tensors ~10 GiB):
- DAXPY: 102 GB/s
- Copy: 101 GB/s
- naive transpose: 91 GB/s
- optimized transpose: 93 GB/s

Using 1 thread pinned on a single core (total size of both tensors ~10 GiB):
- DAXPY: 20 GB/s
- Copy: 20 GB/s
- naive transpose: 9.3 GB/s
- optimized transpose: 9.3 GB/s
Questions:
- Why is my naive transpose function performing so well?
- Why is the difference in performance between the reference kernels and the transpose functions so much larger when using only 1 thread?
I'd be glad to receive any kind of input on either of the above questions, and I will gladly provide additional information when required. Unfortunately, I cannot provide a minimal reproducer because of the size and complexity of each benchmark program. Thank you very much for your time and help in advance!
Updates:
- Could it be that the Intel compiler performed loop blocking on the naive transpose function as an optimization?