I'm new to the world of intrinsics, and I got here because I see them as a way to achieve transparent compilation of my code, i.e. what you see is what you get. There's also reproducibility: on a system supporting e.g. AVX2, I know I'll end up with the same instructions in the end, provided I use AVX2 intrinsics. To me this is an important step towards writing HPC libraries that make use of SIMD. Feel free to correct my way of thinking.
Now, I have implemented a 3D vector dot product function in three variants in a micro-benchmarking setting. The code has been compiled with the GNU compiler v11.1.0 and run on a machine with an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz and 32 GiB of DDR4 RAM. The single-thread read-write memory bandwidth of this system has been measured at ~34 GiB/s by running a DAXPY benchmark.
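For reference, the DAXPY kernel behind that bandwidth baseline looks roughly like this (a minimal sketch, not the exact benchmark code):

#include <cstddef>

// DAXPY: y = a * x + y. Per element: 2 loads + 1 store of a double, i.e. 24 bytes of traffic.
void daxpy(const double a, const double* x, double* y, const std::size_t n)
{
    for (auto i = std::size_t{}; i < n; ++i)
    {
        y[i] = a * x[i] + y[i];
    }
}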
First, let me present the elementary structures.
template<class Treal_t>
struct vector3
{
    Treal_t data[3] = {};
    inline Treal_t& operator()(const std::size_t& index) { return data[index]; }
    inline const Treal_t& operator()(const std::size_t& index) const { return data[index]; }
    inline Treal_t l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};
// Strictly speaking, the following is a class of its own which implements a subset of
// std::vector's functionality. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold" microbenchmarking.
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
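To make that concrete, here is a minimal sketch of what such a class could look like; the name untouched_array and its interface are illustrative, not my actual implementation:

#include <cstddef>
#include <new>

template<class Telement_t>
class untouched_array
{
public:
    // Allocates raw storage but never writes to it, so no page of the buffer is touched here.
    explicit untouched_array(const std::size_t size)
        : data_{static_cast<Telement_t*>(::operator new(size * sizeof(Telement_t)))}, size_{size} {}
    ~untouched_array() { ::operator delete(data_); }
    untouched_array(const untouched_array&) = delete;
    untouched_array& operator=(const untouched_array&) = delete;
    Telement_t& operator[](const std::size_t index) { return data_[index]; }
    const Telement_t& operator[](const std::size_t index) const { return data_[index]; }
    std::size_t size() const { return size_; }
private:
    Telement_t* data_; // uninitialized storage; fine for plain aggregate element types such as vector3<float>
    std::size_t size_;
};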
The first is my scalar code. I'm compiling it with the flag "-O0".
void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    static constexpr auto inc = 6;
    static constexpr auto dot_products_per_inc = inc / 3;
    const auto stream_size_div = varray.size() * 3 / inc * inc;
    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    auto dot_product_index = std::size_t{};
    for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
    {
        dot_products[dot_product_index] = float_stream[index] * float_stream[index]
            + float_stream[index + 1] * float_stream[index + 1]
            + float_stream[index + 2] * float_stream[index + 2];
        dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
            + float_stream[index + 4] * float_stream[index + 4]
            + float_stream[index + 5] * float_stream[index + 5];
    }
    for (auto index = dot_product_index; index < varray.size(); ++index) // remainder loop
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}
Next up is my auto-vectorized loop, where I strongly encourage auto-vectorization using the OpenMP 4.0 simd directive. Compiled with the flags "-O3 -ffast-math -march=native -fopenmp".
void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    #pragma omp simd safelen(16)
    for (auto index = std::size_t{}; index < varray.size(); ++index)
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}
Finally, here's my version vectorized by hand using intrinsics. Compiled with the flags "-O3 -ffast-math -march=native -mfma -mavx2".
void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    static constexpr auto inc = 6;
    static constexpr auto dot_products_per_inc = inc / 3;
    const auto stream_size_div = varray.size() * 3 / inc * inc;
    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    auto dot_product_index = std::size_t{};
    static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
    static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
    static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
    static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
    for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
    {
        // 1. load two vectors (6 floats) and permute one vector into each 128-bit lane
        const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
        const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
        // 2. square element-wise
        const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
        // 3. do 2 horizontal additions
        const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
        const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
        // 4. permute the two sums to the target positions (lanes 0 and 1)
        const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
        // 5. store the two dot products
        _mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
    }
    for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}
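As a side note, an alternative that avoids the horizontal adds would be to gather the x, y and z components of 8 vectors into separate registers and use FMA, producing 8 dot products per iteration. A minimal sketch under the same data layout (not benchmarked, and the gathers have a cost of their own):

#include <immintrin.h>

void dot_product_gather_fma(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    static const auto gather_index = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    const auto vec_count_div = varray.size() / 8 * 8;
    auto dot_product_index = std::size_t{};
    for (; dot_product_index < vec_count_div; dot_product_index += 8)
    {
        const auto* base = float_stream + dot_product_index * 3;
        // gather the x, y and z components of 8 consecutive vectors (stride of 3 floats)
        const auto x = _mm256_i32gather_ps(base, gather_index, 4);
        const auto y = _mm256_i32gather_ps(base + 1, gather_index, 4);
        const auto z = _mm256_i32gather_ps(base + 2, gather_index, 4);
        // acc = x*x + y*y + z*z via one multiply and two FMAs
        auto acc = _mm256_mul_ps(x, x);
        acc = _mm256_fmadd_ps(y, y, acc);
        acc = _mm256_fmadd_ps(z, z, acc);
        _mm256_storeu_ps(&dot_products[dot_product_index], acc);
    }
    for (auto index = dot_product_index; index < varray.size(); ++index) // remainder loop
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}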
I've tested the code, so I know it works.
Now, brief details about the microbenchmarking:
- I use a small library which I've written for this purpose: https://gitlab.com/anxiousprogrammer/tixl.
- 20 warm up runs, 100 timed runs.
- fresh allocations in each run for cold microbenchmarking; a first touch (zeroing the first datum of each memory page) of the test data prevents page faults from being measured (see the sketch below).
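By first touch I mean something along these lines (a minimal sketch, assuming 4 KiB pages; the exact constant and signature are illustrative):

#include <cstddef>

// Touch (zero) the first float of every 4 KiB page spanned by the buffer,
// so that the page faults occur here and not inside the timed region.
inline void first_touch(float* data, const std::size_t count)
{
    static constexpr auto floats_per_page = std::size_t{4096} / sizeof(float);
    for (auto index = std::size_t{}; index < count; index += floats_per_page)
    {
        data[index] = 0.f;
    }
}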
I'm modelling the dot product as 5 * size FLOPs performed on 5 * size * sizeof(float) bytes of data traffic, i.e. a code balance of 4 B/FLOP or a computational intensity of 0.25 FLOP/B. Using this model, here are the performance results in terms of effective bandwidth:
- no-vec: 18.6 GB/s
- auto-vec: 21.3 GB/s
- intrinsic-vec: 16.4 GB/s
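For clarity, the effective bandwidth figures above are obtained from the measured run times using the traffic model just described, roughly like this (a sketch; names are illustrative):

#include <cstddef>

// Effective bandwidth in GB/s: 5 floats of traffic per vector3, divided by the measured run time.
double effective_bandwidth_GBps(const std::size_t size, const double seconds)
{
    const auto bytes = 5.0 * static_cast<double>(size) * sizeof(float);
    return bytes / seconds / 1.0e9;
}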
Questions:
- Is my motivation (mentioned in paragraph 1) a sensible one?
- Why is my intrinsics version slower than the scalar code?
- Why are they all far from the peak read-write BW of 34 GiB/s?
Please excuse the lack of a minimal reproducible example; including all of the code would make the post too long. Thanks a lot for your thoughts and input.