
I'm new to the world of intrinsics, and I got here because I saw them as a way to achieve transparent code generation, i.e. what you see is what you get, as well as reproducibility: for a system supporting e.g. AVX2, I know I'll end up with the same instructions in the end, given that I use AVX2 intrinsics. This is an important step towards writing HPC libraries that make use of SIMD. Feel free to correct my way of thinking.

Now, I have implemented a 3D vector dot-product function in three variants in a micro-benchmarking setting. The code was compiled with the GNU compiler v11.1.0 and run on a machine with an Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz and 32 GiB of DDR4 RAM. The single-thread read-write memory bandwidth of this system was measured at ~34 GiB/s by running a DAXPY benchmark.
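
For reference, the DAXPY kernel behind that bandwidth figure is just a streaming loop of the following shape (a sketch of the idea, not my exact benchmark code):

#include <cstddef>

// Sketch of a DAXPY sweep (y = a*x + y) used to estimate sustainable read-write
// bandwidth: per element it reads x[i] and y[i] and writes y[i] back, so one pass
// moves roughly 3 * n * sizeof(double) bytes.
void daxpy(double a, const double* x, double* y, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        y[i] = a * x[i] + y[i];
    }
}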

First, let me present the elementary structures.

template<class Treal_t>
struct vector3
{
    Treal_t data[3] = {};
    inline Treal_t& operator()(const std::size_t& index) { return data[index]; }
    inline const Treal_t& operator()(const std::size_t& index) const { return data[index]; }
    inline Treal_t l2_norm_sq() const { return data[0] * data[0] + data[1] * data[1] + data[2] * data[2]; }
};

// strictly speaking, the following is a class of its own that implements a subset of
// the functionality of the std::vector. The motivation is to be able to allocate memory
// without "touching" the data, a requirement that is crucial for "cold" microbenchmarking.
template<class Treal_t>
using vector3_array = std::vector<vector3<Treal_t>>;
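
Purely for illustration (this is not the actual class from my code), the idea can be sketched as raw, aligned storage that skips value-initialization, so the freshly allocated pages remain untouched until the benchmark kernel or an explicit first-touch pass writes to them:

#include <cstddef>
#include <new>

// Illustrative sketch only: raw 64-byte-aligned storage exposed as an array of T,
// without constructing (and hence without zeroing) the elements. Intended for
// trivially copyable element types such as the vector3 aggregate above.
template<class T>
class uninitialized_array
{
public:
    explicit uninitialized_array(std::size_t count)
        : data_(static_cast<T*>(::operator new[](count * sizeof(T), std::align_val_t{64}))), size_(count) {}
    ~uninitialized_array() { ::operator delete[](data_, std::align_val_t{64}); }
    uninitialized_array(const uninitialized_array&) = delete;
    uninitialized_array& operator=(const uninitialized_array&) = delete;

    T& operator[](std::size_t index) noexcept { return data_[index]; }
    const T& operator[](std::size_t index) const noexcept { return data_[index]; }
    T* data() noexcept { return data_; }
    std::size_t size() const noexcept { return size_; }

private:
    T* data_ = nullptr;
    std::size_t size_ = 0;
};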

The first is my scalar code. I'm compiling it with the flag "-O0".

void dot_product_novec(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    static constexpr auto inc = 6;
    static constexpr auto dot_products_per_inc = inc / 3;
    const auto stream_size_div = varray.size() * 3 / inc * inc;

    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    auto dot_product_index = std::size_t{};
    for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
    {
        dot_products[dot_product_index] = float_stream[index] * float_stream[index] + float_stream[index + 1] * float_stream[index + 1]
            + float_stream[index + 2] * float_stream[index + 2];
        dot_products[dot_product_index + 1] = float_stream[index + 3] * float_stream[index + 3]
            + float_stream[index + 4] * float_stream[index + 4] + float_stream[index + 5] * float_stream[index + 5];
    }
    for (auto index = dot_product_index; index < varray.size(); ++index)
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}

Next up is my auto-vectorized loop, where I ask the compiler to vectorize via the OpenMP 4.0 simd directive. Compiled with the flags "-O3;-ffast-math;-march=native;-fopenmp".

void dot_product_auto(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    #pragma omp simd safelen(16)
    for (auto index = std::size_t{}; index < varray.size(); ++index)
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}

Finally, here's my version, vectorized by hand using intrinsics. Compiled with "-O3;-ffast-math;-march=native;-mfma;-mavx2".

void dot_product(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    static constexpr auto inc = 6;
    static constexpr auto dot_products_per_inc = inc / 3;
    const auto stream_size_div = varray.size() * 3 / inc * inc;
    
    const auto* float_stream = reinterpret_cast<const float*>(&varray[0](0));
    auto dot_product_index = std::size_t{};
    
    // load only the first 6 floats (two packed 3D vectors); the remaining two elements are zeroed
    static const auto load_mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, -1, 0, 0);
    // place each 3D vector in its own 128-bit lane, padding the 4th element of each lane with a zero
    static const auto permute_mask0 = _mm256_setr_epi32(0, 1, 2, 7, 3, 4, 5, 6);
    // gather the two per-lane sums (elements 0 and 4) into elements 0 and 1
    static const auto permute_mask1 = _mm256_set_epi32(0, 0, 0, 0, 0, 0, 4, 0);
    // store only the two computed dot products
    static const auto store_mask = _mm256_set_epi32(0, 0, 0, 0, 0, 0, -1, -1);
    
    for (auto index = std::size_t{}; index < stream_size_div; index += inc, dot_product_index += dot_products_per_inc)
    {
        // 1. load and permute the vectors
        const auto point_packed = _mm256_maskload_ps(float_stream + index, load_mask);
        const auto point_permuted_packed = _mm256_permutevar8x32_ps(point_packed, permute_mask0);
        
        // 2. do a multiply
        const auto point_permuted_elementwise_sq_packed = _mm256_mul_ps(point_permuted_packed, point_permuted_packed);
        
        // 3. do 2 horizontal additions
        const auto hadd1 = _mm256_hadd_ps(point_permuted_elementwise_sq_packed, point_permuted_elementwise_sq_packed);
        const auto hadd2 = _mm256_hadd_ps(hadd1, hadd1);
        
        // 4. permute to target position
        const auto result_packed = _mm256_permutevar8x32_ps(hadd2, permute_mask1);
        
        // 5. store
        _mm256_maskstore_ps(&dot_products[dot_product_index], store_mask, result_packed);
    }
    for (auto index = dot_product_index; index < varray.size(); ++index) // no opt for remainder loop
    {
        dot_products[index] = varray[index].l2_norm_sq();
    }
}

I've tested the code, so I know it works.

Now, brief details about the microbenchmarking:

  1. I use a small library which I've written for this purpose: https://gitlab.com/anxiousprogrammer/tixl.
  2. 20 warm-up runs, 100 timed runs.
  3. fresh allocations in each run for cold microbenchmarking; a first touch of the test data (zeroing the first datum in each memory page, sketched just after this list) prevents page faults from being measured.
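
Concretely, the first touch mentioned in item 3 amounts to something like this (a sketch; 4 KiB pages are assumed here):

#include <cstddef>

// Sketch of the first-touch pass: write one element per (assumed 4 KiB) page so that
// the OS maps the pages before the timed region, keeping page faults out of the measurement.
void first_touch(float* data, std::size_t count)
{
    constexpr std::size_t page_bytes = 4096; // assumption: 4 KiB pages
    constexpr std::size_t floats_per_page = page_bytes / sizeof(float);
    for (std::size_t index = 0; index < count; index += floats_per_page)
    {
        data[index] = 0.0f;
    }
}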

I'm modelling the dot product as follows: 5 * size FLOPs against 5 * size * sizeof(float) bytes transferred, i.e. a code balance of 4 bytes/FLOP or a computational intensity of 0.25 FLOP/byte. Using this model, here are the performance results in terms of effective bandwidth (the conversion is sketched after the list):

  1. no-vec: 18.6 GB/s
  2. auto-vec: 21.3 GB/s
  3. intrinsic-vec: 16.4 GB/s
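
To spell out how the run times map to the numbers above (my own bookkeeping; I count the fifth float per vector as the write-allocate read of the output cache line):

#include <cstddef>

// Sketch of the effective-bandwidth bookkeeping: per vector3, 3 floats are read,
// 1 float is written, and 1 float is charged for write-allocate traffic on the
// output line, i.e. 5 * sizeof(float) = 20 bytes against 5 FLOPs (code balance 4 B/FLOP).
double effective_bandwidth_gb_per_s(std::size_t vector_count, double seconds)
{
    const double bytes = 5.0 * static_cast<double>(vector_count) * sizeof(float);
    return bytes / seconds * 1.0e-9; // decimal GB/s
}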

Questions:

  1. Is my motivation (mentioned in paragraph 1) a sensible one?
  2. Why is my version slower than the scalar code?
  3. Why are they all far from the peak read-write BW of 34 GiB/s?

Please excuse the lack of a minimal reproducer; the amount of code would be too much. Thanks a lot for your thoughts and inputs.

Nitin Malapally
  • `_mm256_hadd_ps` inside the inner-most loop = you're almost certainly doing it wrong, bottlenecking on shuffle throughput of 1/clock on your Skylake-derived microarchitecture. Each `hadd` costs 2 shuffle uops, and `_mm256_permutevar8x32_ps` is another. See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) for one dot-product. – Peter Cordes Feb 09 '23 at 09:39
  • For two dot-products within 128-bit halves of a vector, you'd want to adapt those manual shuffles. You might want to do a 32-byte load that splits a pair of float3 vectors down the middle so you can use in-lane shuffles on them. Stop short of the end of the array if necessary to avoid over-read of the last element without using maskload inside each iteration, unless it's cheaper to just get the 4th element zeroed and do a normal dot-product? Probably not, should be shuffle / add / shuffle / add still, but you're shuffling the original product twice instead of a serial dep chain. – Peter Cordes Feb 09 '23 at 09:42
  • 1. You need to enable optimizations for the vectorized code, i.e. add -O3. 2. See if you can avoid `_mm256_maskload_ps` and load full vectors instead. Either one full load and ignore or zero the excess elements or 3 full loads and unroll the loop 3x. – Andrey Semashev Feb 09 '23 at 10:10
  • 3. You should generally widen your loop so that it performs full vector stores instead of `_mm256_maskstore_ps` of only two elements. That is, you should load full 3 vectors, transpose horizontal 3-float tuples to vertical, perform vertical accumulation and store a full vector of results. Bottom line is for best performance your vectors should always be fully utilized with useful data. – Andrey Semashev Feb 09 '23 at 10:15
  • @PeterCordes Why do you suggest that in-lane shuffles are cheaper than `_mm256_permutevar8x32_ps`? According to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA&ig_expand=3143,5428,6670 for Skylake (assuming the same applies to my Coffee Lake), we seem to have the same latency/throughput figures for both. The same should apply for `_mm256_loadu_ps` and `_mm256_maskload_ps`, where the latency differs by 1 but the throughput is the same. – Nitin Malapally Feb 09 '23 at 10:39
  • @AndreySemashev (also as suggested by @PeterCordes) How do you suggest I zero excess elements? Both your inputs have been very helpful. – Nitin Malapally Feb 09 '23 at 10:44
  • If you have an array of vector3f stored consecutively, you can first transpose 8 of them to 3 vectors `[x0, x1, x2, ...]; [y0, y1, y2, ...]; [z0, ..., z7]` (e.g., based on this https://www.intel.com/content/dam/develop/external/us/en/documents/normvec-181650.pdf) and then do SoA operations. If the idea is unclear I can maybe write an answer later. – chtz Feb 09 '23 at 10:48
  • @NitinMalapally: Fair point, on Intel in this case there's no direct benefit to avoiding lane-crossing shuffles, except you can maybe avoid needing a vector constant. Or not since you'd need different shuffles in the two halves. Ice Lake has 2/clock throughput for `vshufps` vs. 1/clock for `vpermilps` or `vpermps`. Zen 1 has slower `vpermps` (3 uops) because it doesn't decompose into two 128-bit uops. Anyway yes, OoO exec can hide the extra latency since it's not part of a loop-carried dep chain, and you only need dword granularity so AVX2 `vpermps` has a lane-crossing shuffle available. – Peter Cordes Feb 09 '23 at 17:23
  • And BTW, other commenters are correct that a much better strategy would compute more dot products at once, e.g. transposing. Since you have packed 3-element vectors, you can't just feed 2 different vectors to `vhaddps` (and maybe wouldn't want to if you care about AMD). Also, `maskstore` is very expensive on AMD, and not free on Intel, so if you did want to store the low two floats, use `vmovlps` (`_mm256_castps256_ps128` for the `movlps` intrinsic.) – Peter Cordes Feb 09 '23 at 17:50
  • Also as Andrey says, you *need* to enable optimization for any benchmarking to be meaningful. `-O0` absolutely destroys the performance of intrinsics, since they're mostly functions not macros. *[SSE intrinsics without compiler optimization](https://stackoverflow.com/q/37223499)*. Also, aiming for DRAM bandwidth is usually a pretty low bar. With cache blocking, hopefully you can get hits in L3 or even L2, giving you much higher bandwidth, especially if other threads are active and would compete for DRAM. Benchmark repeated passes over a smallish buffer, as well as with a large buffer. – Peter Cordes Feb 09 '23 at 18:34
  • @chtz If you could formulate an answer with that link, we could close this thread. – Nitin Malapally Feb 10 '23 at 08:52
  • @PeterCordes I aimed at the DRAM bandwidth because the vector dot product shows a streaming access pattern, and there are no inner loops. So I would be happy if the code could come close to the DRAM bandwidth -- which it surprisingly doesn't! – Nitin Malapally Feb 10 '23 at 08:54
  • I'm not surprised *this* code doesn't, especially compiled without optimization. I'm surprised it's not slower with *just* `gcc -mavx2 -mfma`, with the default `-O0`. A bit disappointing GCC with OpenMP doesn't auto-vectorize to something faster; barely beating unoptimized scalar code is weird. – Peter Cordes Feb 10 '23 at 09:05
  • You should edit the question to re-test with optimization enabled for all 3 cases, otherwise it's mostly a duplicate of [Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled](//stackoverflow.com/q/27075540) / [SSE intrinsics without compiler optimization](//stackoverflow.com/q/37223499) - unoptimized code is always slow, and intrinsics tend to have an even larger percentage slowdown. This may mean rewriting a lot of the text since the numbers probably change dramatically, but that's good because as of now the only interesting part is how to manually vectorize. – Peter Cordes Feb 10 '23 at 09:09
  • @PeterCordes I've now updated the info. Nothing changed. – Nitin Malapally Feb 10 '23 at 09:33
  • Your optimized intrinsics are still slower than *unoptimized* `-O0` scalar? Ouch. Have you tried compiling that version with optimization, as a more interesting baseline? It does enough in one big expression that GCC may manage to make non-terrible asm even at `-O0`, but you're certainly not doing it any favours spending some of your load/store execution unit throughput on store/reload of locals on the stack. – Peter Cordes Feb 10 '23 at 09:35
  • @PeterCordes Nothing changed for the no-vec variant after I switched from "-O0;-fno-tree-vectorize" to "-O3". This is alarming. – Nitin Malapally Feb 10 '23 at 09:50
  • Maybe it really is bottlenecked on some kind of bandwidth limit after all? Read+write can be lower aggregate than read-only or write-only, but that does seem too low if you aren't getting TLB misses and page faults. So I'd be very suspicious of your benchmarking code and build methodology if `-O3` isn't faster than `-O0`. (`-ftree-vectorize` is only on by default at `-O3`, or in GCC12 at `-O2` with a conservative cost model, only if it's "very cheap". It's super weird to see an explicit "no" option along with `-O0`, which implies no optimization or register usage between statements.) – Peter Cordes Feb 10 '23 at 10:22
  • @PeterCordes The perils of cold benchmarking. Turns out there was a bug in the laying of the first touch. I was measuring page-faults. Please refer to new bandwidth figures. – Nitin Malapally Feb 10 '23 at 13:23
  • @PeterCordes It seems that in the case of the no-vec variant, through my loop unrolling, I'd manually implemented optimization approximately on the level of "-O3". I got rid of the manual unrolling, and the performance dropped to less than that of the auto-vec variant, ~14 GiB/s. – Nitin Malapally Feb 10 '23 at 13:42
  • Is the "no vec" variant still getting built with `-O0`? Stop wasting our time with that crap already. Use at least `-O2 -fno-tree-vectorize` as a scalar baseline, perhaps without any `-march=native` if you don't want it to be able to use scalar FMA. It's normal and good to unroll a bit in the source if you aren't expecting auto-vectorization. – Peter Cordes Feb 10 '23 at 16:18

1 Answer


Your manually-vectorized code is not particularly efficient.
Try to benchmark the following 2 versions instead.

This one is simpler, and only requires the SSE 4.1 instruction set.

inline __m128 loadFloat3( const float* rsi )
{
    __m128 xy = _mm_castpd_ps( _mm_load_sd( (const double*)rsi ) );
    // Compilers should merge following 2 lines into single INSERTPS with a memory operand
    __m128 z = _mm_load_ss( rsi + 2 );
    return _mm_insert_ps( xy, z, 0x20 );
}
// Simple version which uses DPPS instruction from SSE 4.1 set
void dotProductsSimple( float* rdi, size_t length, const float* rsi )
{
    const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
    const float* const rsiEnd = rsi + length * 3;
    for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
    {
        // Load complete 16 byte vector, discard the W
        __m128 v = _mm_loadu_ps( rsi );
        // 0b01110001: multiply elements 0..2, write the sum to element 0 (other elements zeroed)
        v = _mm_dp_ps( v, v, 0b01110001 );
        _mm_store_ss( rdi, v );
    }
    if( rsi < rsiEnd )
    {
        // For the last vector, load exactly 12 bytes.
        // Avoids potential crash when loading from out of bounds
        __m128 v = loadFloat3( rsi );
        v = _mm_dp_ps( v, v, 0b01110001 );
        _mm_store_ss( rdi, v );
    }
}

This one is more complicated and requires AVX1 support. It's probably going to be slightly faster on most processors.

void dotProductTransposed( float* rdi, size_t length, const float* rsi )
{
    constexpr size_t maskAlign8 = ~(size_t)7;
    const float* const rsiEndAligned = rsi + ( length & maskAlign8 ) * 3;
    const float* const rsiEndMinusOne = rsi + ( (ptrdiff_t)length - 1 ) * 3;
    const float* const rsiEnd = rsi + length * 3;

    while( rsi < rsiEndAligned )
    {
        // Load lower halves
        __m256 m03, m14, m25;
        m03 = _mm256_castps128_ps256( _mm_loadu_ps( rsi ) );
        m14 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 4 ) );
        m25 = _mm256_castps128_ps256( _mm_loadu_ps( rsi + 8 ) );
        // Load upper halves; VINSERTF128 supports memory operand for the second argument.
        m03 = _mm256_insertf128_ps( m03, _mm_loadu_ps( rsi + 12 ), 1 );
        m14 = _mm256_insertf128_ps( m14, _mm_loadu_ps( rsi + 16 ), 1 );
        m25 = _mm256_insertf128_ps( m25, _mm_loadu_ps( rsi + 20 ), 1 );
        rsi += 24;

        // Transpose these SIMD vectors
        __m256 xy = _mm256_shuffle_ps( m14, m25, _MM_SHUFFLE( 2, 1, 3, 2 ) );
        __m256 yz = _mm256_shuffle_ps( m03, m14, _MM_SHUFFLE( 1, 0, 2, 1 ) );
        __m256 x = _mm256_shuffle_ps( m03, xy, _MM_SHUFFLE( 2, 0, 3, 0 ) );
        __m256 y = _mm256_shuffle_ps( yz, xy, _MM_SHUFFLE( 3, 1, 2, 0 ) );
        __m256 z = _mm256_shuffle_ps( yz, m25, _MM_SHUFFLE( 3, 0, 3, 1 ) );

        // Now we have 3 SIMD vectors with gathered x/y/z fields of 8 source 3D vectors
        // Compute squares
        x = _mm256_mul_ps( x, x );
        y = _mm256_mul_ps( y, y );
        z = _mm256_mul_ps( z, z );
        // Add squares
        x = _mm256_add_ps( x, y );
        x = _mm256_add_ps( x, z );
        // Store 8 values
        _mm256_storeu_ps( rdi, x );
        rdi += 8;
    }

    // Handle the remainder
    for( ; rsi < rsiEndMinusOne; rsi += 3, rdi++ )
    {
        __m128 v = _mm_loadu_ps( rsi );
        v = _mm_dp_ps( v, v, 0b01110001 );
        _mm_store_ss( rdi, v );
    }
    if( rsi < rsiEnd )
    {
        __m128 v = loadFloat3( rsi );
        v = _mm_dp_ps( v, v, 0b01110001 );
        _mm_store_ss( rdi, v );
    }
}
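
For completeness, here's a sketch of how either routine could be called with the question's `vector3_array<float>` layout (assuming `dot_products` is already sized to `varray.size()`):

void compute_dot_products(const vector3_array<float>& varray, std::vector<float>& dot_products)
{
    // The vector3 elements are contiguous float triples, so the array can be
    // reinterpreted as a flat float stream of length 3 * varray.size().
    const float* src = reinterpret_cast<const float*>(varray.data());
    dotProductTransposed(dot_products.data(), varray.size(), src);
    // or: dotProductsSimple(dot_products.data(), varray.size(), src);
}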
Soonts
  • After the first MULPS, the next two can be contracted into FMAs if you write it that way or use `-ffp-contract=fast` to let compilers contract across C++ statements. – Peter Cordes Feb 11 '23 at 20:08
  • @PeterCordes I have an AMD CPU here, the performance tradeoffs are different. On AMD, additions and multiplications don’t compete for EU, they run in parallel on different ones. Another thing is numerical precision. I think I have tested that, and I found vdpps an equivalent of ((x1*x2)+(y1*y2))+(z1*z2), in that order, with all intermediate values being FP32. When merging these instructions to FMA, you gonna have slightly different results between the main AVX-vectorized portion of the input, and the remainder. In many cases that’s fine, in some cases can be an issue. – Soonts Feb 11 '23 at 22:53
  • FMA saves front-end uops on any CPU. And if those uops compete for any of the same ports as shuffles or `vinsertf128`, then overall back-end port pressure. With 6 loads, FP execution-unit throughput alone isn't going to be the bottleneck. Fair point that it's different from your one-at-a-time fallback, but that's not gaining a huge amount over pure scalar, especially if scalar could use FMA instructions. (Especially as cleanup code, where mostly front-end uop count matters, as OoO exec will overlap it with later different code, unlike a scalar loop over the whole array) – Peter Cordes Feb 11 '23 at 23:07
  • Since `dpps` allows selecting which input values are used, you could do a regular unaligned load (assuming it is safe to load the previous or next 32 bits). – chtz Feb 12 '23 at 16:32
  • @chtz Good idea, and no side effects when done correctly. Updated. – Soonts Feb 14 '23 at 11:28