
I'd like to compute the norm of a vector stored in a `__m256d` variable.
In order to do so, I implemented the `ymmnorm` function, saving the result in a `__m256d` variable:

__m256d ymmnorm(__m256d const x)
{
    return _mm256_sqrt_pd(ymmdot(x, x));
};

This exploits the dot product function suggested here:

__m256d ymmdot(__m256d const x, __m256d const y)
{
    __m256d xy = _mm256_mul_pd(x, y);
    // horizontal add within each 128-bit lane: [a+b, a+b, c+d, c+d]
    __m256d temp = _mm256_hadd_pd(xy, xy);
    // add the high lane to the low lane: both elements hold a+b+c+d
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d dotproduct = _mm_add_pd(_mm256_castpd256_pd128(temp), hi128);

    // broadcast the 128-bit result to both lanes of a 256-bit vector
    return _mm256_broadcast_pd(&dotproduct);
};

However, I am a newbie in the SIMD/AVX world. Thus, I am wondering: is there a smarter/better method to compute the vector norm of a 256-bit variable?

CaG
    You should avoid `_mm256_broadcast_pd` in this case as it uses memory as the source, which forces an unnecessary store. You can just use `_mm256_insertf128_pd` instead. Alternatively, use `_mm256_permute2f128_pd` to swap 128-bit lanes before the final addition and use 256-bit addition. This will also get rid of the `_mm256_extractf128_pd`. – Andrey Semashev Jul 06 '21 at 11:34
    When you want the horizontal sum broadcast to all elements, do it with shuffles that create that result in the first place. (Like `_mm256_hadd_pd` does, except that's an inefficient first step. Better to just shuffle once manually with `_mm256_shuffle_pd`) – Peter Cordes Jul 06 '21 at 13:18
    Are you sure you need the result in every position or just once? `vsqrtpd ymm,ymm` is on many CPUs twice as slow as `vsqrtpd xmm,xmm` or `vsqrtsd xmm,xmm`. It could even make sense to compute just one sqrt and broadcast the result. – chtz Jul 06 '21 at 13:25
  • @chtz I need the result in every position, because I have to use it in other AVX2 instructions later in the code. – CaG Jul 06 '21 at 13:35
  • @chtz: Oh good point, `_mm256_broadcastsd_pd( _mm_sqrt_sd(hsum) )` would be better throughput on many CPUs. (`vsqrtpd ymm` has the same latency on Skylake / Ice Lake, though.) – Peter Cordes Jul 06 '21 at 17:16
  • @AndreySemashev: `_mm256_broadcastsd_pd` takes a `__m128d` operand; it's the intrinsic for the AVX2 form that takes an XMM source, instead of memory for the AVX1 form. https://www.felixcloutier.com/x86/vbroadcast. Although in practice most compilers would optimize to a register operand, it's still easier and better not to take the address. – Peter Cordes Jul 06 '21 at 17:17
  • So yeah, it's a choice between SQRT throughput vs. an extra lane-crossing shuffle uop costing front-end throughput, and latency (which matters if this is on the critical path of a long dependency chain). – Peter Cordes Jul 06 '21 at 17:24
  • @PeterCordes `_mm256_broadcast_pd`, which I was arguing against, translates to `VBROADCASTF128`, which uses a memory source. That instruction does not have a variant with a register source, and I don't think compilers will be able to deduce from the code that replacing it with `VBROADCASTSD` in this case would be legitimate. So yes, one should avoid `_mm256_broadcast_pd` in this case, and yes, `_mm256_broadcastsd_pd` is one of the possible replacements. – Andrey Semashev Jul 06 '21 at 22:04
    @AndreySemashev: Oh yes, you're right, it's not a different intrinsic for another form of the *same* instruction, it's a different instruction altogether. The two interesting choices are 256-bit the whole way with `vshufpd ymm` / `vperm2f128 ymm` for a low-throughput `vsqrtpd ymm`, or narrow to scalar and then broadcast with `vextractf128` / `vunpckhpd xmm` / `vsqrtsd xmm` / `vbroadcastsd ymm, xmm` (with vaddpd operations between those) – Peter Cordes Jul 07 '21 at 02:58
  • Thank you all for your valuable contribution. @PeterCordes: Thank you, but I still do not get how to substitute `_mm256_hadd_pd` with `_mm256_shuffle_pd`. – CaG Jul 07 '21 at 07:23
    @CaG I think, the idea is to use `_mm256_shuffle_pd` to swap adjacent pairs of `double` elements and then use a vertical `_mm256_add_pd`. Shuffle+add is one cycle less latency than hadd on Ice Lake, two on Skylake. – Andrey Semashev Jul 07 '21 at 08:47
  • Thank you again for the suggestions. As of now, starting from the various combinations you listed, I found that the fastest solution on my computer (equipped with an Intel i7-10700) is having `ymmdot` with `_mm256_shuffle_pd`/`_mm256_insertf128_pd`, while `ymmnorm` returns `_mm256_set1_pd(sqrt(_mm256_cvtsd_f64(fert::ymmdot(x, x))))`. This is a little bit complicated because it goes from 256-bit to double and back to 256-bit, but it avoids `_mm256_sqrt_pd` and AVX-512 instructions. – CaG Jul 07 '21 at 15:40
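
For reference, here is a minimal sketch of the combination described in the comment just above (not CaG's actual code; the function bodies and the 0b0101 immediate are my own reconstruction): `_mm256_shuffle_pd` plus a vertical add and `_mm256_insertf128_pd` for the dot product, and a scalar sqrt broadcast with `_mm256_set1_pd` for the norm.

#include <immintrin.h>
#include <math.h>

__m256d ymmdot(__m256d const x, __m256d const y)
{
    __m256d xy = _mm256_mul_pd(x, y);
    // swap the two doubles within each 128-bit lane and add vertically:
    // [a+b, a+b, c+d, c+d]
    __m256d sums = _mm256_add_pd(xy, _mm256_shuffle_pd(xy, xy, 0b0101));
    // add the high 128-bit lane to the low one: both elements hold a+b+c+d
    __m128d dot = _mm_add_pd(_mm256_castpd256_pd128(sums),
                             _mm256_extractf128_pd(sums, 1));
    // rebuild a 256-bit vector with the dot product in all four elements
    return _mm256_insertf128_pd(_mm256_castpd128_pd256(dot), dot, 1);
}

__m256d ymmnorm(__m256d const x)
{
    // scalar sqrt of element 0, broadcast to every element;
    // avoids the lower-throughput vsqrtpd ymm
    return _mm256_set1_pd(sqrt(_mm256_cvtsd_f64(ymmdot(x, x))));
}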

1 Answer


Assuming you need that exact prototype, I would do it like this:

__m256d ymmnorm( __m256d x )
{
    const __m256d x2 = _mm256_mul_pd( x, x );
    // add the high 128-bit lane to the low one
    __m128d vec16 = _mm_add_pd( _mm256_castpd256_pd128( x2 ), _mm256_extractf128_pd( x2, 1 ) );
    // horizontal sum of the two remaining doubles
    vec16 = _mm_add_sd( vec16, _mm_unpackhi_pd( vec16, vec16 ) );
    // scalar sqrt of the low element only
    vec16 = _mm_sqrt_sd( vec16, vec16 );
    // broadcast the low element to all four lanes
    return _mm256_broadcastsd_pd( vec16 );
};

Here’s an alternative, but I’d expect the first one to be slightly faster on most processors.

__m256d ymmnorm( __m256d x )
{
    __m256d x2 = _mm256_mul_pd( x, x );
    // swap the two 128-bit lanes and add: [a+c, b+d, a+c, b+d]
    __m256d tmp = _mm256_permute4x64_pd( x2, _MM_SHUFFLE( 1, 0, 3, 2 ) );
    x2 = _mm256_add_pd( x2, tmp );
    // swap the doubles within each lane and add: the full sum in every element
    tmp = _mm256_permute_pd( x2, 0b0101 );
    x2 = _mm256_add_pd( x2, tmp );
    return _mm256_sqrt_pd( x2 );
};
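
For a quick sanity check, a small driver along these lines (assuming either ymmnorm version above is in scope, built with AVX2 enabled, e.g. -mavx2) should print sqrt(30) ≈ 5.477226 in all four positions:

#include <immintrin.h>
#include <stdio.h>

int main( void )
{
    // |(1, 2, 3, 4)| = sqrt(1 + 4 + 9 + 16) = sqrt(30)
    __m256d v = _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 );
    double out[4];
    _mm256_storeu_pd( out, ymmnorm( v ) );
    printf( "%f %f %f %f\n", out[0], out[1], out[2], out[3] );
    return 0;
}
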
Soonts
  • As discussed in comments, `_mm256_sqrt_pd` has worse throughput than `_mm_sqrt_sd` on most CPUs. It's still 1 uop (at least on Skylake), so if this is an occasional part of a larger loop body (that doesn't use other div / sqrt operations) your 2nd version is good. But if you could be running into sqrt throughput bottlenecks, the first version may be worth the extra uop. (Especially on Zen1, although you might speed it up by arranging to use `vinsertf128` instead of `vbroadcastsd` by using `_mm_sqrt_pd` and shuffles that produce the hsum in both elements of the `__m128d`.) – Peter Cordes Jul 13 '21 at 18:54
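
One way to realize the vinsertf128 variant described in that comment (a sketch under my reading of it; the function name is hypothetical and not code from the thread) is to build the horizontal sum in both elements of a `__m128d`, take a 128-bit `_mm_sqrt_pd`, and widen with `_mm256_insertf128_pd` rather than `vbroadcastsd`:

#include <immintrin.h>

__m256d ymmnorm_insertf128( __m256d x )
{
    const __m256d x2 = _mm256_mul_pd( x, x );
    // add the low and high 128-bit lanes: [a+c, b+d]
    __m128d sum2 = _mm_add_pd( _mm256_castpd256_pd128( x2 ),
                               _mm256_extractf128_pd( x2, 1 ) );
    // shuffle so that both elements hold the full sum a+b+c+d
    __m128d hsum = _mm_add_pd( sum2, _mm_shuffle_pd( sum2, sum2, 1 ) );
    // 128-bit sqrt of the two identical elements (instead of vsqrtsd)
    __m128d root = _mm_sqrt_pd( hsum );
    // widen to 256 bits with vinsertf128 instead of vbroadcastsd
    return _mm256_insertf128_pd( _mm256_castpd128_pd256( root ), root, 1 );
}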