
How does one efficiently perform horizontal addition with floats in a 512-bit AVX register (i.e. add the items from a single vector together)? For 128- and 256-bit registers this can be done with `_mm_hadd_ps` and `_mm256_hadd_ps`, but there is no `_mm512_hadd_ps`. The Intel Intrinsics Guide documents `_mm512_reduce_add_ps`. It doesn't actually correspond to a single instruction, but its existence suggests there is an optimal method; however, it doesn't appear to be defined in the header files that come with the latest snapshot of GCC, and I can't find a definition for it with Google.

I figure "hadd" can be emulated with `_mm512_shuffle_ps` and `_mm512_add_ps`, or I could use `_mm512_extractf32x4_ps` to break a 512-bit register into four 128-bit registers, but I want to make sure I'm not missing something better. Something like the sketch below is what I have in mind for the shuffle/add route.
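(Untested sketch of the shuffle/add emulation; `mm512_hadd_ps` is just an illustrative name, not a real intrinsic. Within each 128-bit lane it should match what `_mm256_hadd_ps` does.)

static inline __m512 mm512_hadd_ps(__m512 a, __m512 b) {
    // per 128-bit lane: [a0,a2,b0,b2] and [a1,a3,b1,b3]
    __m512 even = _mm512_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0));
    __m512 odd  = _mm512_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1));
    // per lane: [a0+a1, a2+a3, b0+b1, b2+b3], i.e. hadd semantics
    return _mm512_add_ps(even, odd);
}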

Z boson
Rouslan
  • What exactly are you trying to do with a horizontal operation? If it's the end of a large reduction operation, then it probably isn't even performance-critical. (Nevertheless, `_mm512_reduce_add_ps` exists for that purpose and compiles to a binary reduction of shuffles and sums.) – Mysticial Nov 12 '14 at 21:04
  • I'm not surprised this is missing, as AVX-512 is viewed a bit as a departure from the standard "double the width" improvement. Operations are already cut up into 128-bit or 256-bit uops, so horizontal instructions wouldn't make much sense yet. – Cory Nelson Nov 12 '14 at 21:10
  • @CoryNelson To make it worse, horizontal instructions are microcoded on existing processors. So they're already slow. And also, horizontally vectorized tasks violate the SIMD paradigm and don't scale. – Mysticial Nov 12 '14 at 21:15
  • To answer the question of what I am trying to do: I am trying to do dot products of vectors with sixteen or more dimensions. I try to work on multiple entities simultaneously where I can, so I don't have to do horizontal operations, but I can't always do that. – Rouslan Nov 12 '14 at 22:06
  • @Mysticial Horizontal operations are microcoded only on `AMD Bulldozer/Piledriver/Steamroller`. – Marat Dukhan Nov 13 '14 at 02:33
  • @MaratDukhan According to Agner Fog's tables, they are also microcoded on Prescott, Core 2, Nehalem, Sandy Bridge, Haswell, Atom, and Via Nano, which pretty much covers everything else. He doesn't have any information on K10, and the entry is blank for K8. – Mysticial Nov 13 '14 at 05:37
  • @Mysticial How did you conclude that? They decode to multiple uops, but that doesn't mean they are microcoded. – Marat Dukhan Nov 13 '14 at 07:19
  • @MaratDukhan Then I think we might have slightly different definitions of "microcoded" (perhaps I'm using the term incorrectly). The horizontal instructions all decode into separate arithmetic and shuffle uops, which basically means the execution units can't do them in one operation. The penalty, of course, is poor throughput. – Mysticial Nov 13 '14 at 07:58
  • Isn't this question answered by the first comment? `_mm512_reduce_add_ps` does the horizontal sum of 16 floats in an AVX512 register. – Z boson Nov 13 '14 at 08:49
  • @Mysticial, if I had to guess, microcoded means it's not broken into separate uops. For example `REP MOVS` is implemented with microcode. – Z boson Nov 13 '14 at 08:53
  • @Zboson: I'm using GCC, which doesn't have reduce_add. – Rouslan Nov 13 '14 at 23:01
  • @Rouslan, sorry I did not make that clear in my answer but yes those intrinsics apply *only* to the Intel compiler currently. – Z boson Nov 14 '14 at 08:45
  • @Rouslan, BTW, how are you using AVX512? It's not even out yet. Emulator? And Xeon Phi's 512-bit SIMD is not exactly the same as AVX512. – Z boson Nov 14 '14 at 08:47
  • @Zboson Actually, I don't even have a CPU that supports AVX. I'm working on [a little project](https://github.com/Rouslan/NTracer) that anyone can download, which includes a set of classes that provide a consistent interface regardless of SIMD support. I just felt like being thorough and supporting even the stuff I can't use yet. The classes are generated by a Python script that queries a list of intrinsics and their requirements to make supporting multiple SIMD types as painless as possible. – Rouslan Nov 14 '14 at 09:59
  • @Zboson Once I get the project to actually compile with AVX512, I'll test it with the Intel Software Development Emulator or something (Intel SDE requires disabling SELinux so I would prefer something else). – Rouslan Nov 14 '14 at 10:01
  • @Rouslan, that looks really cool! I'll try and check it out soon. I wrote a real-time Whitted-style ray tracer with OpenCL (with reflection and refraction). It has several features; solving the Fresnel equations is one of the coolest, and it makes a big difference in quality. I'm porting it to the Oculus Rift now. – Z boson Nov 14 '14 at 10:04
  • In terms of a vector class, I would just use Agner Fog's VCL, which I mentioned in my answer. Why reinvent the wheel? – Z boson Nov 14 '14 at 10:05
  • @Zboson Thanks. Re the VCL library: I didn't know about it until now. I had looked at another library, but it didn't suit my needs because it only provided support for one vector size at a time. I really didn't look very hard for an existing library because after I downloaded the offline version of the Intel Intrinsics Guide, I noticed that all the information about the intrinsics was conveniently stored in an XML file (inside the jar file) and thought "hey, I can use this to generate a common interface for all types and sizes!" It was an interesting exercise, so I don't regret it. – Rouslan Nov 14 '14 at 10:35
  • I would just use the VCL. I have used it for about two years now, and only in a few cases did I have to implement my own intrinsics to do better. – Z boson Nov 14 '14 at 10:46

3 Answers


The Intel compiler has the following intrinsics defined to do horizontal sums:

_mm512_reduce_add_ps     //horizontal sum of 16 floats
_mm512_reduce_add_pd     //horizontal sum of 8 doubles
_mm512_reduce_add_epi32  //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64  //horizontal sum of 8 64-bit integers
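Usage is then a single call (assuming a compiler that actually provides these; at the time of writing that means ICC only):

#include <immintrin.h>

// sums all 16 floats of v into one scalar
float sum16(__m512 v) {
    return _mm512_reduce_add_ps(v);
}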

However, as far as I can tell these are broken into multiple instructions anyway, so I don't think you gain anything over summing the upper and lower halves of the AVX-512 register yourself:

// float: AVX-512F has no 256-bit float extract (that's AVX-512DQ's
// _mm512_extractf32x8_ps), so bounce through the double domain
__m256 low  = _mm512_castps512_ps256(zmm);
__m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));

// double
__m256d low  = _mm512_castpd512_pd256(zmm);
__m256d high = _mm512_extractf64x4_pd(zmm,1);

// integer
__m256i low  = _mm512_castsi512_si256(zmm);
__m256i high = _mm512_extracti64x4_epi64(zmm,1);

To get the horizontal sum you then do `sum = horizontal_add(low + high)`, where `low + high` is a vertical add (e.g. `_mm256_add_ps(low, high)` for single precision) and `horizontal_add` is one of the following:

static inline float horizontal_add (__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);    // pairwise sums within each 128-bit lane
    __m256 t2 = _mm256_hadd_ps(t1,t1);  // each lane now holds its sum in every element
    __m128 t3 = _mm256_extractf128_ps(t2,1);                // high lane
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);  // low lane + high lane
    return _mm_cvtss_f32(t4);
}

static inline double horizontal_add (__m256d a) {
    __m256d t1 = _mm256_hadd_pd(a,a);   // pairwise sums within each 128-bit lane
    __m128d t2 = _mm256_extractf128_pd(t1,1);                // high lane
    __m128d t3 = _mm_add_sd(_mm256_castpd256_pd128(t1),t2);  // low lane + high lane
    return _mm_cvtsd_f64(t3);
}
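Putting the pieces together, a complete 512-bit reduction might look like this (a sketch built from the snippets above; `horizontal_add_zmm` is just an illustrative name):

static inline float horizontal_add_zmm (__m512 zmm) {
    __m256 low  = _mm512_castps512_ps256(zmm);
    __m256 high = _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(zmm),1));
    return horizontal_add(_mm256_add_ps(low, high));  // sum = horizontal_add(low + high)
}

static inline double horizontal_add_zmm (__m512d zmm) {
    __m256d low  = _mm512_castpd512_pd256(zmm);
    __m256d high = _mm512_extractf64x4_pd(zmm,1);
    return horizontal_add(_mm256_add_pd(low, high));
}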

I got all this information and these functions from Agner Fog's Vector Class Library and the Intel Intrinsics Guide online.

Z boson
  • Are you sure there aren't `_ps` versions of the extract-high-256 intrinsic? It seems really weird to cast to `_pd` there. But yes, a good first step is to extract the high 256 bits and do a vertical add. But then do the same thing down to 128, and then use better shuffles than `vhaddps`, which costs 2 shuffle uops + a vertical add. See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86. – Peter Cordes Feb 28 '18 at 20:31
  • I would generally prefer to use the direct `reduce_add` intrinsic because it clearly expresses intent both to the human reader of the code and to the compiler, which usually optimizes better when it knows what you're really trying to do. – Alex Reinking Sep 14 '21 at 18:40
  • @AlexReinking: Yes, and compilers expand them to patterns that don't inefficiently use `vhaddps` / `pd` (3 uops each). e.g. https://godbolt.org/z/PboP3aneK shows clang using `vpshufd` / `vpaddq` and stuff like that. See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) and [Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658) which looks at asm output for GCC/clang. They're pretty good. – Peter Cordes Feb 25 '22 at 22:51

I'll give Z boson the check, as the post does answer my question, but I think the exact sequence of instructions can be improved upon:

inline float horizontal_add(__m512 a) {
    // fold the four 128-bit lanes down to one: lanes 0+2 and 1+3, then 0+1
    __m512 tmp = _mm512_add_ps(a,_mm512_shuffle_f32x4(a,a,_MM_SHUFFLE(0,0,3,2)));
    __m128 r = _mm512_castps512_ps128(_mm512_add_ps(tmp,_mm512_shuffle_f32x4(tmp,tmp,_MM_SHUFFLE(0,0,0,1))));
    // reduce the remaining four floats
    r = _mm_hadd_ps(r,r);
    return _mm_cvtss_f32(_mm_hadd_ps(r,r));
}
Rouslan
  • I'm glad you found a better solution that works for you. You can get a free non-commercial version of the Intel compiler for Linux and then you could look at the disassembly to see what it does with `_mm512_reduce`. But you should keep in mind that you should *not* be doing horizontal add in critical loops. It defeats the purpose of SIMD. – Z boson Nov 14 '14 at 08:41
  • @Zboson A free version of the Intel compiler would be nice but when I go to the Non-Commercial Software Development section of Intel's website, it just has one page saying "this site is under revision." It's been that way for some time. As for the horizontal add comment: I know, but when it can't be avoided, it's better than adding 16 numbers together, one at a time. And it's not like I'm trying to optimize a single operation; I have a special array (C++) class that hides all the SIMD code (which is also the basis of my vector class), that I'm trying to optimize. – Rouslan Nov 14 '14 at 09:31
  • That's bad news. I was not aware the non-commercial software version is "under revision". Well, if it's any consolation, ICC is overrated in my opinion, except for its libraries (e.g. MKL), which are very good. – Z boson Nov 14 '14 at 09:38
  • Now that I look at this, I am not sure it's better than my solution. Your solution and mine both use six instructions (ignoring the `_mm_cvtss_f32`). You have just written yours in a form that makes it look shorter because you packed multiple intrinsics per line. Your solution is still interesting, though. – Z boson Dec 29 '15 at 19:00
  • But your solution calls horizontal_add(__m256) twice. Assuming the calls are inlined, that's a total of ten instructions. – Rouslan Jan 05 '16 at 05:33
  • Oh...yeah...ehhh... I meant `sum = horizontal_add(low + high)`. I fixed my answer. – Z boson Jan 05 '16 at 06:37
  • where `low + high = _mm256_add_ps(low,high)` for single. – Z boson Jan 05 '16 at 09:40
  • Is there also a solution for double precision and `__m512d` (other than `_mm512_reduce_add_pd`, which is too slow, I guess)? – boraas Sep 23 '19 at 09:35
  • Yours uses `_mm_hadd_ps` twice, each of which costs 2 shuffle + 1 add uop. That's pretty bad when you actually only need 1 shuffle + 1 add to extract the high half. (https://agner.org/optimize/). And BTW, it might be more efficient to extract the high half instead of using an `_mm512` full-width shuffle and add. Reducing to 256-bit earlier might allow Skylake-avx512 CPUs to start running SIMD uops on port 1 again slightly sooner. And any future (AMD?) CPUs that implement AVX512 by splitting into 2x 256-bit will benefit. – Peter Cordes Sep 23 '19 at 19:54
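For reference, my reading of that suggestion as code (an untested sketch; the name is illustrative). It narrows to 256 bits, then 128, then uses plain shuffle+add steps instead of `_mm_hadd_ps`:

static inline float horizontal_add_sketch(__m512 a) {
    // reduce to 256-bit as early as possible
    __m256 v8 = _mm256_add_ps(_mm512_castps512_ps256(a),
                              _mm256_castpd_ps(_mm512_extractf64x4_pd(_mm512_castps_pd(a),1)));
    // reduce to 128-bit
    __m128 v4 = _mm_add_ps(_mm256_castps256_ps128(v8), _mm256_extractf128_ps(v8,1));
    __m128 v2 = _mm_add_ps(v4, _mm_movehl_ps(v4,v4));   // [0+2, 1+3, ...]
    __m128 v1 = _mm_add_ss(v2, _mm_movehdup_ps(v2));    // element 0 = total
    return _mm_cvtss_f32(v1);
}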

Horizontal sum for double precision:

static inline double _mm512_horizontal_add(__m512d a){
    // sum the two 256-bit halves, then the two 128-bit halves
    __m256d b = _mm256_add_pd(_mm512_castpd512_pd256(a), _mm512_extractf64x4_pd(a,1));
    __m128d d = _mm_add_pd(_mm256_castpd256_pd128(b), _mm256_extractf128_pd(b,1));
    // add the high element into the low one with a shuffle; the pointer cast
    // used previously was strict-aliasing UB and could force a store/reload
    return _mm_cvtsd_f64(_mm_add_sd(d, _mm_unpackhi_pd(d,d)));
}

edit: applied comments of Peter Cordes

boraas
  • I wouldn't recommend `hadd_pd`: it costs 2 shuffle + 1 add uops, instead of just 1 shuffle for a manual extract. Also, you're using the `+` operator, which is a GNU C native vector extension. You're also depending on the gcc/clang definition of `__m512i` as a vector of `long long`, so the `+` there is `_mm256_add_epi64`, not some other integer width. I don't think that's ever going to change, but it's not generally good style, IMO. – Peter Cordes Sep 23 '19 at 19:58
  • Good edit until strict-aliasing undefined behaviour from pointer-casting. Just use another shuffle like a normal person, e.g. `_mm_unpackhi_pd`, instead of tempting the compiler into spilling `d` and doing a scalar reload. You can `_mm_cvtsd_f64` both halves if you want for a scalar `+`, or use `_mm_add_sd` or `_pd`. [Fastest way to do horizontal float vector sum on x86](//stackoverflow.com/a/35270026) shows a hack using `movhlps` which is possibly worth it without AVX, but pointless with AVX to avoid `movaps` copies. See `highhalf_pd` in that answer. – Peter Cordes Sep 24 '19 at 21:34