19

I have a __m256d vector packed with four 64-bit floating-point values.
I need to find the horizontal maximum of the vector's elements and store the result in a double-precision scalar value.

My attempts all ended up using a lot of shuffling of the vector elements, making the code neither elegant nor efficient. Also, I found it impossible to stay only in the AVX domain: at some point I had to use SSE 128-bit instructions to extract the final 64-bit value. However, I would like to be proven wrong on this last statement.

So the ideal solution will:
1) use only AVX instructions.
2) minimize the number of instructions (I am hoping for no more than 3-4 instructions).

Having said that, any elegant/efficient solution will be accepted, even if it doesn't adhere to the above guidelines.

Thanks for any help.

-Luigi

Luigi Castelli
  • That's a tough one... Are you doing this with only 1 vector? Or do you have many vectors for which you need to find the max? You can (fairly) efficiently do 4 of these in parallel with a 4 x 4 vector transpose... – Mysticial Mar 20 '12 at 22:28
  • @Mysticial: Well... I am dealing with many vectors. However the simplicity of the processing doesn't justify two 4x4 transpose operations for every iteration. So I am processing everything "horizontally" without transposition. I am getting a great speed-up that way, close to 4x, because I am avoiding the overhead of transposing. Everything is in a tight loop manually unrolled 4 times. However, when the loop is over I am left with one last AVX vector. I have to find the greatest of its four elements in order to store the result back into my double-precision scalar value. Hence my question... – Luigi Castelli Mar 20 '12 at 22:56
  • If it isn't in the "tight loop", is it even performance critical? – Mysticial Mar 20 '12 at 22:59
  • This time, not really... :) but I know I will run into a situation where it will be performance critical. That's why I formulated the question the way I did... – Luigi Castelli Mar 20 '12 at 23:01
  • Ah :) In that case, the best way to do this would probably be highly specific to how it's used. In other words, it's not vectorizable at this level, but can you push it to a higher level... – Mysticial Mar 20 '12 at 23:09
  • What do you mean by "pushing it to a higher level" ? – Luigi Castelli Mar 20 '12 at 23:13
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9115/discussion-between-luigi-castelli-and-mysticial) – Luigi Castelli Mar 20 '12 at 23:18
  • Note that you can stay in the AVX domain while using 128-bit instructions. There are actually 3 kinds of instructions: AVX256, AVX128 and legacy SSE128. A switch between the first two and the latter is to be avoided; it's costly on Intel (not so on AMD), but the first two can be mixed almost freely (you may have to insert `vzeroupper` sometimes) – Gunther Piez Mar 21 '12 at 09:09

4 Answers

23

I don't think you can do much better than 4 instructions: 2 shuffles and 2 comparisons.

__m256d x = ...; // input

__m128d y = _mm256_extractf128_pd(x, 1); // extract x[2] and x[3]
__m128d m1 = _mm_max_pd(_mm256_castpd256_pd128(x), y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3])
__m128d m2 = _mm_permute_pd(m1, 1); // set m2[0] = m1[1], m2[1] = m1[0]
__m128d m = _mm_max_pd(m1, m2); // both m[0] and m[1] contain the horizontal max(x[0], x[1], x[2], x[3])

Trivial modification to only work with 256-bit vectors:

__m256d x = ...; // input

__m256d y = _mm256_permute2f128_pd(x, x, 1); // permute 128-bit values
__m256d m1 = _mm256_max_pd(x, y); // m1[0] = max(x[0], x[2]), m1[1] = max(x[1], x[3]), etc.
__m256d m2 = _mm256_permute_pd(m1, 5); // set m2[0] = m1[1], m2[1] = m1[0], etc.
__m256d m = _mm256_max_pd(m1, m2); // all m[0] ... m[3] contain the horizontal max(x[0], x[1], x[2], x[3])

(untested)
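
For completeness, here is a minimal sketch (the helper name is mine, not part of the answer) that wraps the first sequence and extracts the scalar with `_mm_cvtsd_f64`, using `_mm_unpackhi_pd` for the second shuffle as suggested in the comment below:

#include <immintrin.h>

// Sketch only: reduce a __m256d to its maximum element as a scalar double.
static inline double hmax256_pd(__m256d x)
{
    __m128d hi = _mm256_extractf128_pd(x, 1);               // x[2], x[3]
    __m128d m1 = _mm_max_pd(_mm256_castpd256_pd128(x), hi); // max(x[0], x[2]), max(x[1], x[3])
    __m128d m2 = _mm_unpackhi_pd(m1, m1);                   // bring element 1 down to element 0
    __m128d m  = _mm_max_sd(m1, m2);                        // final maximum in element 0
    return _mm_cvtsd_f64(m);                                // move the low double to a scalar
}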

Norbert P.
  • The all-256 version is good on Intel CPUs if you need the result broadcasted, but it's much slower on Ryzen. See [Get sum of values stored in __m256d with SSE/AVX](https://stackoverflow.com/a/49943540). (And BTW, `_mm_unpackhi_pd` is 2 bytes shorter than `_mm_permute_pd`, so use that if you only want a scalar result. No immediate needed, and can use a 2-byte VEX prefix.) – Peter Cordes May 05 '18 at 13:37

13

The general way of doing this for a vector v1 = [A, B, C, D] is

  1. Permute v1 to v2 = [C, D, A, B] (swap 0th and 2nd elements, and 1st and 3rd ones)
  2. Take the max; i.e. v3 = max(v1,v2). You now have [max(A,C), max(B,D), max(A,C), max(B,D)]
  3. Permute v3 to v4, swapping the 0th and 1st elements, and the 2nd and 3rd ones.
  4. Take the max again, i.e. v5 = max(v3,v4). Now v5 contains the horizontal max in all of its components.

Specifically for AVX, the permutations can be done with _mm256_permute_pd and the maximums can be done with _mm256_max_pd. I don't have the exact permute masks handy but they should be pretty straightforward to figure out.
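
As a concrete sketch of those masks (with one caveat, elaborated in the comments below: `_mm256_permute_pd` only shuffles within each 128-bit lane, so the lane-crossing swap in step 1 needs `_mm256_permute2f128_pd`):

__m256d v1 = ...;                               // input: [A, B, C, D]
__m256d v2 = _mm256_permute2f128_pd(v1, v1, 1); // step 1: [C, D, A, B] (swap the 128-bit halves)
__m256d v3 = _mm256_max_pd(v1, v2);             // step 2: [max(A,C), max(B,D), max(A,C), max(B,D)]
__m256d v4 = _mm256_permute_pd(v3, 5);          // step 3: swap neighbours within each 128-bit lane
__m256d v5 = _mm256_max_pd(v3, v4);             // step 4: horizontal max in every element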

Hope that helps.

celion
  • I especially like your solution, because so far it is the only one that uses AVX instructions exclusively, without ever leaving the 256-bit domain. Thanks. – Luigi Castelli Mar 21 '12 at 08:12
  • Sorry, I talked too soon... You can't do that with AVX. Most AVX operations do not cross the 128-bit boundary. So in this case you cannot swap the 0th and 2nd elements and the 1st and 3rd. The AVX permute operation only allows you to swap the 0th and 1st elements or the 2nd and 3rd. – Luigi Castelli Mar 21 '12 at 08:23
  • @LuigiCastelli: my solution can be written so as to never leave the 256-bit domain, if you want. Replace `_mm256_extractf128_pd` by `_mm256_permute2f128_pd(x, x, 1)`, `__m128d` by `__m256d`, and `_mm_...` by `_mm256_...`, `_mm_permute_pd(m1, 1)` by `_mm256_permute_pd(m1, 5)`. – Norbert P. Mar 21 '12 at 08:43

3

// Use this code to find the horizontal maximum of a __m256 vector of 8 floats
__m256 v1 = initial_vector;             // example: v1 = [1 2 3 4 5 6 7 8]
__m256 v2 = _mm256_permute_ps(v1, 147); // 147 rotates the upper 4 and the lower 4 elements separately: v2 = [2 3 4 1 6 7 8 5]
__m256 v3 = _mm256_max_ps(v1, v2);      // v3 = [2 3 4 4 6 7 8 8]
__m256 v4 = _mm256_permute_ps(v3, 147); // v4 = [3 4 4 2 7 8 8 6]
__m256 v5 = _mm256_max_ps(v3, v4);      // v5 = [3 4 4 4 7 8 8 8]
__m256 v6 = _mm256_permute_ps(v5, 147); // v6 = [4 4 4 3 8 8 8 7]
__m256 v7 = _mm256_max_ps(v5, v6);      // v7 holds the max of the lower 4 elements and the max of the upper 4 elements: v7 = [4 4 4 4 8 8 8 8]

// To get the max of the whole vector, compare one element from each 128-bit half.
alignas(32) float max_array[8];         // 32-byte alignment required by _mm256_store_ps
float horizontal_max;
_mm256_store_ps(max_array, v7);
if (max_array[3] > max_array[7])
{
    horizontal_max = max_array[3];
}
else
{
    horizontal_max = max_array[7];
}
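
A leaner alternative for the float case, sketched after the approach Peter Cordes describes in the comment below (reduce to 128 bits first, then shuffle within the XMM register; the helper name is mine):

#include <immintrin.h>

// Sketch only: horizontal maximum of 8 floats, narrowing to 128 bits first.
static inline float hmax256_ps(__m256 v)
{
    __m128 hi = _mm256_extractf128_ps(v, 1);               // upper 4 floats
    __m128 m  = _mm_max_ps(_mm256_castps256_ps128(v), hi); // 4 pairwise maxima
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));                // fold the high pair onto the low pair
    m = _mm_max_ss(m, _mm_movehdup_ps(m));                 // fold element 1 onto element 0
    return _mm_cvtss_f32(m);                               // scalar result
}
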
joyx
  • It will take one extra step for float vectors, but storing to an array and doing a scalar comparison isn't one of the steps. You still want to start with an `extractf128` / 128bit `maxps`. Doing in-lane stuff first isn't any better on Intel CPUs, and definitely worse on AMD CPUs where 256b AVX ops are twice as expensive as 128b AVX ops. Either way, a 256b store and then two loads -> a scalar compare is just silly, and slower than an `extractf128`. – Peter Cordes Jan 21 '16 at 03:41

0

This doesn't specifically answer your question since you are using doubles, but here is the code I used to get the maximum of 8 single-precision values. It is built off of the answers by @celion and @Norbert P.

#define HORIZONTAL_MAX_256(ymmA, result) \
                                                            /*      [upper   |   lower]                                                                         */ \
                                                            /*      [7 6 5 4 | 3 2 1 0]                                                                         */ \
    __m256 v1 = ymmA;                                       /* v1 = [H G F E | D C B A]                                                                         */ \
    __m256 v2 = _mm256_permute_ps(v1, 0b10'11'00'01);       /* v2 = [G H E F | C D A B]                                                                         */ \
    __m256 v3 = _mm256_max_ps(v1, v2);                      /* v3 = [W=max(G,H) W=max(G,H) Z=max(E,F) Z=max(E,F) | Y=max(C,D) Y=max(C,D) X=max(A,B) X=max(A,B)] */ \
                                                            /* v3 = [W W Z Z | Y Y X X]                                                                         */ \
    __m256 v4 = _mm256_permute_ps(v3, 0b00'00'10'10);       /* v4 = [Z Z W W | X X Y Y]                                                                         */ \
    __m256 v5 = _mm256_max_ps(v3, v4);                      /* v5 = [J=max(Z,W) J=max(Z,W) J=max(Z,W) J=max(Z,W) | I=max(X,Y) I=max(X,Y) I=max(X,Y) I=max(X,Y)] */ \
                                                            /* v5 = [J J J J | I I I I]                                                                         */ \
    __m128 v6 = _mm256_extractf128_ps(v5, 1);               /* v6 = [- - - - | J J J J]                                                                         */ \
    __m128 v7 = _mm_max_ps(_mm256_castps256_ps128(v5), v6); /* v7 = [- - - - | M=max(I,J) M=max(I,J) M=max(I,J) M=max(I,J)]                                     */ \
                                                            /* v7 = [- - - - | M M M M]                                                                         */ \
                                                            /* M = max(I,J)                                                                                     */ \
                                                            /* M = max(max(X,Y),max(Z,W))                                                                       */ \
                                                            /* M = max(max(max(A,B),max(C,D)),max(max(E,F),max(G,H)))                                           */ \
    _mm_store_ss(&result, v7);
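
A hypothetical usage of the macro (values are illustrative only):

__m256 data = _mm256_set_ps(8.f, 7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f);
float result;
HORIZONTAL_MAX_256(data, result);   // result == 8.0f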

edit

Switching to the VCL2 (Vector Class Library 2) library, the compiler produces assembly similar to what Peter Cordes describes in the comments. Here is the assembly that was generated for my project:

vextractf128    xmm1, ymm0, 1     # ymm0 is the register to find the max of
vmaxps          xmm1, xmm0, xmm1
vpermilpd       xmm2, xmm1, 3
vmaxps          xmm1, xmm2, xmm1
vpsrldq         xmm2, xmm1, 4
vmaxps          xmm1, xmm2, xmm1
vbroadcastss    ymm2, xmm1        # This would be where the result is saved; it sets up the next instructions, which are specific to my code
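
For reference, a hedged sketch of what the VCL2 call site might look like (assuming VCL2's `Vec8f` type and its `horizontal_max()` reduction; the surrounding function and names are illustrative):

#include "vectorclass.h"   // VCL2 header; the include path depends on your setup

// Sketch only: VCL2's horizontal_max() should compile to a shuffle/max sequence like the one above.
float demo_hmax(const float* data)
{
    Vec8f v;
    v.load(data);                 // load 8 contiguous floats
    return horizontal_max(v);     // horizontal maximum of the 8 elements
}
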
Alex Jorgenson
  • If you want a scalar result, start with `_mm256_extractf128_ps` so you can use 128-bit operations for the rest. On Zen 1 (and the E-cores of Alder Lake), 128-bit operations are 1 uop vs. 2 for 256-bit. Also `vmovhlps` saves a couple machine code bytes (shorter VEX prefix, and no immediate) vs. `vpermilps` xmm or ymm. Also, `vpermilps ymm` has 3 cycle latency on Zen2, vs. 1 cycle for 128-bit shuffles. (https://uops.info/). See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) for more about shuffle patterns for horizontal stuff. – Peter Cordes Aug 22 '22 at 02:15
  • @PeterCordes That all makes sense. I've actually moved to using the VCL2 library and looking at the assembled code it looks to take the approach you outlined above. – Alex Jorgenson Aug 23 '22 at 03:20
  • Those are some pretty random shuffle choices. Is that VCL2 asm from LLVM's shuffle optimizer? `vmovhlps` or `vunpckhpd` would both be 2 bytes shorter than `vpermilpd xmm,xmm, 3`, although Ice Lake can only run them on port 5, not also on its extra shuffle unit on port 1 (`vshufps` can.) And [`vmovshdup`](https://www.felixcloutier.com/x86/movshdup) could replace that weird integer byte-shift for the last shuffle, bringing element 1 down to 0. – Peter Cordes Aug 23 '22 at 04:25
  • Also, you could update your answer to do operations in a more efficient order. OTOH, if you *do* want the result broadcast, you can keep it 256-bit the whole time, using `vperm2f128` and `vpermilps` or `vshufps`. That's equal efficiency on mainstream Intel CPUs at least (plus avoiding the broadcast at the end), but higher shuffle latency on Zen 2. And more uops on Zen1 and Alder Lake E-cores. – Peter Cordes Aug 23 '22 at 04:28