10

Is there a way to XOR horizontally an AVX register—specifically, to XOR the four 64-bit components of a 256-bit register?

The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.

The scalar code is:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
Cody Gray - on strike
Serge Rogatch
  • This might be helpful https://stackoverflow.com/questions/42040937/horizontal-xor-of-two-sse-values – NathanTempelman Jul 05 '17 at 21:18
  • Nothing built-in, it's easy to implement manually. – harold Jul 05 '17 at 21:50
  • It would probably be faster to do this using non-SIMD instructions. You need three `XOR`s and you're done. (Especially if you want the result in an integer register anyway, which is what the code sample implies.) – Cody Gray - on strike Jul 06 '17 at 11:33
  • @CodyGray , so is this code good as is? Or can it be faster with some get/extract instructions on the `YMM` register containing `t` parameter? – Serge Rogatch Jul 06 '17 at 11:36
  • Well, how good the code is depends on which compiler you're using. :-) I'm assuming that the use of `m256i_u64` means MSVC? (This doesn't compile in GCC or Clang, AFAIK.) And [the output in MSVC looks pretty good](https://godbolt.org/g/5JFvLJ). Pretty hard to imagine that you could beat a few extracts and moves. Have you profiled that this is actually a bottleneck? – Cody Gray - on strike Jul 06 '17 at 11:46
  • @CodyGray , yes, it's MSVC++2017 . I'm currently far before profiling phase - in deep implementation. But this horizontal xor is in the heart of a random number generator, so it's expected to be a bottleneck in some use-cases. – Serge Rogatch Jul 06 '17 at 11:56
  • I'm not really sure how you got yourself into a situation where you need to do horizontal operations in the first place. SIMD operations are designed to scale *vertically*, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. Generate the 4 random numbers in 4 *different* AVX registers. – Cody Gray - on strike Jul 06 '17 at 12:15
  • @CodyGray, indeed, that's a great idea, thanks! Still an answer to this question may be useful for someone, I think. – Serge Rogatch Jul 06 '17 at 12:25
  • You've been asking a bunch of good but small x86 questions these days... Clearly you're working on something bigger. It's like a version of an X-Y problem. Maybe you could show us the bigger picture and we can contribute? – Iwillnotexist Idonotexist Jul 06 '17 at 15:48
  • Is your `t.m256i_u64[0]` etc actually portable? Looks very much like a compiler-specific extension to me. Which compiler? – Walter Jul 06 '17 at 15:50 (a portable sketch follows these comments)
  • @IwillnotexistIdonotexist , thanks, I've pushed what I'm doing to https://github.com/srogatch/ProbQA . It has large cube in its heart: `nAnswers` * `nQuestions` * `nTargets` and a few less-dimensional arrays containing aggregates. I'm currently implementing CPU engine for it (well, it's x86_64 engine only, but I don't plan it for e.g. ARM yet, and supercomputer engine would have its own name), but CUDA and network grid engines are also planned. Mathematically it's based on Bayesian formula and naive Bayes assumption. – Serge Rogatch Jul 06 '17 at 17:34
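
A portable rewrite of the scalar function in the question, since the m256i_u64 member discussed in the comments is MSVC-specific. This is only a sketch (the name HorizontalXorPortable is just illustrative); compilers typically turn the fixed-size memcpy into a plain store or direct register extracts:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

inline uint64_t HorizontalXorPortable(__m256i t) {
  uint64_t v[4];
  memcpy(v, &t, sizeof(v));            // well-defined way to reinterpret the vector
  return v[0] ^ v[1] ^ v[2] ^ v[3];
}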

3 Answers

14

As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.

MSVC already generates pretty good code for the scalar function that you show as an example in the question:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}

Assuming that t is in ymm1, the resulting disassembly will be something like this:

vextractf128 xmm0, ymm1, 1
vpextrq      rax,  xmm0, 1
vmovq        rcx,  xmm1
xor          rax,  rcx
vpextrq      rcx,  xmm1, 1
vextractf128 xmm0, ymm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm0
xor          rax,  rcx

…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.

You can slightly improve it by using intrinsics:

inline uint64_t _mm256_hxor_epu64(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return (uint64_t&)x
          ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
          ^ (uint64_t&)(temp)
          ^ (uint64_t)(_mm_extract_epi64(temp, 1));
}

Then you'll get the following disassembly (again, assuming that x is in ymm1):

vextracti128 xmm2, ymm1, 1
vpextrq      rcx,  xmm2, 1
vpextrq      rax,  xmm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm1
xor          rax,  rcx
vmovq        rcx,  xmm2
xor          rax,  rcx

Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although this choice probably does not matter).

You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):

vextracti128 xmm2, ymm0, 0x1
vpextrq      rax,  xmm0, 1
vmovq        rdx,  xmm2
vpextrq      rcx,  xmm2, 1
xor          rax,  rdx
vmovq        rdx,  xmm0
xor          rax,  rdx
xor          rax,  rcx

The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order them as it deems best. Clang 4.0 schedules them yet differently:

vmovq        rax,  xmm0
vpextrq      rcx,  xmm0, 1
xor          rcx,  rax
vextracti128 xmm0, ymm0, 1
vmovq        rdx,  xmm0
xor          rdx,  rcx
vpextrq      rax,  xmm0, 1
xor          rax,  rdx

And, of course, this ordering is always subject to change when the code gets inlined.


On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I guess you would just compute the 64-bit result as above and broadcast it to all elements, something like:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return _mm256_set1_epi64x((uint64_t&)x
                             ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
                             ^ (uint64_t&)(temp)
                             ^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}

But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.

Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different 64-bit lanes (or 4 different AVX registers), rather than packing them all into one and reducing across it.
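
Since the comments mention that this feeds a random number generator, here is a minimal sketch of that vertical layout, assuming AVX2 and using a plain xorshift64 recurrence purely for illustration (the name Xorshift64x4 is hypothetical): four independent 64-bit states sit in the four lanes of one __m256i and are all stepped together, so no horizontal reduction is needed at all.

#include <immintrin.h>

// Sketch only: four independent xorshift64 streams, one per 64-bit lane.
// The 13/7/17 shifts are the classic xorshift64 constants; each lane must
// be seeded with a distinct non-zero value by the caller.
inline __m256i Xorshift64x4(__m256i state)
{
    state = _mm256_xor_si256(state, _mm256_slli_epi64(state, 13));
    state = _mm256_xor_si256(state, _mm256_srli_epi64(state, 7));
    state = _mm256_xor_si256(state, _mm256_slli_epi64(state, 17));
    return state;   // four new 64-bit outputs, one per lane
}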

If you actually want 4 copies of the result packed into an AVX register, computed without bouncing the value through an integer register, then you could do something like this:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m256i temp = _mm256_xor_si256(x,
                                         _mm256_permute2f128_si256(x, x, 1));    
   return _mm256_xor_si256(temp,
                           _mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}

This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.

If it helps to visualize it, this basically does:

   A          B          C          D       ⟵ input
  XOR        XOR        XOR        XOR
   C          D          A          B       ⟵ permuted input
===========================================
  A^C        B^D        A^C        B^D      ⟵ intermediate result
  XOR        XOR        XOR        XOR
  B^D        A^C        B^D        A^C      ⟵ shuffled intermediate result
===========================================
A^C^B^D    A^C^B^D    A^C^B^D    A^C^B^D    ⟵ final result

On practically all compilers, these intrinsics will produce the following assembly code:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2

(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)

And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2
vmovq       rax,  xmm0

The question is, would this be faster than the scalar code above? The answer is: yes, probably. Although you are doing the XORs in the AVX execution units instead of the completely separate integer execution units, fewer AVX shuffles/permutes/extracts need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.

Cody Gray - on strike
  • Good visual explanation about the XORs! – wim Jul 07 '17 at 09:50
  • For swapping the two lanes of a ymm register, `vpermq` should be preferred over `vperm2i128`. It only has one input, which makes it much faster on Ryzen and KNL. They're the same performance on Intel Haswell/Skylake. – Peter Cordes Jul 08 '17 at 04:21
  • Of course, `vextracti128` is even better on Ryzen, and 128b operations are only a single uop. If you don't need the result broadcast to all elements, narrowing down to 128b as early as possible is a good strategy for horizontal ops in general, including this one. But `vpextrq` is relatively expensive in uop count, so it does make sense to shuffle/xor down to a scalar in the bottom of an xmm register, then use one `vmovq` (to an integer register or to memory). The same applies to other horizontal ops, [including integer sums](https://stackoverflow.com/a/35270026/224132). – Peter Cordes Jul 08 '17 at 04:25 (a sketch of this narrowing approach follows these comments)
  • *completely separate integer execution units*: they're on the same ports as the vector execution units in Intel CPUs, except for Haswell and later's port6 which has integer ALUs (and the taken-branch unit) but no vector execution units. So there's only a tiny amount of extra ALU throughput to be gained from mixing in scalar instructions, but it costs a lot of front-end throughput and p0 / p5 uops to get the data to it. (Front-end throughput is a problem for this idea on AMD, too, even though the integer and vector uops use different pipes). – Peter Cordes Jul 08 '17 at 04:35
  • Worth considering: an in-lane `vpshufd`, `vpxor ymm`, `vextracti128`, 2x `vmovq`, 2x scalar `xor`. 7 uops total on Intel, and the first `vmovq` can execute while the 2nd is waiting for the `vextracti128` result. On Intel, latency is no better than your final sequence, but it costs more total uops (which need to run in parallel for latency to not be worse). So it can't overlap as well with surrounding code. – Peter Cordes Jul 08 '17 at 04:38
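
A minimal sketch of the narrowing approach described in the comments above (reduce to 128 bits first, XOR the two 64-bit halves inside the XMM register, then a single VMOVQ). This assumes AVX2, and the function name is only illustrative:

#include <immintrin.h>
#include <stdint.h>

inline uint64_t HorizontalXorNarrowFirst(__m256i x)
{
    const __m128i lo = _mm256_castsi256_si128(x);       // elements 0,1 (no instruction needed)
    const __m128i hi = _mm256_extracti128_si256(x, 1);  // elements 2,3 (vextracti128)
    __m128i v = _mm_xor_si128(lo, hi);                  // {0^2, 1^3}
    v = _mm_xor_si128(v, _mm_unpackhi_epi64(v, v));     // low qword = 0^1^2^3
    return (uint64_t)_mm_cvtsi128_si64(v);              // vmovq
}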
4

Vectorization is likely to be useful if the input of the horizontal XOR function is already in an AVX register, i.e. your t is the result of some SIMD computation. Otherwise, scalar code is likely to be faster, as already mentioned by @Cody Gray. Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'. In this case one step is a 'shuffle/permute' and an 'xor'. This is slightly more efficient than @Cody Gray's _mm256_hxor function:

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);       // swap the high and low 128-bit lanes
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);    // swap the 64-bit elements within each 128-bit lane
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return x3;
}

This compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0


If you want the result in a scalar register:

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}

Which compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax


Full test code:

#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/*  gcc -O3 -Wall -m64 -march=broadwell hor_xor.c   */
int print_vec_uint64(__m256i v);

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */ 
/*
    printf("3...0        =          3          2          1          0\n");
    printf("x            = ");print_vec_uint64(x        );
    printf("x0           = ");print_vec_uint64(x0        );
    printf("x1           = ");print_vec_uint64(x1        );
    printf("x2           = ");print_vec_uint64(x2        );
    printf("x3           = ");print_vec_uint64(x3        );
*/
    return x3;
}

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}


int main() {
    __m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
//    __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);

    printf("x            = ");print_vec_uint64(x);

    __m256i y = _mm256_hxor_v2(x);

    printf("y            = ");print_vec_uint64(y);

    uint64_t z = _mm256_hxor_v2_uint64(x);

    printf("z =  %10lX  \n",z);

    return 0;
}


int print_vec_uint64(__m256i v){
    uint64_t t[4];
    _mm256_storeu_si256((__m256i *)t,v);
    printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
    return 0;
}
wim
  • Indeed, my original solution was sub-optimal. I posted the answer right before turning in for the night, and then on my way to bed realized a better solution. Having come back to update, I see that you had already posted it. I went ahead and updated my answer for completeness, but have an upvote! – Cody Gray - on strike Jul 07 '17 at 08:09
  • @CodyGray Loosely speaking, the SIMD complexity of 'simple' horizontal operations, such as horizontal sum, product, minimum, maximum, logical and, etc. is often O(log(n)) instead of O(n), where n is the number of elements in the SIMD register. Sometimes this is quite obvious, for example with the [horizontal minimum](https://stackoverflow.com/a/43271592/2439725). Sometimes it is less obvious, [such as this one](https://stackoverflow.com/a/43392973/2439725). – wim Jul 07 '17 at 09:51
  • Most of [my comments on Cody's update](https://stackoverflow.com/questions/44935902/horizontal-xor-in-avx#comment76940009_44953526) apply here too: reduce down to 128b as the first step (faster on Ryzen and Excavator), and avoid `vperm2i128` when you don't need it. `vextracti128` is excellent on Ryzen, and `vpermq` is better than `vperm2?128` for swapping upper/lower lanes. – Peter Cordes Jul 08 '17 at 04:43 (a `vpermq`-based sketch follows these comments)
  • When you *do* want a result broadcast to every element instead of reducing to 128 and then scalar, doing the in-lane shuffle first is probably *slightly* better, since the lower latency means more uops/instructions can execute (and retire) sooner, freeing up space in the reservation station and ROB. It's probably non-trivial to even construct an artificial test that could measure the difference, but I think it can't hurt. This also applies when reducing to scalar, but in that case staying 256b for longer means extra uops on AMD CPUs, so I'd recommend reducing to 128b first. – Peter Cordes Jul 08 '17 at 04:46
  • @PeterCordes Thanks for your insightful comments! I didn't even think about Ryzen when I wrote my answer. I'll update my answer later on. – wim Jul 08 '17 at 08:37
  • :) Even without Ryzen, presumably narrowing to 128b ASAP has energy/power advantages. Might be more relevant for FP add than for XOR, but still very small. Also, a speed advantage on e.g. Skylake if the CPU is still in AVX "warm-up" mode where the upper lane isn't active yet. – Peter Cordes Jul 08 '17 at 08:41
2

A direct analogue of _mm256_hadd_epi32() for XOR would look something like this:

#include <immintrin.h>

template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
    return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}

inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
    return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}

int main()
{
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i c = _mm256_hxor_epi32(a, b);
    return 0;
}
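
For the four 64-bit elements the question actually asks about, a similar pairwise sketch is possible; the name _mm256_hxor_epi64 below is hypothetical, and it pairs the quadwords with unpack intrinsics (AVX2) instead of _mm256_shuffle_ps:

inline __m256i _mm256_hxor_epi64(__m256i a, __m256i b)
{
    // Pairwise XOR of adjacent 64-bit elements, mirroring the in-lane
    // result layout of _mm256_hadd_epi32: {a0^a1, b0^b1, a2^a3, b2^b3}.
    return _mm256_xor_si256(_mm256_unpacklo_epi64(a, b), _mm256_unpackhi_epi64(a, b));
}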
ErmIg
  • I've edited the question to clarify the goal. Please, look. Also I'm afraid the code like above is slower than to just XOR 64-bit components of `__m256i` register: 4 components need 3 scalar XOR operations. – Serge Rogatch Jul 06 '17 at 08:42
  • @SergeRogatch Could you write scalar code that you want to optimize with using AVX? – ErmIg Jul 06 '17 at 09:40