
I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so.

I am given a large array uint64_t data[100000] of uint64_t's, and an array unsigned char indices[100000] of bytes. I want to output an array uint64_t Out[256] where the i-th value is the xor of all data[j] such that indices[j] = i.

A straightforward implementation of what I want is this:

uint64_t Out[256] = {0};     // initialize output array
for (size_t i = 0; i < 100000; i++) {
    Out[Indices[i]] ^= data[i];
}

Can we implement this more efficiently with AVX2 instructions?

EDIT : This is what my code looks like now

uint64_t Out[256][4] = {{0}};   // initialize output array
for (size_t i = 0; i < 100000; i += 4) {
    Out[Indices[i  ]][0] ^= data[i];
    Out[Indices[i+1]][1] ^= data[i+1];
    Out[Indices[i+2]][2] ^= data[i+2];
    Out[Indices[i+3]][3] ^= data[i+3];
}
Ward Beullens
  • Probably not. AVX2 doesn't have scatter instructions. AVX-512 has, but the AVX-512 scatter instructions are not very fast, see Agner Fog's instruction tables. Unrolling might help a bit, see [here](https://stackoverflow.com/a/12986321), which is a more or less similar problem, but with `+` instead of `^`. With AVX-512 it is possible to vectorize such an unrolled loop: To avoid write conflicts with AVX-512, you can scatter the data into `Out1,...,Out4`, and combine the results afterwards. – wim May 29 '18 at 12:39
  • Thank you, I tried unrolling the loop. My profiler (callgrind) reported a much smaller cycle count (almost x2 less). But I did not notice a speedup in practice :( – Ward Beullens May 29 '18 at 12:53
  • This is a histogram problem, with XOR instead of ADD. And instead of a constant `1`, the value to ADD/XOR also comes from an array. – Peter Cordes May 29 '18 at 13:11
  • That is strange. Did you only unroll, or have you also tried multiple accumulators `Out1,...,Out4`? – wim May 29 '18 at 13:14
  • @wim: yeah, that was my thought. Use the standard histogram optimization (for small numbers of bins) of multiple arrays to avoid long dependency chains through memory store/reload. AVX2 can very nicely XOR the arrays together down to all the results in `Out1`. – Peter Cordes May 29 '18 at 13:17
  • @wim I only did the unrolling. Not the scattering or the multiple accumulators. How would I use the scattering with AVX-512? Is there some scatter instruction that xors instead of overwrites? – Ward Beullens May 29 '18 at 13:24
  • I will try having multiple accumulators. – Ward Beullens May 29 '18 at 13:26
  • I still don't see a speedup. Is it possible that my compiler was already doing the unrolling by itself? (I am compiling with -O3) – Ward Beullens May 29 '18 at 13:43
  • I would try `Out[4][256]` instead of `Out[256][4]`, but I am not sure if that helps. You can check the compiler output with `-S`. – wim May 29 '18 at 14:03
  • One approach that is "mostly scalar" but might beat any `AVX2` approach for longer inputs would be to tackle 2 bytes at a time: accumulate into `Out2` which has `2^16` entries, and index into it using two consecutive bytes from `Indices` and `xor` it against 2 consecutive `data` bytes. If you choose 16-bit (or 32-bit) counters the accumulation table will nicely (or barely) fit into L2 and you might get close to a 2x speedup. At the end you need to reduce from `Out2` into the expected 256-entry `Out` which can be done embarrassingly parallel with 32-byte `AVX2` adds. – BeeOnRope May 31 '18 at 01:13
  • Of course, this approach with its large cache impact and fixed-cost reduction step probably isn't great for an input size of `100,000`, but for a million or more elements it probably does OK. If the data is highly skewed, the cache impact during the histogram phase might be much more limited than the calculated worst case since many table entries may not ever be touched - but it is clear how to avoid touching those values anyways during the final reduction phase (a sketch of one reading of this idea follows the comment thread). – BeeOnRope May 31 '18 at 01:14
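
A minimal scalar sketch of one way to read this two-bytes-at-a-time idea for the XOR problem (the comment is phrased in terms of counters; the Pair2 struct, the Wide table, the function name and the adaptation to XOR are illustrative assumptions, and the index packing assumes little-endian):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { uint64_t lo, hi; } Pair2;       // hypothetical per-entry accumulator

void xor_histo_wide(const uint8_t *Indices, const uint64_t *data,
                    size_t len, uint64_t Out[256])
{
    static Pair2 Wide[1 << 16];                  // 1 MiB accumulation table
    memset(Wide, 0, sizeof(Wide));
    memset(Out, 0, 256 * sizeof(Out[0]));

    size_t i = 0;
    for (; i + 2 <= len; i += 2) {
        uint16_t k;
        memcpy(&k, &Indices[i], 2);              // two consecutive index bytes
        Wide[k].lo ^= data[i];                   // owned by Indices[i]   (low byte of k)
        Wide[k].hi ^= data[i+1];                 // owned by Indices[i+1] (high byte of k)
    }
    if (i < len)                                 // odd leftover element
        Out[Indices[i]] ^= data[i];

    for (uint32_t k = 0; k < (1u << 16); k++) {  // reduce 2^16 entries down to 256
        Out[k & 0xFF] ^= Wide[k].lo;
        Out[k >> 8]   ^= Wide[k].hi;
    }
}

The final reduction loop is the part that could be vectorized with 32-byte AVX2 operations, as the comment suggests.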

2 Answers


Based on static analysis for Haswell/Skylake, I came up with a version that runs in ~5 cycles per 4 i values, instead of 8 cycles, when compiled by gcc. Average for large sizes, not including the time to combine multiple copies of Out[], and assuming a random distribution of Indices that doesn't lead to any store/reload dep chains running for long enough to matter.

IDK if you care about Ryzen or Excavator (the other 2 mainstream AVX2 microarchitectures).

I haven't done a careful analysis by hand, but IACA is wrong for HSW/SKL and thinks some instructions don't micro-fuse when in fact they do (tested on an i7-6700k with perf counters), so it thinks the front-end bottleneck is more severe than it really is. e.g. movhps load+merge micro-fuses, but IACA thinks it doesn't even with simple addressing modes.

We should have negligible cache misses, because uint64_t Out[4][256] is only 8kiB. So our cache footprint is only 1/4 of L1d size on most recent CPUs, and should be mostly fine even with hyperthreading sharing L1d between two logical threads. Looping over data[] and Indices[] should prefetch well, and hopefully doesn't evict Out[] much. Thus, static analysis has a good chance of being somewhat accurate, and it's quicker than careful micro-benchmarking; more importantly, it tells you exactly what the bottlenecks are.

But of course we're relying heavily on out-of-order execution and imperfect scheduling or other unexpected bottlenecks could easily happen. I didn't feel like actually microbenchmarking if I'm not getting paid, though.

Can we implement this more efficiently with AVX2 instructions?

This is basically a histogram problem. The usual histogram optimization of using multiple tables and combining at the end applies. SIMD XOR is useful for the combine-at-the-end (as long as you use Out[4][256], not Out[256][4]; the latter also makes the indexing slower by requiring scaling by 8*4 instead of by 8, where a scale of 8 can be done with a single LEA using a scaled-index addressing mode).

But unlike a normal histogram, you're XORing in some data from memory instead of ADDing a constant 1. So instead of an immediate 1, the code has to load data[i] into a register as a source for xor. (Or load, then xor reg, data[i] / store). This is even more total memory operations than a histogram.
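
As a concrete baseline, a multi-slice scalar version of that idea might look like the following sketch (the function name and signature are just for illustration; the actual scalar version analyzed below lives on Godbolt and also unpacks a dword of indices, which is omitted here):

#include <stdint.h>
#include <stddef.h>

void xor_histo_scalar(const uint8_t *Indices, const uint64_t *data,
                      size_t len, uint64_t Out[256])
{
    uint64_t slices[4][256] = {{0}};     // 4 slices break the store/reload dep chains
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        slices[0][Indices[i+0]] ^= data[i+0];
        slices[1][Indices[i+1]] ^= data[i+1];
        slices[2][Indices[i+2]] ^= data[i+2];
        slices[3][Indices[i+3]] ^= data[i+3];
    }
    for (; i < len; i++)                 // cleanup when len isn't a multiple of 4
        slices[0][Indices[i]] ^= data[i];
    for (int j = 0; j < 256; j++)        // combine the slices; vectorizable with AVX2
        Out[j] = slices[0][j] ^ slices[1][j] ^ slices[2][j] ^ slices[3][j];
}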

We come out ahead by doing a "manual" gather/scatter into SIMD vectors (using movq / movhps loads/stores), which allows us to use SIMD for the data[i] load and XOR. This reduces the total number of load operations, and thus reduces load-port pressure without costing extra front-end bandwidth.

Manual gather into 256-bit vectors is probably not worth the extra shuffling (an extra vinserti128 / vextracti128 just so we can combine 2 memory-source vpxors into one 256-bit one). 128-bit vectors should be good. Front-end throughput is also a major issue, because (on Intel SnB-family CPUs) you want to avoid indexed addressing modes for the stores. gcc uses lea instructions to calculate addresses in registers, instead of using indexed loads/stores. clang / LLVM with -march=skylake decides not to, which is a bad decision in this case because the loop bottlenecks on port 2 / port 3, and spending extra ALU uops to allow store-address uops to use port 7 is a win. But if you're not bottlenecked on p23, spending extra uops to avoid indexed stores is not good. (And in cases where indexed loads can stay micro-fused, definitely don't spend uops just to avoid them; silly gcc.) Maybe gcc and LLVM's addressing-mode cost models aren't very accurate, or they don't model the pipeline in enough detail to figure out when a loop bottlenecks on the front-end vs. a specific port.

Choice of addressing-modes and other asm code-gen choices are critical for this to perform optimally on SnB-family. But writing in C gives you no control over that; you're mostly at the mercy of the compiler, unless you can tweak the source to get it to make a different choice. e.g. gcc vs. clang makes a significant difference here.

On SnB-family, a movhps load needs port 5 for the shuffle/blend (although it does micro-fuse into one uop), but a movhps store is a pure store with no ALU uop. So it's break-even there, and lets us use one SIMD load / XOR for two data elements.

With AVX, unaligned memory source operands are allowed for ALU uops, so we don't need to require alignment for data[]. But Intel HSW/SKL can keep an indexed addressing mode micro-fused with pxor but not vpxor. So compiling without AVX enabled can be better, allowing the compiler to use an indexed addressing mode instead of incrementing a separate pointer. (Or making it faster if the compiler doesn't know this and uses an indexed addressing mode anyway.) TL:DR: it's probably best to require 16-byte-aligned data[] and compile that function with AVX disabled, for better micro-fusion. (But then we miss out on 256-bit SIMD for combining the Out slices at the end, unless we put that in a different function compiled with AVX or AVX2.)

Avoiding unaligned loads will avoid any cache-line splits, too, which doesn't cost extra uops but we're probably close to bottlenecking on L1d throughput limits, not just load/store execution unit throughput limits.


I also looked at loading 4 indices at once and unpacking with ALU instructions. e.g. with memcpy into struct { uint8_t idx[4]; } idx;. But gcc generates multiple wasted instructions for unpacking that. Too bad x86 doesn't have great bitfield instructions like ARM ubfx or especially PowerPC rlwinm (which could leave the result left-shifted for free, so if x86 had that, a static Out could have used a base+disp32 addressing mode in non-PIC code.)

Unpacking a dword with shift / movzx from AL/AH is a win if we're using scalar XOR, but it looks like it's not when we're using SIMD for data[] and spending front-end throughput on lea instructions to allow store-address uops to run on port 7. That makes us front-end bottlenecked instead of port2/3 bottlenecked, so using 4x movzx loads from memory looks best according to static analysis. Would be worth benchmarking both ways if you take the time to hand-edit the asm. (The gcc-generated asm with extra uops is just bad, including a completely redundant movzx after right-shifting by 24, leaving the upper bits already zero.)


The code

(See it on the Godbolt compiler explorer, along with a scalar version):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <stdalign.h>

#ifdef IACA_MARKS
#include "/opt/iaca-3.0/iacaMarks.h"
#else
#define IACA_START
#define IACA_END
#endif

void hist_gatherscatter(unsigned idx0, unsigned idx1,
                       uint64_t Out0[256], uint64_t Out1[256],
                       __m128i vdata) {
    // gather load from Out[0][?] and Out[1][?] with movq / movhps
    __m128i hist = _mm_loadl_epi64((__m128i*)&Out0[idx0]);
    hist = _mm_castps_si128(   // movhps into the high half
               _mm_loadh_pi(_mm_castsi128_ps(hist), (__m64*)&Out1[idx1]));

    // xorps could bottleneck on port5.
    // Actually probably not, using __m128 the whole time would be simpler and maybe not confuse clang
    hist = _mm_xor_si128(hist, vdata);

    // scatter store with movq / movhps
    _mm_storel_epi64((__m128i*)&Out0[idx0], hist);
    _mm_storeh_pi((__m64*)&Out1[idx1], _mm_castsi128_ps(hist));
}

void ext(uint64_t*);

void xor_histo_avx(uint8_t *Indices, const uint64_t *data, size_t len)
{
    alignas(32) uint64_t Out[4][256] = {{0}};

    // optional: peel the first iteration and optimize away loading the old known-zero values from Out[0..3][Indices[0..3]].

    if (len < 4)   // not shown: cleanup for the last up-to-3 elements.
        return;

    for (size_t i = 0 ; i + 4 <= len ; i += 4) {
        IACA_START
        // attempt to hand-hold compiler into a dword load + shifts to extract indices
        // to reduce load-port pressure
        struct { uint8_t idx[4]; } idx;
#if 0
        memcpy(&idx, Indices+i, sizeof(idx));  // safe with strict-aliasing and possibly-unaligned
   //gcc makes stupid asm for this, same as for memcpy into a struct,
   // using a dword load into EAX (good),
   // then AL/AH for the first 2 (good)
   // but then redundant mov and movzx instructions for the high 2

   // clang turns it into 4 loads

/*
     //Attempt to hand-hold gcc into less-stupid asm
     //doesn't work: same asm as the struct
        uint32_t tmp;
        memcpy(&tmp, Indices+i, sizeof(tmp));  // mov eax,[mem]
        idx.idx[0] = tmp;     //movzx reg, AL
        idx.idx[1] = tmp>>8;  //movzx reg, AH
        tmp >>= 16;           //shr   eax, 16
        idx.idx[2] = tmp;     //movzx reg, AL
        idx.idx[3] = tmp>>8;  //movzx reg, AH
*/
#else
       // compiles to separate loads with gcc and clang
        idx.idx[0] = Indices[i+0];
        idx.idx[1] = Indices[i+1];
        idx.idx[2] = Indices[i+2];
        idx.idx[3] = Indices[i+3];
#endif

        __m128i vd = _mm_load_si128((const __m128i*)&data[i]);
        hist_gatherscatter(idx.idx[0], idx.idx[1], Out[0], Out[1], vd);

        vd = _mm_load_si128((const __m128i*)&data[i+2]);
        hist_gatherscatter(idx.idx[2], idx.idx[3], Out[2], Out[3], vd);
    }
    IACA_END


   // hand-hold compilers into a pointer-increment loop
   // to avoid indexed addressing modes.  (4/5 speedup on HSW/SKL if all the stores use port7)
    __m256i *outp = (__m256i*)&Out[0];
    __m256i *endp = (__m256i*)&Out[1][0];   // end of the first slice: only loop over Out[0]
    for (; outp < endp ; outp++) {
        outp[0] ^= outp[256/4*1];
        outp[0] ^= outp[256/4*2];
        outp[0] ^= outp[256/4*3];
    }
    // This part compiles horribly with -mno-avx, but does compile
    // because I used GNU C native vector operators on __m256i instead of intrinsics.

/*
    for (int i=0 ; i<256 ; i+=4) {
        // use loadu / storeu if Out isn't aligned
        __m256i out0 = _mm256_load_si256((const __m256i*)&Out[0][i]);
        __m256i out1 = _mm256_load_si256((const __m256i*)&Out[1][i]);
        __m256i out2 = _mm256_load_si256((const __m256i*)&Out[2][i]);
        __m256i out3 = _mm256_load_si256((const __m256i*)&Out[3][i]);
        out0 = _mm256_xor_si256(out0, out1);
        out0 = _mm256_xor_si256(out0, out2);
        out0 = _mm256_xor_si256(out0, out3);
        _mm256_store_si256((__m256i*)&Out[0][i], out0);
    }
*/

    //ext(Out[0]);  // prevent optimizing away the work
    asm("" :: "r"(Out) : "memory");
}

Compiled with gcc7.3 -std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx, and analyzed with IACA-3.0:

$ /opt/iaca-3.0/iaca xor-histo.iaca.o
Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
Analyzed File -  xor-histo.iaca.o
Binary Format - 64Bit
Architecture  -  SKL
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 5.79 Cycles       Throughput Bottleneck: FrontEnd
Loop Count:  22 (this is fused-domain uops.  It's actually 20, so a 5 cycle front-end bottleneck)
Port Binding In Cycles Per Iteration:
--------------------------------------------------------------------------------------------------
|  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
--------------------------------------------------------------------------------------------------
| Cycles |  2.0     0.0  |  3.0  |  5.5     5.1  |  5.5     4.9  |  4.0  |  3.0  |  2.0  |  3.0  |
--------------------------------------------------------------------------------------------------

DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3)
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion occurred
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of   |                    Ports pressure in cycles                         |      |
|  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
-----------------------------------------------------------------------------------------
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx edx, byte ptr [rdi+0x2]
|   1      |             |      |             |             |      |      | 1.0  |      | add rdi, 0x4
|   1      |             |      |             |             |      |      | 1.0  |      | add rsi, 0x20
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx eax, byte ptr [rdi-0x1]
|   1      |             | 1.0  |             |             |      |      |      |      | lea r12, ptr [rcx+r8*8]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi-0x3]
|   1      |             | 1.0  |             |             |      |      |      |      | lea rdx, ptr [r10+rdx*8]
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [r12]
|   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [r9+rax*8]
|   1      |             | 1.0  |             |             |      |      |      |      | lea r8, ptr [r11+r8*8]
|   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [r8]   # Wrong, 1 micro-fused uop on SKL
|   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x20]
|   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | movq qword ptr [r12], xmm0   # can run on port 7, IDK why IACA chooses not to model it there
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [r8], xmm0
|   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [rdx]
|   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [rax]  # Wrong, 1 micro-fused uop on SKL
|   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x10]
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movq qword ptr [rdx], xmm0
|   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [rax], xmm0
|   1*     |             |      |             |             |      |      |      |      | cmp rbx, rdi
|   0*F    |             |      |             |             |      |      |      |      | jnz 0xffffffffffffffa0
Total Num Of Uops: 29  (This is unfused-domain, and a weird thing to total up).

gcc8.1 on Godbolt uses a scaled-index addressing mode for pxor, using the same counter for Indices and data[], so that saves an add.

clang doesn't use LEA, and bottlenecks at 4 i values per 7 cycles, because none of the store uops can run on port 7.

The scalar version (still using 4 slices of Out[4][256]):

$ iaca.sh -mark 2 xor-histo.iaca.o                               
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - xor-histo.iaca.o
Binary Format - 64Bit
Architecture  - SKL
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 2
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 7.24 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 3.0    0.0  | 3.0  | 6.2    4.5  | 6.8    4.5  | 4.0  | 3.0  | 3.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
X - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov eax, dword ptr [rdi]
|   1    | 0.4       | 0.5 |           |           |     | 0.1 |     |     |    | add rdi, 0x4
|   1    |           | 0.7 |           |           |     | 0.3 |     |     |    | add rsi, 0x20
|   1*   |           |     |           |           |     |     |     |     |    | movzx r9d, al
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+r9*8-0x2040]
|   2^   |           | 0.3 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x20]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+r9*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
|   1*   |           |     |           |           |     |     |     |     |    | movzx edx, ah
|   1    |           |     |           |           |     | 0.4 | 0.6 |     |    | add rdx, 0x100
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
|   2^   | 0.6       | 0.2 | 0.5   0.5 | 0.5   0.5 |     | 0.2 | 0.1 |     |    | xor r9, qword ptr [rsi-0x18]
|   2    |           |     | 0.2       | 0.8       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
|   1*   |           |     |           |           |     |     |     |     |    | mov edx, eax   # gcc code-gen isn't great, but not as bad as in the SIMD loop.  No extra movzx, but not taking advantage of AL/AH
|   1    | 0.4       |     |           |           |     |     | 0.6 |     |    | shr eax, 0x18
|   1    | 0.8       |     |           |           |     |     | 0.2 |     |    | shr edx, 0x10
|   1    |           | 0.6 |           |           |     | 0.3 |     |     |    | add rax, 0x300
|   1*   |           |     |           |           |     |     |     |     |    | movzx edx, dl
|   1    | 0.2       | 0.1 |           |           |     | 0.5 | 0.2 |     |    | add rdx, 0x200
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
|   2^   |           | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.1 |     |    | xor r9, qword ptr [rsi-0x10]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+rax*8-0x2040]
|   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 0.6 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x8]
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rax*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
|   1    | 0.6       |     |           |           |     |     | 0.4 |     |    | cmp r8, rdi
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffff75
Total Num Of Uops: 33

The loop is 4 fused-domain uops shorter than what IACA counts, because it doesn't know that only SnB/IvB un-laminate indexed stores. HSW/SKL don't. Such stores still can't use port 7, though, so this won't get any better than ~6.5 cycles for 4 elements.

(And BTW, with naive handling of Indices[i], loading each one separately with movzx, you get 8 cycles for 4 elements, saturating ports 2 and 3. Even though gcc doesn't generate throughput-optimal code for unpacking the struct, the 4-byte load + unpack should be a net win by relieving some load-port pressure.)


Cleanup loop:

AVX2 really shines here: we loop over the lowest slice of the histogram, and XOR in the other slices. This loop is 8 front-end uops with 4 loads on Skylake, and should run at 1 iter per 2 clocks:

.L7:
    vmovdqa ymm2, YMMWORD PTR [rax+4096]
    vpxor   ymm0, ymm2, YMMWORD PTR [rax+6144]
    vmovdqa ymm3, YMMWORD PTR [rax]
    vpxor   ymm1, ymm3, YMMWORD PTR [rax+2048]
    vpxor   ymm0, ymm0, ymm1
    vmovdqa YMMWORD PTR [rax], ymm0
    add     rax, 32
    cmp     rax, rdx
    jne     .L7

I tried to reduce the uop count further by doing the XORs in one chain, but gcc insists on doing two vmovdqa loads and having to do one vpxor without a memory operand. (OoO exec will hide the latency of this tiny chain / tree of VPXOR so it doesn't matter.)


How would I use the scattering with AVX-512? Is there some scatter instruction that xors instead of overwrites?

No, you'd use a gather to get the old values, then SIMD XOR, then scatter the updated elements back to the locations they came from.

To avoid conflicts, you might want Out[8][256] so every vector element can use a different table. (Otherwise you have a problem if Indices[i+0] and Indices[i+4] were equal, because the scatter store would just store the highest vector element with that index.)

Scatter/gather instructions need a single base register, but you can simply add a vector of slice offsets like _mm512_setr_epi64(0, 256, 256*2, ...) after doing a vpmovzxbq zero-extending load of the indices.
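
A hedged sketch of that gather / XOR / scatter loop, assuming uint64_t Out[8][256] so each of the 8 vector lanes has its own slice (the function name is illustrative, cleanup and the final slice-combine are omitted, and this is untested; AVX-512 gather/scatter throughput may well make it slower than the SSE2 version above):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void xor_histo_avx512(const uint8_t *Indices, const uint64_t *data,
                      size_t len, uint64_t Out[8][256])
{
    // Per-lane offsets: lane k indexes into slice k of Out[8][256].
    const __m512i slice_offsets = _mm512_setr_epi64(0, 256, 2*256, 3*256,
                                                    4*256, 5*256, 6*256, 7*256);
    for (size_t i = 0; i + 8 <= len; i += 8) {
        __m128i idx8   = _mm_loadl_epi64((const __m128i*)&Indices[i]);  // 8 index bytes
        __m512i vindex = _mm512_add_epi64(_mm512_cvtepu8_epi64(idx8),   // vpmovzxbq
                                          slice_offsets);
        __m512i old    = _mm512_i64gather_epi64(vindex, &Out[0][0], 8); // gather old values
        __m512i vd     = _mm512_loadu_si512(&data[i]);
        __m512i newv   = _mm512_xor_si512(old, vd);
        _mm512_i64scatter_epi64(&Out[0][0], vindex, newv, 8);           // scatter updated values
    }
    // not shown: cleanup for the last len % 8 elements, and XOR-combining the 8 slices
}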


Notes

I used IACA2.3 for the scalar analysis because IACA3.0 seems to have removed the -mark option to choose which loop to analyze when you have multiple marks in one file. IACA3.0 didn't fix any of the ways that IACA2.3 is wrong about SKL's pipeline in this case.

Peter Cordes
  • More optimization is possible, e.g. by skipping the LEA to feed the `movq` load/store and letting that use indexed addressing modes. But Intel's IACA can't model that because it has wrong info about how Intel CPU work. :/ That could save 2 uops, but then the store couldn't use port 7, so maybe it'd be a loss. Possibly worth it to do *one* of the two `movq` loads/stores that way, and have 3 of the 4 stores use port 7 (the way IACA modeled it anyway). – Peter Cordes May 30 '18 at 13:21
  • Wow, thank you very much! I am very new to AVX2 programming, so it will take me some time to fully understand your answer though. – Ward Beullens May 30 '18 at 13:54
  • And yes, I am primarily interested in Haswell. – Ward Beullens May 30 '18 at 13:57
  • @WardBeullens: Note that the main loop only uses SSE2 intrinsics because that's better. For gcc7, compiling with `-march=haswell` to enable AVX doesn't hurt, but only helps for the cleanup loop. Hopefully you don't have to try to get clang to make nice code for this. And I spent most of the answer talking about *how* it compiled to asm, because in this case that's the important part. Refer to http://agner.org/optimize/ and [Micro fusion and addressing modes](https://stackoverflow.com/q/26046634) to look up any terms or other stuff you don't understand. – Peter Cordes May 30 '18 at 14:23
  • @WardBeullens: The point I meant to make was that this answer is a terrible intro to AVX2 or SIMD, because the microarchitectural details and compiler code-gen choices are the major obstacle without AVX512F. – Peter Cordes May 30 '18 at 14:42
  • @BeeOnRope: [Your unpacking idea](https://stackoverflow.com/questions/50583718/selectively-xor-ing-elements-of-a-list-with-avx2-instructions/50584204?noredirect=1#comment88240345_50584204) of merging into the low byte of an address doesn't work. A slice of `Out` is `256 * sizeof(uint64_t)`, so you need your indices scaled by 8 *after* unpacking. Your best bet is still `shr rax,16` / `movzx ecx, al` / `movzx edx, ah`, or some `pextr`. Too bad x86 doesn't have PowerPC bitfield extract/insert instructions to do your merge idea into bits `[10:3]` instead of `[7:0]` of an aligned `Out`. – Peter Cordes May 31 '18 at 02:07
  • Ryzen only has 2 AGUs. Max cache throughput is 2 ops per clock, up to one of them a store. You *definitely* want to use 128-bit vectors there. AFAIK there's no penalty for indexed addressing modes there; but Agner missed un-lamination on Intel so that's no guarantee. (He tested how they fit in the uop cache, instead of issue perf counters). – Peter Cordes May 31 '18 at 02:09
  • @PeterCordes - good point, I forgot about shifting the address! It adds a uop to shift probably with `lea`, at which point I think it ties the `movzx` approach (and is perhaps more susceptible to weirdness around merging the low byte and "false" dependency on the address reg). Still, it seems roughly competitive with the vectorized approach. It also seems promising to calculate the full addresses for the each access to `Out` in a vectorized way and then do the scalar `xor` RMW, but there just isn't a great way to unpack values in a vector register to scalar. – BeeOnRope May 31 '18 at 02:35
  • It would be nice to have a 4-uop operation that moved 4 64-bit values from a `ymm` into `rax` through `rdx` or whatever you wanted. – BeeOnRope May 31 '18 at 02:36
  • @BeeOnRope: Yeah, it just about ties for uop cost. movzx + lea per index is better than a byte-merge + shl or rorx (with a register with a right-shifted address), because the `movzx r32, r8` doesn't need an ALU port. And more importantly, avoids a false dep on the register holding the base address. I think merge into a 128-bit xmm register is pure win for front-end throughput, because scalar XOR needs (per index) 1 pure load, 1 fused xor+load, and 1 fused store. It doesn't matter whether the pure load is the data or the initial Out histo entry, or whether you use `xor m,r`. Fused pxor wins – Peter Cordes May 31 '18 at 03:14
  • Correct, 1 uop per address (I didn't mean otherwise). I wasn't going to use `rorx` but rather just `lea rdx, [rax*8]`. Agner lists `movzx r,r` as requiring a p0156 op on HSW and SKL, but the doc could be wrong. Yes about the dependency on the merge register. Yes the vector solution seems to be slightly faster based on the analysis above, but it is close: you'd have to hope there are no vector-specific bottlenecks on whatever platform you care about and that the scheduling is good (perhaps the latter is even more true of the scalar case, I'm not sure). – BeeOnRope May 31 '18 at 03:29
  • @BeeOnRope: I edited my prev comment after you started your reply, I think. I took out the "per address" because I realized it was obvious. Anyway, [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139): Agner is wrong, HSW/SKL definitely handle `movzx eax, cl` at rename time, with zero-latency. IDK why his docs say that IvB has it but HSW doesn't. Anyway yeah, I'm curious about actual benchmarks, but not enough to make up a fake data set with some store/reload repeats. – Peter Cordes May 31 '18 at 03:32
  • Interesting. Agner does specifically mention it in the microarch guide too: in Ivy Bridge he explicitly says 8-bit `movzx` can be eliminated, then in Skylake he says that `movzx` cannot be eliminated. He made a similar, but slightly inconsistent change in his instruction tables: in Ivy Bridge the `movzx r32/64,r8` version had its own line with "may be eliminated" but then this was removed in Haswell (inconsistent because the uarch guide implies the change happened in Skylake). – BeeOnRope May 31 '18 at 03:44
  • @BeeOnRope: No, Agner mostly copy/pasted the HSW text for SKL, instead of describing the changes relative to HSW. The HSW entry says the same thing "unlike IvB"... Perhaps because I once complained about following a chain of multiple "see back here" references from Core2 to Pentium M to PIII for something; I think he didn't have time to write a really nice version and erred on the side of copying instead of referencing. Recent updates have seemed rushed. – Peter Cordes May 31 '18 at 04:00
  • I see I was looking for `movzx` but that text didn't appear there. Anyways what's weird is that he clearly changed this, starting with "unlike" so somehow his test that measured elimination stopped showing that for HSW or some other error was made, or it doesn't work in the loop buffer, etc. – BeeOnRope May 31 '18 at 04:16
  • @PeterCordes FWIW, Intel optimization guide says that `movzx` does mov-eliminate, but only the `al` form, not the `ah` form. Really interesting example about recovering resources in example 3-25 that I saw you mention somewhere. – BeeOnRope Jun 01 '18 at 02:08
  • @BeeOnRope: That's correct and backed up by my experiments. AH/BH/CH/DH are special. `movzx` from high8 is worth using when throughput is the goal, not latency. `mov eax, edx` / `shr eax, 8` (or rorx) / `movzx ecx, al` has 1c latency, but `movzx ecx, dh` has 2c latency on HSW/SKL, but is still 1 uop. I mentioned example 3-25 in [Can x86's MOV really be "free"? Why can't I reproduce this at all?](https://stackoverflow.com/q/44169342), where I found that SKL does indeed do more mov-elimination, but with lower total throughput. (Because it has 4 integer ALU ports, unlike IvB's 3.) – Peter Cordes Jun 01 '18 at 02:20

You can sort the data according to indices[i]... This should take O(N*log2(N)), but that can be parallelized.

Then take the cumulative (prefix) xor of the sorted data, which can also be parallelized.

Then it's a matter of calculating Out[i] = CumXor(j_i) ^ CumXor(j_{i-1}), where j_i is the last position in the sorted order whose index equals i.
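
A minimal scalar sketch of this approach (no parallelism; the Pair struct, comparator and function name are just for illustration — the point is the run-boundary extraction after the prefix xor):

#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint8_t idx; uint64_t val; } Pair;

static int cmp_idx(const void *a, const void *b) {
    return (int)((const Pair *)a)->idx - (int)((const Pair *)b)->idx;
}

void xor_histo_sort(const uint8_t *Indices, const uint64_t *data,
                    size_t len, uint64_t Out[256])
{
    Pair *p = malloc(len * sizeof *p);
    if (!p) return;
    for (size_t i = 0; i < len; i++) { p[i].idx = Indices[i]; p[i].val = data[i]; }
    qsort(p, len, sizeof *p, cmp_idx);            // step 1: sort by index

    memset(Out, 0, 256 * sizeof Out[0]);
    uint64_t cum = 0, prev_cum = 0;               // step 2: running (prefix) xor
    for (size_t i = 0; i < len; i++) {
        cum ^= p[i].val;
        if (i + 1 == len || p[i+1].idx != p[i].idx) {  // end of a run of equal indices
            Out[p[i].idx] = cum ^ prev_cum;       // step 3: xor of just this run
            prev_cum = cum;
        }
    }
    free(p);
}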

Aki Suihkonen
  • Certainly, this is an interesting idea. Nevertheless, I would be surprised if this is any faster than the scalar solution, in practice. – wim May 29 '18 at 12:47
  • Yes, I don't believe that this can be faster than the scalar implementation in practice. – Ward Beullens May 29 '18 at 13:01
  • I could see this being useful if you sometimes make small changes to `data[]` and redo the calculation. If your sort records a mapping array, you can map an update to `data[i]` to `sorteddata[map[i]]`. Or maybe it would be easier to XOR the old data out of `Out[Indices[i]]` and then XOR in the new data, i.e. update the histogram on the fly when storing to the array. – Peter Cordes May 29 '18 at 13:15
  • The scalar solution probably runs in 1.5 to 2 cycles per element (limited by load throughput) for cached data and the memory access pattern (a single stream through the data) is optimal for prefetching. It will be very hard to come up with a sort that works on 100,000 elements in 1.5 - 2 cycles. – BeeOnRope May 29 '18 at 20:06
  • @BeeOnRope: Using manual gather/scatter into an XMM reg lets us use `pxor xmm0, data[i]`, saving load-port throughput. According to IACA, my version might run in ~5 to 5.5 cycles per 4 elements on HSW/SKL (after correcting for IACA's wrong data about `movhps` micro-fusion. Stupid buggy closed-source tools.) Anyway, that raises the bar even further for any other approach to compete with. – Peter Cordes May 30 '18 at 13:23
  • @PeterCordes - cool solution. A scalar solution could get pretty close. You could read 8 bytes of `Indices` at a time to reduce load port and AGU pressure at the cost of unpacking. For the unpacking, if your indices were in `rbx`, for example, and the pointer to the `Out` array in `rax`, if `rax` was 256-byte aligned, you could do some clever unpacking like `mov al, bl` and `mov al, bh`, which does the address addition and the "unpack" in a single instruction (then you'll need some shift for the next two bytes, etc). You want a single reg for the addressing, as you point out, to use port7. – BeeOnRope May 31 '18 at 00:58
  • Agner lists RMW `xor` as `2p0156 2p237 p4`, which seems like 5 total unfused uops, but his table only has it as 4 (2 fused). I think it's wrong and should be `p0156 2p237 p4` since there is no reason for 2 ALU ops. It's too bad the RMW can't re-use the AGU that has to happen for the load for the store as well, that would be a big win here and based on my limited understanding it seems possible. Anyways, it seems like you could get under 5 fused uops per operation: 2 for the `xor` RMW, 1 load for the `data[i]` read, plus X ops to calculate the the RMW address. – BeeOnRope May 31 '18 at 01:03
  • Where X is something like 8 `mov` (8-bit), 3 `shr` and 1 load per 8 elements, so about 1.5 uops average? So potentially the load port pressure is 2.125 loads/op if port7 AGU is used perfectly by every store component. So if everything pans out, maybe you could get to 4.5 fused uops per op, and so do 1.125 cycles per element limited by the 4-wide front-end. Ryzen could do slightly better if all timings are like Intel but being 5-wide (but I'm not sure about the byte-`mov` tricks on Ryzen, or even Intel for that matter). – BeeOnRope May 31 '18 at 01:06
  • @BeeOnRope: I replied to your comments [under my answer](https://stackoverflow.com/questions/50583718/selectively-xor-ing-elements-of-a-list-with-avx2-instructions/50584204?noredirect=1#comment88241174_50605632). – Peter Cordes May 31 '18 at 02:09