11

I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this produce the expected result?

My main concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into the register exactly as they are in memory?
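
In other words, something along these lines (just a sketch of what I have in mind; the buffer names are placeholders):

#include <immintrin.h>

/* a, b and out are placeholder names for 32-byte-aligned buffers
   holding the two 256-bit integers and the result. */
void xor256(const void *a, const void *b, void *out) {
    __m256 va = _mm256_load_ps((const float*)a);
    __m256 vb = _mm256_load_ps((const float*)b);
    _mm256_store_ps((float*)out, _mm256_xor_ps(va, vb));
}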

Thanks.

Kan Li
  • What happened when you tried it? – Andrew Morton Dec 17 '15 at 10:02
  • I don't know. But the problem is, unless I try all possible inputs, which is exponential, I won't be able to be sure that `_mm256_load_ps` will load the bits exactly as they are in memory, right? – Kan Li Dec 17 '15 at 19:18
  • "All possible inputs" in this context means 2^32 bit combinations, which is not as big of a deal as it may seem on any modern machine. Of course it is still a good idea to have a definitive answer (which has been given by now), and not rely on brute-force verification only. – void_ptr Dec 17 '15 at 21:44
  • @void_ptr: You can only brute-force test on a few specific hardware models. It's always a bad idea to decide something is ok in general just because it works on your machine, without any docs to support the reasoning. e.g. SSE loads/stores wider than 64b are *not* guaranteed to be atomic, but on many machines they are. On Pentium M, they're split into two separate 64b ops. On [multi-socket Opteron, they are extremely rarely not atomic](http://stackoverflow.com/a/7647825/224132). Similarly, some instructions happen to not modify flags on SnB, but the ISA says they're undefined. – Peter Cordes Dec 19 '15 at 09:27
  • Also, updated my answer to point out that moving data into vector regs just for xor isn't worth it if you need it back in integer registers for other stuff before you need to store back to memory. – Peter Cordes Dec 19 '15 at 09:30

3 Answers

13

First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have two numbers in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
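
For reference, a minimal sketch of that pure-integer version in C (the u256 struct is just an assumed four-limb representation, not something from the question):

#include <stdint.h>

/* Hypothetical representation of a 256b integer as four 64b limbs. */
typedef struct { uint64_t limb[4]; } u256;

static inline u256 xor256(u256 a, u256 b) {
    u256 r;
    r.limb[0] = a.limb[0] ^ b.limb[0];   /* four scalar 64b xor instructions */
    r.limb[1] = a.limb[1] ^ b.limb[1];
    r.limb[2] = a.limb[2] ^ b.limb[2];
    r.limb[3] = a.limb[3] ^ b.limb[3];
    return r;
}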

Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
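
If you did need to get two 64b halves from integer registers into a vector register without going through memory, this is roughly what the movq / pinsrq route looks like (just a sketch; the helper name is made up; pinsrq needs SSE4.1, which any AVX CPU has):

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: pack two 64b halves that are already in
   integer registers into one xmm register. */
static inline __m128i pack_halves(uint64_t lo, uint64_t hi) {
    __m128i v = _mm_cvtsi64_si128((long long)lo);    /* movq   xmm, r64 */
    return _mm_insert_epi64(v, (long long)hi, 1);    /* pinsrq xmm, r64, 1 */
}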


FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:

VADDPS

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS

SIMD Floating-Point Exceptions
None.

(From Intel's insn ref manual. See the wiki for links to that and other stuff.)

On Intel hardware, either flavour of load/store can go to the FP or integer domain without extra delay. AMD CPUs likewise behave the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.

Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.

No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is just movaps with a 66h prefix, the same way the double version of every other SSE instruction has the 66h prefix byte tacked on, or F2 instead of F3 for scalar instructions.)

Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)


Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):

256b operations with AVX xorps

VMOVDQU   ymm0, [A]
VXORPS    ymm0, ymm0, [B]
VMOVDQU   [result], ymm0
  • 3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)

  • load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.

    So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.

    Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.

Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
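
In intrinsics, a rough equivalent that compilers should turn into more or less the sequence above (A, B and result are the same placeholders as in the asm):

#include <immintrin.h>

void xor256_avx(const void *A, const void *B, void *result) {
    __m256 a = _mm256_loadu_ps((const float*)A);                    /* 256b load  */
    __m256 x = _mm256_xor_ps(a, _mm256_loadu_ps((const float*)B));  /* vxorps ymm */
    _mm256_storeu_ps((float*)result, x);                            /* 256b store */
}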


128b ops

VMOVDQU   xmm0, [A]
VMOVDQU   xmm1, [A+16]
VPXOR     xmm0, xmm0, [B]
VPXOR     xmm1, xmm1, [B+16]
VMOVDQU   [result],    xmm0
VMOVDQU   [result+16], xmm1

This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.


In registers

Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to produce an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)


Z Boson's hybrid idea:

VMOVDQU   ymm0, [A]
VMOVDQU   ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR     xmm0, xmm0, xmm4
VPXOR     xmm1, xmm1, xmm5
VMOVDQU   [res],    xmm0
VMOVDQU   [res+16], xmm1

Even more fused-domain uops (8) than just doing 128b-everything.

Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.

ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).

So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.

Peter Cordes
  • Great answer! That's some nice-looking assembly. It's interesting that it looks better than intrinsics. I am mildly surprised that my hybrid method has a better throughput than the pure SSE version. – Z boson Dec 18 '15 at 07:49
  • You might want to consider editing your answer to include this answer http://stackoverflow.com/questions/6678073/difference-between-movdqa-and-movaps-x86-instructions – Z boson Dec 18 '15 at 07:52
  • @Zboson: Yeah, not a bad idea. The assembly looks super tidy because I didn't include anything about the inputs, and used placeholders for A, B, and result, instead of a comment about which register was pointing where. ASM mnemonics are WAY more readable than the super-long names used for intrinsics. Intrinsic names suck most of the time. They're way longer, but still have bits you need to decode (like epu8 vs. epi32). And for full details on exactly what they do, you need to look them up by asm mnemonic anyway. I'd be happier with `_mm_pshufb(...)`. – Peter Cordes Dec 18 '15 at 09:05
  • Maybe they wanted to keep the `B`yte, `W`ord, `D`word, `Q`word stuff out of C, where people would be confused by the fact that a "word" on a 32bit / 64bit machine like x86 is 16 bits. – Peter Cordes Dec 18 '15 at 09:07
3

There is no problem using _mm256_load_ps to load integers. In fact, in this case it's better than using _mm256_load_si256 (which also works with AVX), because with _mm256_load_ps you stay in the floating-point domain.

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    /* The loads and vxorps just copy/XOR the bit patterns; they never
       raise FP exceptions, even for patterns that aren't valid floats. */
    __m256 a8 = _mm256_loadu_ps((float*)a);
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);
    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x %x %x %x %x\n", c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
}

If you want to stay in the integer domain you could do

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    /* 256b loads/stores are AVX1; only 256b integer ALU ops need AVX2,
       so split into 128b halves and XOR with pxor. */
    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m128i a8lo = _mm256_castsi256_si128(a8);       /* low half: no instruction */
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);  /* high half: vextractf128 */
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x %x %x %x %x\n", c[0], c[1], c[2], c[3], c[4], c[5], c[6], c[7]);
}

The _mm256_castsi256_si128 intrinsic is free: it doesn't generate any instructions.

Z boson
  • IIRC, Intel CPUs don't have any penalty for using integer load -> FP instruction or vice versa. I forget if AMD CPUs do. On AMD CPUs, `xor_ps` runs in the integer domain, though. – Peter Cordes Dec 17 '15 at 20:10
  • @PeterCordes, if there is no penalty then what's the point in having integer and float load instructions? There could be one type-less load instruction then. – Z boson Dec 17 '15 at 20:22
  • When they designed it, they probably had in mind the possibility of a design where there was less latency. In practice, they ended up making CPUs where both kinds of loads had equal latency. AMD is the same: according to Agner Fog's microarch guide, Bulldozer-family CPUs have the same 6c latency from load-domain to FP-domain or to ivec-domain. For stores, it likewise doesn't matter which kind of store instruction is used. This is probably a case of just not planning well for the future. And IDK why they introduced `movdqa` with SSE2, instead of just leaving `movaps`. – Peter Cordes Dec 17 '15 at 21:10
  • Actually, in http://stackoverflow.com/questions/6678073/difference-between-movdqa-and-movaps-x86-instructions, Stephen Canon points out that the difference is more likely to manifest in the reg-reg form than in the load/store form. Before SnB, this was a real issue, and using the wrong mov xmm,xmm instruction on Nehalem *does* cause a bypass delay. We were just missing this issue because we were only thinking of their use as loads/stores. – Peter Cordes Dec 17 '15 at 21:15
  • @PeterCordes, great link, thanks! But then what's the point of single and double floating point load instructions? – Z boson Dec 18 '15 at 07:45
1

You will probably find that there is little or no difference in performance compared to using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SB/IB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
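
For example, something along these lines (just a rough sketch, with placeholder buffer names):

#include <emmintrin.h>

/* a, b and out are placeholder names for 16-byte-aligned, 32-byte buffers. */
void xor256_sse2(const void *a, const void *b, void *out) {
    __m128i lo = _mm_xor_si128(_mm_load_si128((const __m128i*)a),
                               _mm_load_si128((const __m128i*)b));
    __m128i hi = _mm_xor_si128(_mm_load_si128((const __m128i*)a + 1),
                               _mm_load_si128((const __m128i*)b + 1));
    _mm_store_si128((__m128i*)out, lo);
    _mm_store_si128((__m128i*)out + 1, hi);
}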

Paul R
  • Oh, is a throughput of 0.33 better than 1? I was thinking the higher the throughput, the better. – Kan Li Dec 17 '15 at 10:38
  • @icando: yes, 0.33 means up to three instructions per clock, so your two `_mm_xor_si128` instructions should be able to execute in the same clock cycle, in theory, assuming there are no other dependencies. (See page 3 of Agner Fog's "Instruction Tables": **Reciprocal Throughput**). – Paul R Dec 17 '15 at 10:53
  • The problem is, if I use 256-bit instructions, it takes two loads, one _mm256_xor_ps and one store, whereas with 128-bit instructions it takes four loads, two _mm_xor_si128 and two stores. The benefit of _mm_xor_si128 being faster than _mm256_xor_ps might be canceled out by the extra loads and stores. I am still interested in my original question. – Kan Li Dec 17 '15 at 19:15
  • @icando: If values aren't already in registers, 256b ops are better. 256b loads/stores have somewhat better theoretical max throughput on SnB (because of a lack of a dedicated store AGU). If values are in registers, max throughput comes from 2x `VPXOR xmm` and 1x `XORPS ymm` per clock, but unpacking / repacking to do this isn't worth it. If your data naturally comes in chunks of 256b, AVX 256b ops sound like a good idea to me, because that will be fewer total uops, so the out-of-order machinery will "see farther". – Peter Cordes Dec 17 '15 at 21:23
  • Note that 256b ops are slow for the first few thousand clock cycles or something, until the CPU decides to stop emulating them and bring the upper halves of the execution units out of power-saving mode or whatever it is. See http://www.agner.org/optimize/blog/read.php?i=142#378 for some discussion about warm-up times for 256b ops. – Peter Cordes Dec 17 '15 at 21:24
  • @icando: you could be right, but more instructions doesn't necessarily mean slower - it will be interesting to try it both ways and benchmark though, but I suspect there will be little difference. – Paul R Dec 17 '15 at 21:26
  • @PaulR: See my answer: If loads/stores are needed, then the 256b XOR unit on port5 is only needed once per 2 cycles, so nowhere near saturating. Even if one of the values is already loaded in memory, or you don't need to store the result, you're unlikely to bottleneck on port5. If a lot of values stay live in registers, then mixing 256b xorps and 128b pxor could be good, but unpacking/repacking isn't worth it. Anyway, I think I'm just repeating myself at this point. >.< **You're right, benchmarks from the OPs actual use-case are the way to decide.** – Peter Cordes Dec 17 '15 at 21:33