23

I want to vectorize the multiplication of two memory aligned arrays. I didn't find any way to multiply 64*64 bit in AVX/AVX2, so I just did loop-unroll and AVX2 loads/stores. Is there a faster way to do this?

Note: I don't want to save the high-half result of each multiplication.

void multiply_vex(long *Gi_vec, long q, long *Gj_vec){

    int i;
    __m256i data_j, data_i;
    __uint64_t *ptr_J = (__uint64_t*)&data_j;
    __uint64_t *ptr_I = (__uint64_t*)&data_i;


    for (i=0; i<BASE_VEX_STOP; i+=4) {
        data_i = _mm256_load_si256((__m256i*)&Gi_vec[i]);
        data_j = _mm256_load_si256((__m256i*)&Gj_vec[i]);

        ptr_I[0] -= ptr_J[0] * q;
        ptr_I[1] -= ptr_J[1] * q;
        ptr_I[2] -= ptr_J[2] * q;
        ptr_I[3] -= ptr_J[3] * q;

        _mm256_store_si256((__m256i*)&Gi_vec[i], data_i);
    }


    for (; i<BASE_DIMENSION; i++)
        Gi_vec[i] -= Gj_vec[i] * q;
}

UPDATE: I'm using the Haswell microarchitecture with both ICC/GCC compilers. So both AVX and AVX2 is fine. I substitute the -= by the C intrisic _mm256_sub_epi64 after the multiplication loop-unroll, where it get some speedup. Currently, it is ptr_J[0] *= q; ...

I use __uint64_t but is a error. The right data type is __int64_t.

  • 2
    If you do it this way you are taking a huge penalty of moving from simd register to alu register. Just doesn't worth it. – user3528438 May 18 '16 at 10:12
  • gcc generates Karatsuba-ish code using 3 32x32→64 multiplications, 3 32-bit-shifts and two adds. Looks to be fairly good for ILP, too. – EOF May 18 '16 at 12:21
  • 2
    @EOF It isn't quite Karatsuba-ish. The 64x64 to low 64-bit multiply doesn't need the top half. So you don't need the high x high multiply at all. That leaves the other 3. – Mysticial May 18 '16 at 15:12
  • @EOF: I only see two 64bit shifts (with a count of 32) in gcc5.3's auto-vectorized output. There are two adds, though. (The sub in the output is from the `-=`). Anyway, on Haswell, it's potentially slower than an unrolled hand-tuned scalar asm loop (load/`imul`/`sub`-with-memory-dest), bottlenecking on load/store throughput after unrolling. With C intrinsics and compiler output, you're probably going to get faster code with vectors, though. – Peter Cordes May 19 '16 at 10:29
  • @davmac: for gcc, `__m256i` is defined with a "may alias" attribute, which is why it does what the user expects with gcc. You're absolutely right that it's not safe or a good idea in general. A union with a struct is a better idea. – Peter Cordes May 19 '16 at 10:45
  • [This](http://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit) may be of interest. – Z boson May 19 '16 at 11:25
  • 2
    It's not clear to me if you want a AVX or AVX2 solution (or both). It makes a big difference. – Z boson May 19 '16 at 11:39
  • @user3528438 Thank you for the note. @PeterCordes I substitute the `-=` by the C intrisic `_mm256_sub_epi64` after the multiplication loop-unroll. Right now I have `ptr_J[0] *= q;`. I get a speedup from this. @Zboson I'm using a Haswell microarchitecture, so I can use both. – Hélder Gonçalves May 19 '16 at 14:21
  • 1
    @HélderGonçalves: you'll get a bigger speedup from letting gcc auto-vectorize the whole thing, or using the intrinsics code from my answer. Or even from plain scalar code without vectorization. Your code still has to extract and insert into the ymm vector. (**It's probably a slowdown. Did you test vs. a baseline?**) Also, don't use `__int64_t` directly. Just `#include ` and use `int64_t` so your code is portable. Fortunately, the low 64bits result of a 64bx64b multiply is the same whether the input is signed or unsigned, so the code gives identical results either way. – Peter Cordes May 19 '16 at 15:39

2 Answers2

20

You seem to be assuming long is 64bits in your code, but then using __uint64_t as well. In 32bit, the x32 ABI, and on Windows, long is a 32bit type. Your title mentions long long, but then your code ignores it. I was wondering for a while if your code was assuming that long was 32bit.

You're completely shooting yourself in the foot by using AVX256 loads but then aliasing a pointer onto the __m256i to do scalar operations. gcc just gives up and gives you the terrible code you asked for: vector load and then a bunch of extract and insert instructions. Your way of writing it means that both vectors have to be unpacked to do the sub in scalar as well, instead of using vpsubq.

Modern x86 CPUs have very fast L1 cache that can handle two operations per clock. (Haswell and later: two loads and one store per clock). Doing multiple scalar loads from the same cache line is better than a vector load and unpacking. (Imperfect uop scheduling reduces the throughput to about 84% of that, though: see below)


gcc 5.3 -O3 -march=haswell (Godbolt compiler explorer) auto-vectorizes a simple scalar implementation pretty well. When AVX2 isn't available, gcc foolishly still auto-vectorizes with 128b vectors: On Haswell, this will actually be about 1/2 the speed of ideal scalar 64bit code. (See the perf analysis below, but substitute 2 elements per vector instead of 4).

#include <stdint.h>    // why not use this like a normal person?
#define BASE_VEX_STOP 1024
#define BASE_DIMENSION 1028

// restrict lets the compiler know the arrays don't overlap,
// so it doesn't have to generate a scalar fallback case
void multiply_simple(uint64_t *restrict Gi_vec, uint64_t q, const uint64_t *restrict Gj_vec){
    for (intptr_t i=0; i<BASE_DIMENSION; i++)   // gcc doesn't manage to optimize away the sign-extension from 32bit to pointer-size in the scalar epilogue to handle the last less-than-a-vector elements
        Gi_vec[i] -= Gj_vec[i] * q;
}

inner loop:

.L4:
    vmovdqu ymm1, YMMWORD PTR [r9+rax]        # MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    add     rcx, 1    # ivtmp.30,
    vpsrlq  ymm0, ymm1, 32      # tmp174, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B],
    vpmuludq        ymm2, ymm1, ymm3        # tmp173, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B], vect_cst_.25
    vpmuludq        ymm0, ymm0, ymm3        # tmp176, tmp174, vect_cst_.25
    vpmuludq        ymm1, ymm4, ymm1        # tmp177, tmp185, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    vpaddq  ymm0, ymm0, ymm1    # tmp176, tmp176, tmp177
    vmovdqa ymm1, YMMWORD PTR [r8+rax]        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
    vpsllq  ymm0, ymm0, 32      # tmp176, tmp176,
    vpaddq  ymm0, ymm2, ymm0    # vect__13.24, tmp173, tmp176
    vpsubq  ymm0, ymm1, ymm0    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
    vmovdqa YMMWORD PTR [r8+rax], ymm0        # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
    add     rax, 32   # ivtmp.32,
    cmp     rcx, r10  # ivtmp.30, bnd.14
    jb      .L4 #,

Translate that back to intrinsics if you want, but it's going to be a lot easier to just let the compiler autovectorize. I didn't try to analyse it to see if it's optimal.

If you don't usually compile with -O3, you could use #pragma omp simd before the loop (and -fopenmp).

Of course, instead of a scalar epilogue, it would prob. be faster to do an unaligned load of the last 32B of Gj_vec, and store into the last 32B of Gi_vec, potentially overlapping with the last store from the loop. (A scalar fallback is still needed if the arrays are smaller than 32B.)


Improved vector intrinsic version for Haswell

From my comments on Z Boson's answer. Based on Agner Fog's vector class library code.

Agner Fog's version saves an instruction but bottlenecks on the shuffle port by using phadd + pshufd where I use psrlq / paddq / pand.

Since one of your operands is constant, make sure to pass set1(q) as b, not a, so the "bswap" shuffle can be hoisted.

// replace hadd -> shuffle (4 uops) with shift/add/and (3 uops)
// The constant takes 2 insns to generate outside a loop.
__m256i mul64_haswell (__m256i a, __m256i b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b,0xB1);           // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a,bswap);            // 32 bit L*H products

    // or use pshufb instead of psrlq to reduce port0 pressure on Haswell
    __m256i prodlh2 = _mm256_srli_epi64(prodlh, 32);          // 0  , a0Hb0L,          0, a1Hb1L
    __m256i prodlh3 = _mm256_add_epi32(prodlh2, prodlh);      // xxx, a0Lb0H+a0Hb0L, xxx, a1Lb1H+a1Hb1L
    __m256i prodlh4 = _mm256_and_si256(prodlh3, _mm256_set1_epi64x(0x00000000FFFFFFFF)); // zero high halves

    __m256i prodll  = _mm256_mul_epu32(a,b);                  // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll,prodlh4);       // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
}

See it on Godbolt.

Note that this doesn't include the final subtract, only the multiply.

This version should perform a bit better on Haswell than gcc's autovectorized version. (like maybe one vector per 4 cycles instead of one vector per 5 cycles, bottlenecked on port0 throughput. I didn't consider other bottlenecks for the full problem, since this was a late addition to the answer.)

An AVX1 version (two elements per vector) would suck, and probably still be worse than 64bit scalar. Don't do it unless you already have your data in vectors, and want the result in a vector (extracting to scalar and back might not be worth it).


Perf analysis of GCC's autovectorized code (not the intrinsic version)

Background: see Agner Fog's insn tables and microarch guide, and other links in the tag wiki.

Until AVX512 (see below), this is probably only barely faster than scalar 64bit code: imul r64, m64 has a throughput of one per clock on Intel CPUs (but one per 4 clocks on AMD Bulldozer-family). load/imul/sub-with-memory-dest is 4 fused-domain uops on Intel CPUs (with an addressing mode that can micro-fuse, which gcc fails to use). The pipeline width is 4 fused-domain uops per clock, so even a large unroll can't get this to issue at one-per-clock. With enough unrolling, we'll bottleneck on load/store throughput. 2 loads and one store per clock is possible on Haswell, but stores-address uops stealing load ports will lower the throughput to about 81/96 = 84% of that, according to Intel's manual.

So perhaps the best way for Haswell would load and multiply with scalar, (2 uops), then vmovq / pinsrq / vinserti128 so you can do the subtract with a vpsubq. That's 8 uops to load&multiply all 4 scalars, 7 shuffle uops to get the data into a __m256i (2 (movq) + 4 (pinsrq is 2 uops) + 1 vinserti128), and 3 more uops to do a vector load / vpsubq / vector store. So that's 18 fused-domain uops per 4 multiplies (4.5 cycles to issue), but 7 shuffle uops (7 cycles to execute). So nvm, this is no good compared to pure scalar.


The autovectorized code is using 8 vector ALU instructions for each vector of four values. On Haswell, 5 of those uops (multiplies and shifts) can only run on port 0, so no matter how you unroll this algorithm it will achieve at best one vector per 5 cycles (i.e. one multiply per 5/4 cycles.)

The shifts could be replaced with pshufb (port 5) to move the data and shift in zeros. (Other shuffles don't support zeroing instead of copying a byte from the input, and there aren't any known-zeros in the input that we could copy.)

paddq / psubq can run on ports 1/5 on Haswell, or p015 on Skylake.

Skylake runs pmuludq and immediate-count vector shifts on on p01, so it could in theory manage a throughput of one vector per max(5/2, 8/3, 11/4) = 11/4 = 2.75 cycles. So it bottlenecks on total fused-domain uop throughput (including the 2 vector loads and 1 vector store). So a bit of loop unrolling will help. Probably resource conflicts from imperfect scheduling will bottleneck it to slightly less than 4 fused-domain uops per clock. The loop overhead can hopefully run on port 6, which can only handle some scalar ops, including add and compare-and-branch, leaving ports 0/1/5 for vector ALU ops, since they're close to saturating (8/3 = 2.666 clocks). The load/store ports are nowhere near saturating, though.

So, Skylake can theoretically manage one vector per 2.75 cycles (plus loop overhead), or one multiply per ~0.7 cycles, vs. Haswell's best option (one per ~1.2 cycles in theory with scalar, or one per 1.25 cycles in theory with vectors). The scalar one per ~1.2 cycles would probably require a hand-tuned asm loop, though, because compilers don't know how to use a one-register addressing mode for stores, and a two-register addressing mode for loads (dst + (src-dst) and increment dst).

Also, if your data isn't hot in L1 cache, getting the job done with fewer instructions lets the frontend get ahead of the execution units, and get started on the loads before the data is needed. Hardware prefetch doesn't cross page lines, so a vector loop will probably beat scalar in practice for large arrays, and maybe even for smaller arrays.


AVX-512DQ introduces a 64bx64b->64b vector multiply

gcc can auto-vectorize using it, if you add -mavx512dq.

.L4:
    vmovdqu64       zmm0, ZMMWORD PTR [r8+rax]    # vect__11.23, MEM[base: vectp_Gj_vec.22_86, index: ivtmp.32_76, offset: 0B]
    add     rcx, 1    # ivtmp.30,
    vpmullq zmm1, zmm0, zmm2  # vect__13.24, vect__11.23, vect_cst_.25
    vmovdqa64       zmm0, ZMMWORD PTR [r9+rax]    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B]
    vpsubq  zmm0, zmm0, zmm1    # vect__14.26, MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__13.24
    vmovdqa64       ZMMWORD PTR [r9+rax], zmm0    # MEM[base: vectp_Gi_vec.19_81, index: ivtmp.32_76, offset: 0B], vect__14.26
    add     rax, 64   # ivtmp.32,
    cmp     rcx, r10  # ivtmp.30, bnd.14
    jb      .L4 #,

So AVX512DQ (expected to be part of Skylake multi-socket Xeon (Purley) in ~2017) will give a much larger than 2x speedup (from wider vectors) if these instructions are pipelined at one per clock.

Update: Skylake-AVX512 (aka SKL-X or SKL-SP) runs VPMULLQ at one per 1.5 cycles for xmm, ymm, or zmm vectors. It's 3 uops with 15c latency. (With maybe an extra 1c of latency for the zmm version, if that's not a measurement glitch in the AIDA results.)

vpmullq is much faster than anything you can build out of 32-bit chunks, so it's very much worth having an instruction for this even if current CPUs don't have 64-bit-element vector-multiply hardware. (Presumably they use the mantissa multipliers in the FMA units.)

Toby Speight
  • 27,591
  • 48
  • 66
  • 103
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • This is a very helpful post. While this site does not encourage questions to ask about recommended resources, perhaps a comment about *Where can I learn about this stuff step-by-step in a tutorial way?* could be much appreciated. – Matsmath May 19 '16 at 10:30
  • 1
    @Matsmath: No worries. updated to link to Agner Fog's guides and the x86 tag wiki. I learned this stuff by reading Agner Fog's guides, and reading the http://realworldtech.com/ forums and microarchitecture writeups. (Just reading, though; I've only posted once or twice there). If you don't know how a simple pipelined CPU works, maybe read about RISC pipelines. – Peter Cordes May 19 '16 at 10:36
  • I think you could be more precise when you write "Modern Intel CPUs". Maybe "Since Haswell (2013)..." Because before Haswell Intel CPUs could not do two SIMD loads and one SIMD store per cycle. Very few people at my work have anything as modern as Haswell. – Z boson May 19 '16 at 11:23
  • @Zboson: That was a sloppy edit after I noticed that the algorithm was a read-modify-write. (`-=` not `=`). Updated to apply to Sandybridge-family and AMD Bulldozer-family. – Peter Cordes May 19 '16 at 11:27
  • 1
    This may be iteresting to you. For 128bx128b to 128b (lower) for unsigned you only need 32bx32b->64b. I mean 64bx64b ->64b (lower) does not help. But for signed 128bx128b to 128b (lower) you need a few more instructions if you don't have 64bx64b ->64b (lower). That's one reason why `vpmullq` (if it is not to slow) could be interesting. – Z boson May 19 '16 at 11:34
  • Oh, never mind, I forgot that `mulx` will still beat SIMD 128-bit mul anyway so `vpmullq` is probably only interesting for 64bx64b ->64b (lower). – Z boson May 19 '16 at 11:46
  • @Zboson: fun fact: `mulx` has 4 cycle latency vs. 3 cycle latency for `mul`. IDK why. They're both 2 uops (p1+p5 vs. p1+p6). I keep forgetting that `mulx` produces a 128b result, though. But of course it does, that's what makes it useful to interleave with `adc`. (And otherwise its only purpose would be a 3-operand non-destructive `imul r,r/m`: useful in this case to save one fused-domain uop for load+mul.) – Peter Cordes May 19 '16 at 12:34
  • Agner Fog's tables place `[V]PS[L/R]LDQ` on port 5 on Haswell, staying out of the way of the multiplies. – EOF May 19 '16 at 12:43
  • @EOF: That's the byte-shift instruction (for one 128b element). We need to shift in zeros into the high 32 of each 64bit element. `pshufb` will do the trick, but `psrldq` won't. – Peter Cordes May 19 '16 at 12:47
  • @PeterCordes Fair point. On the other hand `[V]PAND` can run on ports 0,1,5 and is has a single cycle latency. – EOF May 19 '16 at 12:50
  • @EOF: We're bottlenecked on fused-domain uop throughput on Skylake, so more instructions hurts there. Using a shuffle-immediate + AND isn't worth it to avoid needing a 16B `pshufb` shuffle control mask. Even on Haswell, an extra `vpand` will probably steal some port0 cycles, reducing total throughput. – Peter Cordes May 19 '16 at 13:30
  • So the "moral of the story" is that for AVX you can't beat 64-bit scalar mult but with AVX2 you can maybe marginally beat 64-bit scalar mult? – Z boson May 20 '16 at 06:03
  • @Zboson: I think the intrinsic AVX2 version is a bit faster than scalar on Haswell. Both versions will be much faster on Skylake. Also note that getting a compiler to make the "ideal" scalar code is probably really hard, so the penalty for compilers being dumb will be worse than for vectors. It requires one-register addressing modes for the dst, but not src. With a big unroll, wasting an insn on src inc is ok. An AVX version on IvyBridge is not quite as bad as Haswell, but prob. still worse than scalar. `pmulld` is only 1 uop there. Emulating the shift with `pshufb` would still help. – Peter Cordes May 20 '16 at 06:28
  • 1
    I have a very strong feeling that `vpmullq` is not going to be 1 per clock. (2 if you include both ports.) The existence of the IFMA52 instructions basically says that the chip is not going to have a native SIMD 64-bit multiplier. And to be fair, going full 64-bit multipliers in the SIMD unit is lot of extra area. – Mysticial Jun 13 '16 at 18:49
  • 2
    @Zboson [Some latency #'s are out for Skylake Purley.](https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt) As I expected, `vpmullq` is not 1 uop. It appears to be 3 with a latency of about 15 cycles (3 x 5 cycle multiply). I actually [predicted it could be 3 a while back.](https://stackoverflow.com/questions/41403718/can-i-use-the-avx-fma-units-to-do-bit-exact-52-bit-integer-multiplications#comment70110790_41403718) – Mysticial Jun 30 '17 at 19:13
  • @Mysticial: Thanks! Those numbers don't say what port instructions run on, so you can't tell which throughput=1 instructions conflict with each other. I did some testing a while ago on a Google Cloud SKL-AVX512 VM. I think my conclusion was that mask instructions (`kand` / `kunpck` etc.) run on port5, conflicting with shuffles, but that was a while ago and GCE doesn't expose perf counters so it was guesswork. I might be mis-remembering, and I don't remember if I made notes. – Peter Cordes Jun 30 '17 at 23:08
  • @Mysticial: BTW, [Anandtech says](http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/3) the 6 and 8-core part will have half-throughput for AVX512 FMA, with some weird execution-unit sharing, and having an FMA unit on port5. (Maybe not execution *port* sharing though; hopefully ZMM FMA is still only 1 uop and can issue at the same time as a non-FP instruction for port0 or port1) That report confirms that the 10-core part has full-throughput for 512b FMA, but doesn't shed any light on the execution-port/unit possible weirdness. – Peter Cordes Jun 30 '17 at 23:12
  • 1
    @PeterCordes [Here's a benchmark that shows an ES 6-core having full-throughput AVX512.](http://www.overclock.net/t/1632638/just-a-heads-up-for-anyone-buying-x299-if-you-want-full-avx-capability-you-need-to-buy-the-7900x-and-up/20#post_26181442) But I have yet to find out if retail 6 and 8-cores have half or full-throughput. I literally just finished assembling a 7900X system which is sitting on my kitchen counter right now running updates. So I have yet to run any proper tests on it. But that won't answer the question of whether the 6 and 8 cores will have full-throughput AVX512. – Mysticial Jul 01 '17 at 04:33
  • Right away I can tell that most of my AVX512 code has some severe bottlenecks from yet-to-be-determined sources. But I don't want to run too many tests while the system is updating since there seems to be a hint of instability that I'll need to sort out later. The motherboard seems to be overclocking it to 4 GHz without me asking it to do so. 4 GHz on all cores is a bit too high to be the normal turbo. I'm gonna be busy this weekend. – Mysticial Jul 01 '17 at 04:35
  • @Mysticial: Looking forward to finding out the full details once the dust settles. :) Hopefully someone will give Agner Fog remote access to a system so he can update his insn tables, too. – Peter Cordes Jul 01 '17 at 04:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148101/discussion-between-mysticial-and-peter-cordes). – Mysticial Jul 01 '17 at 04:43
  • @Mysticial from what I read on [anandtech](http://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested) the i7-7800X and i7-7820X don't increase the throughput of AVX512 compared to AVX. They split a 512-bit operation between two 256-bit ports (similar to how AMD splits AVX between two 128-bit ports). However, the 7900X has a dedicated AVX512 port so it does double the throughput. Though it seems it by be complicated to optimized with one 512-bit port and two 256-bit ports to get max throughput (two 512-bit ports would be simpler). – Z boson Jul 03 '17 at 06:38
  • @Zboson: My guess is that they have the weird multi-port stuff so flaws in the silicon somewhere in an FMA unit let them still use that core with full throughput for 256b and 128b FMA. So they can fuse off a couple whole cores, *and* parts of the FMA units in some other cores, and sell it as an 8-core i7-7820X. On the high-core and extreme-core silicon, where all the SKUs must have full-throughput FMA units, they might (I hope) just widen the FMA units on p0/p1 to 512b, since any flaw anywhere in an FMA unit means the core is ruined. – Peter Cordes Jul 03 '17 at 06:45
  • 1
    @Zboson If you read the chat transcript between me and Peter, you'll notice that I've been observing a sort of "AVX512 throttle" that kicks in when the TDP is exceeded. I strongly suspect that this involves turning off that port5 dedicated FMA. I don't have enough evidence to confirm this theory, but so far everything I've seen agrees with it. – Mysticial Jul 03 '17 at 14:45
  • @PeterCordes They bin them differently. If there's anything wrong with the core (bad FMA, low clock speeds), they simply disable the entire core and sell it a lower core-count part. Every single activated core on these chips are required to hit a minimum of 4.3 GHz. That's high enough that there's almost no room for any errors. – Mysticial Jul 03 '17 at 14:53
  • @Mysticial: My thinking is that the routing from ports to actual FMA execution units could be chosen after testing for defects. With 4 x 256b "lanes" of FMA units, only 2 of them need to be defect-free for the core to be usable. p0 and p1 get wired up to lanes that work. If all 4 work, then that core can be used in a full-FMA-throughput SKU with p5 wired too. This design would let them fuse off defects with smaller granularity than at the core level. If they weren't doing something like this, why the heck would they complicate things instead of just widening the p0 / p1 FMAs to 512b? – Peter Cordes Jul 03 '17 at 15:26
  • @PeterCordes I'm tempted to think that it's not that simple to "simply rewire the working 256-bit lanes". The logic there is probably not trivial. Since each 256-bit lane would need to be designed to be either the bottom or top half of the 512-bit unit. And that probably complicates a lot of the routing as well. The way it's designed now, that port5 FMA is the one and only "optional" feature that can be disabled without turning off the entire core. And based on the throttling that I'm seeing, they made it run-time toggleable. – Mysticial Jul 03 '17 at 15:35
  • @Mysticial: oh right. That makes much more sense. Anyway, that still lets them fuse off an FMA but keep the core (for a 512b-half-throughput SKU), if there's a defect in the p5 FMA but the core is otherwise fine. They couldn't do that nearly as easily if they just widened the p0 and p1 FMA units. But they might just widen p01 in the higher core-count silicon since there are no 512b-half-throughput SKUs that use that silicon, so a defect anywhere in any FMA unit kills the core. – Peter Cordes Jul 03 '17 at 15:49
  • @Mysticial, why did Intel bother implementing `vpmullq`? Is this for forward planning? A future processor could implement it as a single uop. That's the only good reason I can see for them to implement `vpmullq` now. – Z boson Jul 04 '17 at 06:50
  • 1
    @Zboson: One per 1.5 cycles is *much* better throughput than our AVX2 32-bit-chunks implementations achieve, and probably also better latency. 3 uops is not bad at all. `vpmulld` is still 2 uops on Haswell and later, including Skylake. (up from 1 uop on SnB/IvB). – Peter Cordes Jul 04 '17 at 07:03
4

If you're interested in SIMD 64bx64b to 64b (lower) operations here are the AVX and AVX2 solutions from Agner Fog's Vector Class Library. I would test these with arrays and see how it compares to what GCC does with a generic loop such as the one in Peter Cordes' answer.

AVX (use SSE - you can still compile with -mavx to get vex encoding).

// vector operator * : multiply element by element
static inline Vec2q operator * (Vec2q const & a, Vec2q const & b) {
#if INSTRSET >= 5   // SSE4.1 supported
    // instruction does not exist. Split into 32-bit multiplies
    __m128i bswap   = _mm_shuffle_epi32(b,0xB1);           // b0H,b0L,b1H,b1L (swap H<->L)
    __m128i prodlh  = _mm_mullo_epi32(a,bswap);            // a0Lb0H,a0Hb0L,a1Lb1H,a1Hb1L, 32 bit L*H products
    __m128i zero    = _mm_setzero_si128();                 // 0
    __m128i prodlh2 = _mm_hadd_epi32(prodlh,zero);         // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m128i prodlh3 = _mm_shuffle_epi32(prodlh2,0x73);     // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m128i prodll  = _mm_mul_epu32(a,b);                  // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m128i prod    = _mm_add_epi64(prodll,prodlh3);       // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
#else               // SSE2
    int64_t aa[2], bb[2];
    a.store(aa);                                           // split into elements
    b.store(bb);
    return Vec2q(aa[0]*bb[0], aa[1]*bb[1]);                // multiply elements separetely
#endif
}

AVX2

// vector operator * : multiply element by element
static inline Vec4q operator * (Vec4q const & a, Vec4q const & b) {
    // instruction does not exist. Split into 32-bit multiplies
    __m256i bswap   = _mm256_shuffle_epi32(b,0xB1);           // swap H<->L
    __m256i prodlh  = _mm256_mullo_epi32(a,bswap);            // 32 bit L*H products
    __m256i zero    = _mm256_setzero_si256();                 // 0
    __m256i prodlh2 = _mm256_hadd_epi32(prodlh,zero);         // a0Lb0H+a0Hb0L,a1Lb1H+a1Hb1L,0,0
    __m256i prodlh3 = _mm256_shuffle_epi32(prodlh2,0x73);     // 0, a0Lb0H+a0Hb0L, 0, a1Lb1H+a1Hb1L
    __m256i prodll  = _mm256_mul_epu32(a,b);                  // a0Lb0L,a1Lb1L, 64 bit unsigned products
    __m256i prod    = _mm256_add_epi64(prodll,prodlh3);       // a0Lb0L+(a0Lb0H+a0Hb0L)<<32, a1Lb1L+(a1Lb1H+a1Hb1L)<<32
    return  prod;
}

These functions work for signed and unsigned 64-bit integers. In your case since q is constant within the loop you don't need to recalculate some things every iteration but your compiler will probably figure that out anyway.

Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • Those are 9 fused-domain uops on Haswell/SKL (where `[v]pmulld` is 2 uops, vs. 1 on SnB). In the unfused domain: 4 shuffle uops (p5), 3 mul uops (p0), and 2 add uops (one p1-only, one p15). So with good scheduling, it runs at one per 4c on Haswell and Skylake. (vs. one per 5c on Haswell or one per 2.5c on Skylake.) – Peter Cordes May 19 '16 at 13:01
  • Couldn't hadd/pshufd be replaced with something? [Yeah, we can use shift/add/and](https://godbolt.org/g/3MoJmL). 1 shuffle uop (p5), 3 mul+1 shift (p0), 2 ADD (p15), 1 AND (p015). We could replace the shift with a `pshufb` to reduce p0 pressure. Generating the constant now takes 2 insns (so compilers will choose to load it from memory), but I'm not counting that or the xor, since it can be hoisted after inlining. – Peter Cordes May 19 '16 at 13:21
  • With `b` as the same-every-time operand, yeah it can hoist the `bswap` shuffle. Interesting point. Everything else depends on `a*b`, though, so that's all. Make sure you use `b=q`, not `a=q`. – Peter Cordes May 19 '16 at 13:33
  • edited this into my answer. Thanks again for suggesting a different algorithm that can be optimized better. I didn't take the time to consider anything but gcc's autovectorized output. – Peter Cordes May 19 '16 at 13:44
  • @PeterCordes, cool that you improved Agner's solution. I would send Agner an email with your suggestion. If he likes it he probably will put it into the next version of his vector class library. – Z boson May 20 '16 at 06:05
  • 2
    Thank you. I will put this in the next update of my Vector class library – A Fog May 24 '16 at 16:45