13

I have created a function which does 64-bit * 64-bit to 128-bit using SIMD. Currently I have implemented it using SSE2 (actually SSE4.1). This means it does two 64b*64b to 128b products at the same time. The same idea could be extended to AVX2 or AVX512, giving four or eight 64b*64b to 128b products at the same time. I based my algorithm on http://www.hackersdelight.org/hdcodetxt/muldws.c.txt

That algorithm does one unsigned multiplication, one signed multiplication, and two signed * unsigned multiplications. The signed * signed and unsigned * unsigned operations are easy to do using _mm_mul_epi32 and _mm_mul_epu32. But the mixed signed and unsigned products caused me trouble. Consider, for example:

int32_t x = 0x80000000;
uint32_t y = 0x7fffffff;
int64_t z = (int64_t)x*y;

The double word product should be 0xc000000080000000. But how can you get this if you assume your compiler does not know how to handle mixed types? This is what I came up with:

int64_t sign = x<0; sign*=-1;        //get the sign and make it all ones
uint32_t t = abs(x);                 //if x<0 take two's complement again
uint64_t prod = (uint64_t)t*y;       //unsigned product
int64_t z = (prod ^ sign) - sign;    //take two's complement based on the sign

Using SSE this can be done like this

__m128i xh;    //(xl2, xh2, xl1, xh1) high is signed, low unsigned
__m128i yl;    //(yh2, yl2, yh1, yl1)
__m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh); // get sign
        xs     = _mm_shuffle_epi32(xs, 0xA0);              // extend sign
__m128i t      = _mm_sign_epi32(xh,xh);                    // abs(xh)
__m128i prod   = _mm_mul_epu32(t, yl);                     // unsigned (xh2*yl2,xh1*yl1)
__m128i inv    = _mm_xor_si128(prod,xs);                   // invert bits if negative
__m128i z      = _mm_sub_epi64(inv,xs);                    // add 1 if negative

This gives the correct result. But I have to do this twice (only once when squaring), and it's now a significant fraction of my function. Is there a more efficient way of doing this with SSE4.2, AVX2 (four 128-bit products), or even AVX512 (eight 128-bit products)?

Maybe there are more efficient ways of doing this than with SIMD? It's a lot of calculations to get the upper word.

Edit: based on the comment by @ElderBug it looks like the way to do this is not with SIMD but with the mul instruction. For what it's worth, in case anyone wants to see how complicated this is, here is the full working function (I just got it working so I have not optimized it, but I don't think it's worth it).

void muldws1_sse(__m128i x, __m128i y, __m128i *lo, __m128i *hi) {
    __m128i lomask = _mm_set1_epi64x(0xffffffff);

    __m128i xh     = _mm_shuffle_epi32(x, 0xB1);    // x0l, x0h, x1l, x1h
    __m128i yh     = _mm_shuffle_epi32(y, 0xB1);    // y0l, y0h, y1l, y1h

    __m128i xs     = _mm_cmpgt_epi32(_mm_setzero_si128(), xh);
    __m128i ys     = _mm_cmpgt_epi32(_mm_setzero_si128(), yh);
            xs     = _mm_shuffle_epi32(xs, 0xA0);
            ys     = _mm_shuffle_epi32(ys, 0xA0);

    __m128i w0     = _mm_mul_epu32(x,  y);          // x0l*y0l, x1l*y1l
    __m128i w3     = _mm_mul_epi32(xh, yh);         // x0h*y0h, x1h*y1h
            xh     = _mm_sign_epi32(xh,xh);
            yh     = _mm_sign_epi32(yh,yh);

    __m128i w1     = _mm_mul_epu32(x,  yh);         // x0l*y0h, x1l*y1h
    __m128i w2     = _mm_mul_epu32(xh, y);          // x0h*y0l, x1h*y1l

    __m128i yinv   = _mm_xor_si128(w1,ys);          // invert bits if negative
            w1     = _mm_sub_epi64(yinv,ys);         // add 1
    __m128i xinv   = _mm_xor_si128(w2,xs);          // invert bits if negative
            w2     = _mm_sub_epi64(xinv,xs);         // add 1

    __m128i w0l    = _mm_and_si128(w0, lomask);
    __m128i w0h    = _mm_srli_epi64(w0, 32);

    __m128i s1     = _mm_add_epi64(w1, w0h);         // xl*yh + w0h;
    __m128i s1l    = _mm_and_si128(s1, lomask);      // lo(xl*yh + w0h);
    __m128i s1h    = _mm_srai_epi64(s1, 32);

    __m128i s2     = _mm_add_epi64(w2, s1l);         //xh*yl + s1l
    __m128i s2l    = _mm_slli_epi64(s2, 32);
    __m128i s2h    = _mm_srai_epi64(s2, 32);           //arithmetic shift right

    __m128i hi1    = _mm_add_epi64(w3, s1h);
            hi1    = _mm_add_epi64(hi1, s2h);

    __m128i lo1    = _mm_add_epi64(w0l, s2l);
    *hi = hi1;
    *lo = lo1;
}

It gets worse. There is no _mm_srai_epi64 intrinsic/instruction until AVX512 so I had to make my own.

static inline __m128i _mm_srai_epi64(__m128i a, int b) {
    // incomplete: only handles b == 32, the shift actually used above
    (void)b;
    __m128i sra  = _mm_srai_epi32(a, 32);
    __m128i srl  = _mm_srli_epi64(a, 32);
    __m128i mask = _mm_set_epi32(-1, 0, -1, 0);
    __m128i out  = _mm_blendv_epi8(srl, sra, mask);
    return out;
}

My implementation of _mm_srai_epi64 above is incomplete. I think I was using Agner Fog's Vector Class Library. If you look in the file vectori128.h you find

static inline Vec2q operator >> (Vec2q const & a, int32_t b) {
    // instruction does not exist. Split into 32-bit shifts
    if (b <= 32) {
        __m128i bb   = _mm_cvtsi32_si128(b);               // b
        __m128i sra  = _mm_sra_epi32(a,bb);                // a >> b signed dwords
        __m128i srl  = _mm_srl_epi64(a,bb);                // a >> b unsigned qwords
        __m128i mask = _mm_setr_epi32(0,-1,0,-1);          // mask for signed high part
        return  selectb(mask,sra,srl);
    }
    else {  // b > 32
        __m128i bm32 = _mm_cvtsi32_si128(b-32);            // b - 32
        __m128i sign = _mm_srai_epi32(a,31);               // sign of a
        __m128i sra2 = _mm_sra_epi32(a,bm32);              // a >> (b-32) signed dwords
        __m128i sra3 = _mm_srli_epi64(sra2,32);            // a >> (b-32) >> 32 (second shift unsigned qword)
        __m128i mask = _mm_setr_epi32(0,-1,0,-1);          // mask for high part containing only sign
        return  selectb(mask,sign,sra3);
    }
}
Z boson
  • Can't you just use the assembly `mul` instruction? It already does 64*64 = 128 multiplication, and it's a single instruction. – ElderBug Mar 02 '15 at 11:03
  • @ElderBug, yeah, [I just learned that it does this](https://stackoverflow.com/questions/25241477/signed-and-unsigned-arithmetic-implementation-on-x86/25242228). Maybe I'm wasting my time with this. With AVX512 I could do eight 128 products at once. I don't know if that will beat `mul` eight times. AVX512 has `_mm512_mullo_epi64` but sadly no `_mm512_mul_epi64`. – Z boson Mar 02 '15 at 11:37
  • I'm not sure if you can do 64x64->128 multiplications in less than 8 AVX512 instructions or not. If it's not possible I think 8 `mul`s will still be faster – phuclv Mar 02 '15 at 12:08
  • It probably depends on what you want. If you can easily vectorize your application (little vectorization cost), maybe AVX512 is worth it. Maybe even AVX, but I highly doubt it is with just SSE. Also, 1 million multiplications with `mul` will take around 1ms to compute, so it's probably not worth the trouble if you have less than tens of millions. – ElderBug Mar 02 '15 at 12:12
  • @LưuVĩnhPhúc, yeah I'm almost certain you're right. This is disappointing. (though I learned a lot doing this). There is still no efficient way to do this with SIMD even with AVX512. I wonder if [Intel ADX](https://en.wikipedia.org/wiki/Intel_ADX) or [BMI](https://en.wikipedia.org/wiki/BMI2) could help (I'm new to this subject)? I have never used these. – Z boson Mar 02 '15 at 12:14
  • @LưuVĩnhPhúc It is more like 16 AVX512 instructions, because AVX will usually be twice as fast as `mul`. But anyway, I don't think it is worth it even with 16. – ElderBug Mar 02 '15 at 12:14
  • @ElderBug, I'm creating fractals using float or double using SIMD and it works great (four doubles at once with AVX). But after some zooming it's no good. Now I want to [implement fixed point Q32.96](http://www.bealto.com/mp-mandelbrot_fp128-opencl.html) so I can zoom in much further. I guess this can't be done efficiently with SIMD. – Z boson Mar 02 '15 at 12:17
  • This is about multiplication so I don't think Intel ADX will work here, but `MULX` in BMI2 may improve performance a bit. Since you want to increase the precision of the calculations maybe you'll be interested in [double-double arithmetic](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic). It has a wider dynamic range and is easier to vectorize because you don't need to carry from the low part anymore, and multiplications may also be easier. The downside is that the precision is limited to 106/107 bits – phuclv Mar 02 '15 at 12:29
  • @LưuVĩnhPhúc, yes, I'm definitely interested in double-double. Maybe that can be done efficiently with SIMD. I'll look into that now. Thanks! You don't happen to know a SSE or AVX implementation of it do you? – Z boson Mar 02 '15 at 12:33
  • @LưuVĩnhPhúc, actually, I don't need a SIMD version of double-double I just need to see some C code on how it can be implemented like muldws.c from Hackers delight? Do you know anywhere a double-double algorithm is described? I tried looking at some papers on the past on this but they were obtuse (too much text not enough code). – Z boson Mar 02 '15 at 12:41
  • I've heard some libraries for double-double but I'm not aware of a SIMD one. You can find some [here](http://crd-legacy.lbl.gov/~dhbailey/mpdist/) or read the paper [here](http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf). A fully accurate implementation may be much slower than the 128-bit int approach but if you can accept a lower precision than 106 bits, maybe it's worth a try. – phuclv Mar 02 '15 at 12:43
  • @Zboson: double-double is extremely easy to implement in SIMD (and pretty fast) on Haswell and beyond (thanks to FMA), if you are willing to take a relaxed attitude about edge cases. If you need to handle NaNs and inf correctly, it's a little bit more painful. You'll definitely want to keep the high and low parts in separate registers, not interleaved. – Stephen Canon Mar 02 '15 at 13:22
  • C (with some ppc inline asm) implementations of double-double: http://www.opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c – Stephen Canon Mar 02 '15 at 13:29
  • @StephenCanon, I looked at the double-double code you posted before I read your comment about how easy double-double is. The code was so simple I thought it must be missing a lot. It seems double-double is probably a much faster and simpler solution than implementing fixed point `Q32.96`. – Z boson Mar 03 '15 at 08:23
  • I can access @StephenCanon's link without problem. Here is the [plain text version](http://www.opensource.apple.com/source/gcc/gcc-5646/gcc/config/rs6000/darwin-ldouble.c?txt). [This answer](http://stackoverflow.com/a/6770329/995714) also gives 2 other papers about this – phuclv Mar 03 '15 at 17:59
  • @LưuVĩnhPhúc, thanks, the plain text version still says untrusted connection. Googling a bit on it shows some other people get the same thing. – Z boson Mar 04 '15 at 07:51
  • @LưuVĩnhPhúc, https://security.stackexchange.com/questions/64747/if-even-apple-cannot-properly-install-certificate – Z boson Mar 04 '15 at 07:53
  • @LưuVĩnhPhúc, apparently the certificate is only valid for https://opensource.apple.com not for https://www.opensource.apple.com. Those links look the same until you put your mouse over them. But the second one is preceded by www and the first one is not. – Z boson Mar 04 '15 at 08:52
  • @Zboson For AVX2 implementation I have a question: Does your algorithm take 2x4 unsigned 64b (a1,a2,a3,a4), (b1,b2,b3,b4) in 2 __m256i vectors and give 2x4 unsigned 64b (hi(a1*b1),hi(a2*b2),hi(a3*b3),hi(a4*b4)) and (lo(a1*b1),lo(a2*b2),lo(a3*b3),lo(a4*b4)) in 2 __m256i vectors? Is the order correct? – Yigit Demirag Jul 04 '15 at 16:11
  • @YigitDemirag, I think so. I have not looked at my solution in a while but that sounds right. AVX can do 32*32 to 64 so it's based on that. – Z boson Jul 06 '15 at 09:06
  • Found this question when I was looking for the possible implementation of `_mm_srai_epi64`. Your implementation at the end of the question is erroneous. It doesn't even compile, on the other hand the whole logic doesn't implement `_mm_srai_epi64` correctly. Please at least rename it to avoid the misleading of new readers. – plasmacel Dec 28 '16 at 11:35
  • @plasmacel, why does it not compile? What compiler are you using? – Z boson Dec 28 '16 at 11:39
  • @Zboson You forgot to `return` the result or use argument `b`. This is not a real problem, but apart from these it doesn't perform 64-bit arithmetic right shifts correctly on arbitrary inputs. – plasmacel Dec 28 '16 at 11:49
  • @plasmacel, I think I was using Agner Fog's Vector Class Library at the time and I pasted a fraction of the code but not all of it. Download http://www.agner.org/optimize/vectorclass.zip and search for " Vec2q operator >>" in vectori128.h – Z boson Dec 28 '16 at 12:01
  • @plasmacel, I updated my answer. I hope that helps. – Z boson Dec 28 '16 at 12:05
  • @plasmacel, okay I was shifting by exactly 32. So I really implemented `_mm_srai_epi64(__m128i a, 32)` except I forgot to return `out`. – Z boson Dec 28 '16 at 12:08
  • @Zboson Yeah, I've seen that. At most cases where you need a 64bit right shift, lacking this instruction you can come up with a more efficient solution specific to that problem instead of implementing it fully as a whole. Actually I read this question some days ago, just wanted to make a note about it for correctness. – plasmacel Dec 28 '16 at 12:13

2 Answers

10

I found a SIMD solution which is much simpler and does not need signed*unsigned products. I'm no longer convinced that SIMD (at least with AVX2 and AVX512) can't compete with mulx. In some cases SIMD can compete with mulx; the only case I'm aware of is FFT-based multiplication of large numbers.

The trick was to do the unsigned multiplication first and then correct. I learned how to do this from this answer: 32-bit-signed-multiplication-without-using-64-bit-data-type. The correction is simple: for (hi,lo) = x*y, do the unsigned multiplication first and then correct hi like this:

hi -= ((x<0) ? y : 0)  + ((y<0) ? x : 0)

This can be done with the SSE4.2 intrinsic _mm_cmpgt_epi64:

void muldws1_sse(__m128i x, __m128i y, __m128i *lo, __m128i *hi) {    
    muldwu1_sse(x,y,lo,hi);    
    //hi -= ((x<0) ? y : 0)  + ((y<0) ? x : 0);
    __m128i xs = _mm_cmpgt_epi64(_mm_setzero_si128(), x);
    __m128i ys = _mm_cmpgt_epi64(_mm_setzero_si128(), y);           
    __m128i t1 = _mm_and_si128(y,xs);
    __m128i t2 = _mm_and_si128(x,ys);
           *hi = _mm_sub_epi64(*hi,t1);
           *hi = _mm_sub_epi64(*hi,t2);
}

The code for the unsigned multiplication is simpler since it does not need mixed signed*unsigned products. Additionally, since it's unsigned it does not need an arithmetic shift right, which only gets a 64-bit instruction with AVX512. In fact the following function only needs SSE2:

void muldwu1_sse(__m128i x, __m128i y, __m128i *lo, __m128i *hi) {    
    __m128i lomask = _mm_set1_epi64x(0xffffffff);

    __m128i xh     = _mm_shuffle_epi32(x, 0xB1);    // x0l, x0h, x1l, x1h
    __m128i yh     = _mm_shuffle_epi32(y, 0xB1);    // y0l, y0h, y1l, y1h

    __m128i w0     = _mm_mul_epu32(x,  y);          // x0l*y0l, x1l*y1l
    __m128i w1     = _mm_mul_epu32(x,  yh);         // x0l*y0h, x1l*y1h
    __m128i w2     = _mm_mul_epu32(xh, y);          // x0h*y0l, x1h*y1l
    __m128i w3     = _mm_mul_epu32(xh, yh);         // x0h*y0h, x1h*y1h

    __m128i w0l    = _mm_and_si128(w0, lomask);     //(*)
    __m128i w0h    = _mm_srli_epi64(w0, 32);

    __m128i s1     = _mm_add_epi64(w1, w0h);
    __m128i s1l    = _mm_and_si128(s1, lomask);
    __m128i s1h    = _mm_srli_epi64(s1, 32);

    __m128i s2     = _mm_add_epi64(w2, s1l);
    __m128i s2l    = _mm_slli_epi64(s2, 32);        //(*)
    __m128i s2h    = _mm_srli_epi64(s2, 32);

    __m128i hi1    = _mm_add_epi64(w3, s1h);
            hi1    = _mm_add_epi64(hi1, s2h);

    __m128i lo1    = _mm_add_epi64(w0l, s2l);       //(*)
    //__m128i lo1    = _mm_mullo_epi64(x,y);          //alternative

    *hi = hi1;
    *lo = lo1;
}

This uses

4x mul_epu32
5x add_epi64
2x shuffle_epi32
2x and
2x srli_epi64
1x slli_epi64
****************
16 instructions

AVX512 has the _mm_mullo_epi64 intrinsic which can calculate lo with one instruction. In this case the alternative can be used (comment out the lines marked (*) and uncomment the alternative line):

5x mul_epu32
4x add_epi64
2x shuffle_epi32
1x and
2x srli_epi64
****************
14 instructions

To change the code to full-width AVX2, replace _mm with _mm256, si128 with si256, and __m128i with __m256i. For AVX512, replace them with _mm512, si512, and __m512i.

Z boson
  • Yeah, your unsigned 64x64 multiply is similar to mine. My SSE4.1/AVX2 version is 19 instructions, but it trades away the shifts for shuffles since shifts use the same port as the multiplier. – Mysticial Mar 03 '15 at 08:37
  • And my AVX512 version is 15 instructions. – Mysticial Mar 03 '15 at 08:43
  • @Mysticial, that's good to know. I'm on the right track then. I have not tried to optimize the function (I have not even compiled with optimization yet - mostly been unit testing). I should run it through IACA. Why are you interested in the multi-word products? Is this for `pi` calculations? – Z boson Mar 03 '15 at 08:49
  • The 64-bit number-theoretic transform has a kernel that needs two 64x64 lower-half products and one 64x64 upper-half product. – Mysticial Mar 03 '15 at 08:51
  • If SSE/AVX instructions can be executed simultaneously then multiple MULX instructions can also run at the same time. But you may save a lot of time flushing SIMD registers to memory and adding serially. Anyway, the way to correct unsigned to signed multiplication above is interesting – phuclv Mar 03 '15 at 18:22
  • @Mysticial, IFMA! Is that why your AVX512 version is shorter than your SSE4.2/AVX2 version? I could cut my AVX512 version down by three instructions by using IFMA. – Z boson Mar 04 '15 at 07:55
  • @LưuVĩnhPhúc, I think the throughput of MULX is 1 per cycle whereas for `_mm_add_epi64` it's two per cycle. Additionally, some of the shuffles and ands can run on other ports at the same time as the mults and adds. I'll probably have to change the shifts to shuffles. Anyway I agree that I need to compare the two methods. – Z boson Mar 04 '15 at 08:00
  • @Zboson Not the IFMA, but rather the embedded masking that lets me combine some instructions. – Mysticial Mar 04 '15 at 08:05
  • @Mysticial, okay, i'll look into that. But you could use IFMA? Is there a reason you don't? It seems like a good candidate. – Z boson Mar 04 '15 at 08:07
  • I haven't tried, but I'm not sure it will be helpful at all. It takes two IFMA instructions to get a full 104-bit product. And you'll still need 4 of those to get up to 128-bit. Not to mention the extra masking and shifting that will be needed. – Mysticial Mar 04 '15 at 08:11
  • @Mysticial, I just looked up IFMA. What's this strange 52-bit integer stuff (e.g. `_mm_madd52hi_epu64`)? I see your point now. IFMA is not exactly what I thought it was. I guess in my case I don't really need 128 bits. I could use 104. This is strange. – Z boson Mar 04 '15 at 08:15
  • It's pretty obvious that it exposes the double-precision circuitry. And it's also the reason why I believe that `_mm512_mullo_epi64()` is not going to be a fast instruction. At least not through Skylake/Cannonlake. – Mysticial Mar 04 '15 at 08:18
  • @Zboson On my MIC, I am getting the error that _mm512_mul_epu32 is undefined, therefore I cannot use this multiplication algorithm on MIC. Isn't that strange, since _mm512_mul_epu32 is an AVX-512F instruction? – Yigit Demirag Jul 17 '15 at 09:44
  • @YigitDemirag, I have never used MIC (though I wish I could!). As far as I understand the 512-bit SIMD in Knights Corner is not 100% compatible with AVX512. – Z boson Jul 17 '15 at 09:46
  • @Zboson Yes, it seems so. What else can I use instead of KNC to benefit from AVX512? Is KNL published yet? Because I really could not benefit from AVX512 here. – Yigit Demirag Jul 17 '15 at 09:53
  • @YigitDemirag, as far as I know there is not AVX512 hardware available for the public yet. You're in Geneva. Do you work at CERN? I worked there for four years. – Z boson Jul 17 '15 at 10:34
  • @Zboson Ohh nice. I also work at CERN. As I just confirmed, Knights Landing did not go public yet; it is internally used by Intel. Therefore the best thing would be just coding for AVX512 and waiting for the public release :) – Yigit Demirag Jul 17 '15 at 13:08
  • @YigitDemirag, in case you're interested there is special version of the [Vector Class Library](http://agner.org/optimize/#vectorclass) for Knights Corner. [It was created by a guy at CERN named Przemysław Karpiński](https://bitbucket.org/veclibknc/vclknc). – Z boson Jul 17 '15 at 13:20
  • @Zboson I am aware of that library, however I don't see any benefit of it since I will need a SIMD version of [hi,lo] = (64b x 64b) and the best candidate is your algorithm here, which does not work on KNC. I don't think looping over MULX 8 times gives a good performance here, what do you think? – Yigit Demirag Jul 17 '15 at 13:31
  • @YigitDemirag, I think you're not going to beat `mulx` except in a few special cases even with AVX512. As Mysticial explained `_mm512_mullo_epi64` is likely to be slow anyway. But of course it's impossible to say for sure until we have real hardware. And even if `_mm512_mullo_epi64` was fast it would still be hard to beat `mulx`. – Z boson Jul 17 '15 at 13:39
9

The right way to think about the throughput limits of integer multiplication using various instructions is in terms of how many "product bits" you can compute per cycle.

mulx produces one 64x64 -> 128 result every cycle; that's 64x64 = 4096 "product bits per cycle"

If you piece together a multiplier on SIMD out of instructions that do 32x32 -> 64 bit multiplies, you need to be able to get four results every cycle to match mulx (4x32x32 = 4096). If there was no arithmetic other than the multiplies, you'd just break even on AVX2. Unfortunately, as you've noticed, there's lots of arithmetic other than the multiplies, so this is a total non-starter on current generation hardware.

Stephen Canon
  • What got me started on this was `_mm512_mullo_epi64`. I thought I could use this to get the upper word quickly. [But it turns out that getting the upper word given the lower word is hardly easier](https://stackoverflow.com/questions/28766755/compute-the-doubleword-product-signed-of-two-words-given-the-lower-word-produc) (it's a lot of work determining a carry of 0, 1, or 2, not to mention the signed*unsigned terms). I'm really surprised by this. Because Intel did not create `_mm512_mul_epi64`, an extension of a 16-bit scalar instruction from the 80s trumps the best vector instruction from 2015. – Z boson Mar 02 '15 at 15:01
  • @Zboson I wouldn't count on `_mm512_mullo_epi64` being a fast instruction. The fact that Intel added IFMA52 instructions (probably for Cannonlake based on their emulator) suggests that Intel has no intention to put a 64-bit multiplier in their SIMD any time soon. – Mysticial Mar 02 '15 at 19:32
  • @Zboson Also, getting the upper half of an *unsigned* 64x64 multiply is much easier. I have an implementation of this with AVX2. In the context of my particular application, it works out to be about 40% faster than pulling them out into scalar `mulx` and repacking them. – Mysticial Mar 02 '15 at 19:53
  • @Mysticial, I posted an answer to my question which uses a much simpler unsigned 64x64 to 128 to construct a signed product. I think it's much better than what I had before. Is my unsigned function `muldwu1_sse` similar to your AVX2 version? – Z boson Mar 03 '15 at 08:25
  • @Mysticial, it turns out that the signed version should take the same number of instructions as the unsigned version if there is an instruction which can do 64x64 to 64. So on AVX512 the signed version could be implemented using only 16 instructions like I did with the unsigned version. However, as you said the 64x64 to 64 instruction might be much slower than the 32x32 to 64 instruction so doing it three times in the signed version might be much slower than using unsigned and than correcting. – Z boson Mar 05 '15 at 12:12
  • @Mysticial, I explained why in my edit here https://stackoverflow.com/questions/28766755/compute-the-doubleword-product-signed-of-two-words-given-the-lower-word-produc/28828840#28828840 – Z boson Mar 05 '15 at 12:12
  • @Mysticial _why_ does Intel not plan on putting a 64 bit multiplier in the SIMD unit? It seems like it would not be excessively large. – Demi Oct 21 '16 at 03:56
  • @Demi a 64-bit multiplier is 50% bigger than a 53-bit multiplier (the biggest operation currently available). Adding one would increase the area used for SIMD by a significant factor (meaning fewer SIMD units per processor area to support a fairly niche feature). – Stephen Canon Oct 28 '16 at 01:14