8

How do I multiply two 64-bit integers by another two 64-bit integers (packed in SSE registers)? I couldn't find any instruction that can do it.

Ines Karmani
  • What does "two 64 bit integers" mean in this context? Do you mean a pair of 64 bit integers (a la complex numbers), or a single 128 bit integer represented as a pair of 64 bit integers? – Eric Brown Jul 25 '13 at 16:21
  • I mean a single `__m128i` represented as a pair of 64-bit integers – Ines Karmani Jul 25 '13 at 16:27
  • Possible duplicate of [this question](http://stackoverflow.com/questions/12200698/is-it-possible-to-use-sse-v2-to-make-a-128-bit-wide-integer), then. – Eric Brown Jul 25 '13 at 16:45
  • Related: [Fastest way to multiply an array of int64\_t?](https://stackoverflow.com/q/37296289) for AVX2 or SSE4.1, with performance analysis vs 64-bit scalar code (in case you don't already have your data in SIMD vectors.) – Peter Cordes Jan 15 '19 at 07:23

3 Answers

6

Late answer, but this is a better version of what Barabas posted.

If you have ever used GCC or Clang's vector extensions, this is the routine they use.

This uses the same method that long multiplication and grid multiplication use.

    65
  * 73
  ----
    15 //   (5 * 3)
   180 //   (6 * 3) * 10
   350 //   (5 * 7) * 10
+ 4200 // + (6 * 7) * 100
------
  4745

However, instead of working in units of 10, it works in units of 2^32 (the 32-bit halves of each operand), and it leaves out the last partial product because it would always be shifted past the 64th bit, just as you wouldn't bother multiplying 6*7 above if you were truncating the result to two digits.
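
In scalar terms (a hypothetical helper, just to restate the identity that the diagram in the code below encodes): writing each operand as hi*2^32 + lo, only three of the four 32x32-bit partial products survive truncation to 64 bits:

#include <stdint.h>

/* Truncated 64-bit product built from 32-bit halves; the hi*hi term is
 * dropped because it lands entirely above bit 63. */
static uint64_t mul64_trunc(uint64_t x, uint64_t y)
{
    uint64_t xl = (uint32_t)x, xh = x >> 32;
    uint64_t yl = (uint32_t)y, yh = y >> 32;
    return xl * yl + ((xl * yh + xh * yl) << 32);
}

The SSE2 routine below does exactly this, for two 64-bit lanes at a time.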

#include <emmintrin.h>

/*
 * Grid/long multiply two 64-bit SSE lanes.
 * Works for both signed and unsigned.
 *  .----------------.--------------.----------------.
 *  |                |   b >> 32    | a & 0xFFFFFFFF |
 *  |----------------|--------------|----------------|  
 *  | d >> 32        |   b*d << 64  |    a*d << 32   |
 *  |----------------|--------------|----------------|
 *  | c & 0xFFFFFFFF |   b*c << 32  |       a*c      |
 *  '----------------'--------------'----------------'
 *  Add all of them together to get the product.
 *
 *  Because we truncate the value to 64 bits, b*d << 64 will be zero,
 *  so we can leave it out.
 *
 *  We also can add a*d and b*c first and then shift because of the
 *  distributive property: (a << 32) + (b << 32) == (a + b) << 32.
 */

__m128i Multiply64Bit(__m128i ab, __m128i cd)
{
    /* ac = (ab & 0xFFFFFFFF) * (cd & 0xFFFFFFFF); */
    __m128i ac = _mm_mul_epu32(ab, cd);

    /* b = ab >> 32; */
    __m128i b = _mm_srli_epi64(ab, 32);

    /* bc = b * (cd & 0xFFFFFFFF); */
    __m128i bc = _mm_mul_epu32(b, cd);

    /* d = cd >> 32; */
    __m128i d = _mm_srli_epi64(cd, 32);

    /* ad = (ab & 0xFFFFFFFF) * d; */
    __m128i ad = _mm_mul_epu32(ab, d);

    /* high = bc + ad; */
    __m128i high = _mm_add_epi64(bc, ad);

    /* high <<= 32; */
    high = _mm_slli_epi64(high, 32);

    /* return ac + high; */
    return _mm_add_epi64(high, ac);
}
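
A minimal way to sanity-check the routine (illustrative values; assumes the Multiply64Bit above is in scope):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    __m128i x = _mm_set_epi64x(10000000007LL, -3);    /* lanes: hi = 10000000007, lo = -3 */
    __m128i y = _mm_set_epi64x(99, 1234567890123LL);  /* lanes: hi = 99, lo = 1234567890123 */
    int64_t r[2];
    _mm_storeu_si128((__m128i *)r, Multiply64Bit(x, y));
    /* Expect r[0] == -3 * 1234567890123 and r[1] == 10000000007 * 99, truncated to 64 bits. */
    printf("%lld %lld\n", (long long)r[0], (long long)r[1]);
    return 0;
}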

Compiler Explorer Note: The GCC vector extension version was also included below for comparison.
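
For reference, a sketch of what that vector-extension version looks like (the typedef name is illustrative); with plain SSE2, GCC and Clang lower the 64-bit `*` to essentially the pmuludq/shift/add sequence above:

#include <stdint.h>

typedef uint64_t v2u64 __attribute__((vector_size(16)));  /* two 64-bit lanes */

v2u64 Multiply64BitVec(v2u64 a, v2u64 b)
{
    return a * b;  /* element-wise 64-bit multiply */
}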

phuclv
EasyasPi
  • With `-march=skylake-avx512`, we get AVX512DQ `vpmullq` :) AVX512 finally introduced 64-bit element-size integer multiply (a one-line sketch follows these comments). – Peter Cordes Jan 15 '19 at 07:16
  • And BTW, without AVX2, it's probably not worth it to use SIMD for 64x64 => 64-bit multiply, unless you already have your data in vectors. (One scalar `imul r64, r/m64` uop per 64-bit integer is pretty good. [Fastest way to multiply an array of int64\_t?](https://stackoverflow.com/q/37296289)). My answer there uses `mullo_epi32` (SSE4.1 or AVX2) to get both low*high products at the same time, although `pmulld` does take 2 uops on Intel CPUs. – Peter Cordes Jan 15 '19 at 07:19
  • True. I do want to mention that the method used for NEON does that as well: it does a vrev64 (32-bit word swap), a 4xi32 multiply, a vpaddl (pairwise add), a shift left, then a long multiply-accumulate. If SSE had a pairwise add, that would probably be faster, but considering that NEON_2_SSE scalarizes that instruction, I assume it does not. – EasyasPi Jan 16 '19 at 14:33
  • [SSSE3 has `phaddd`](https://www.felixcloutier.com/x86/phaddw:phaddd), but it decodes to 2 shuffles that feed a vertical `paddd` uop; it's faster not to use it. I haven't looked at the details of my linked answer, but it does mention using psrlq / paddq / pand (3 total uops) instead of phadd + pshufd (3 shuffle uops + an ADD). More instructions but fewer uops, and much less shuffle-port bottleneck. Oh, [`vpaddl`](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473m/dom1361289973929.html) widens the elements. PHADDD has 2 inputs and 1 output, so no it's not a full replacement. – Peter Cordes Jan 16 '19 at 14:37
  • Hah, thanks! I was wondering why I implemented it like I did and I found my own answer, plus your improved version :) – Bas Apr 24 '22 at 21:01
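
For completeness, on AVX-512DQ hardware the whole routine collapses to a single intrinsic, as noted in the first comment above (a sketch; the 128-bit form needs `-mavx512dq -mavx512vl`):

#include <immintrin.h>

__m128i Multiply64BitAVX512(__m128i a, __m128i b)
{
    return _mm_mullo_epi64(a, b);  /* vpmullq: packed 64-bit multiply, low halves kept */
}
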
5

I know this is an old question, but I was actually looking for exactly this. As there's still no instruction for it, I implemented the 64-bit multiply myself using `pmuldq` as Paul R mentioned. This is what I came up with:

// requires g++ -msse4.1 ...

#include <emmintrin.h>
#include <smmintrin.h>

__m128i Multiply64Bit(__m128i a, __m128i b)
{
    auto ax0_ax1_ay0_ay1 = a;
    auto bx0_bx1_by0_by1 = b;
    
    // i means ignored

    auto ax1_i_ay1_i = _mm_shuffle_epi32(ax0_ax1_ay0_ay1, _MM_SHUFFLE(3, 3, 1, 1));
    auto bx1_i_by1_i = _mm_shuffle_epi32(bx0_bx1_by0_by1, _MM_SHUFFLE(3, 3, 1, 1));

    auto ax0bx0_ay0by0 = _mm_mul_epi32(ax0_ax1_ay0_ay1, bx0_bx1_by0_by1);
    auto ax0bx1_ay0by1 = _mm_mul_epi32(ax0_ax1_ay0_ay1, bx1_i_by1_i);
    auto ax1bx0_ay1by0 = _mm_mul_epi32(ax1_i_ay1_i, bx0_bx1_by0_by1);

    auto ax0bx1_ay0by1_32 = _mm_slli_epi64(ax0bx1_ay0by1, 32);
    auto ax1bx0_ay1by0_32 = _mm_slli_epi64(ax1bx0_ay1by0, 32);

    return _mm_add_epi64(ax0bx0_ay0by0, _mm_add_epi64(ax0bx1_ay0by1_32, ax1bx0_ay1by0_32));
}

Godbolt at SSE Multiply64Bit.

phuclv
Bas
  • Have you done any benchmarking of the code versus using the general purpose registers for this? I'd be interested in the results as I'm doing a ton of 64 by 64 bit multiplies. – jeteon Aug 09 '17 at 02:08
  • I just did some benchmarking: it's still faster than scalar multiplication (compiled with cl /O2). Around 831,600,000 multiplications in 0.37 s on average on my somewhat powerful i7 5820K, while the same scalar multiplications took 1.71 s on average, so it's around 4 times faster, which is a little bit weird. I guess cl is really good at optimizing superscalar instructions – JukesOnYou Oct 24 '17 at 16:58
  • `_mm_mul_epi32` is an SSE4.1 instruction. `_mm_mul_epu32` is an SSE2 instruction. `_mm_mul_epu32` produces much better code, but it requires unsigned types. – jww Jan 09 '19 at 04:29
  • Actually, this doesn't work with epi32; you need the epu32 ones to get correct results. My bad, I should have unit tested it (: Also, you should do the shift after the add instead of before. – Bas Apr 25 '22 at 12:10
3

You would need to implement your own 64-bit multiplication routine using 32-bit multiply operations. It's probably not going to be any more efficient than just doing this with scalar code though, particularly as there will be a lot of shuffling of the vectors to line up all the required operands.

Paul R
  • Off the top of my head, wasn't there a `pmuldqq` or something added in SSE4? – Gunther Piez Jul 26 '13 at 16:44
  • There's a `pmuldq` in SSE4 which is a 32x32 => 64 bit multiply, so you could use that as a building block for a 64x64 bit multiply. – Paul R Jul 26 '13 at 16:55
  • Do you know the best scalar algorithm for this (assuming you only have 32-bit hardware)? This is what I would do: split each number into its upper and lower 32-bit halves, so a*b = (ah*2^32 + al)*(bh*2^32 + bl) = t4*2^64 + (t2 + t3)*2^32 + t1, where t1 = al*bl, t2 = al*bh, t3 = ah*bl, t4 = ah*bh. Each term is a 64-bit number. Then t2 and t3 have to be split again into low and high halves, and the final result is (a*b)l = t1 + (t2l + t3l)*2^32 and (a*b)h = t4 + t2h + t3h + c, where c is any carry out of (a*b)l. That's 4 mults and 7 adds (sketched in code after these comments). Is this somewhere on SO? – Z boson Feb 18 '15 at 17:26
  • I've never implemented this myself, but it should be something like the method you suggest. I can't imagine that it will be very efficient though, so probably only worthwhile if you have other 64 bit SIMD operations that you want to combine it with. – Paul R Feb 18 '15 at 17:35
  • On Sandy Bridge, the general purpose multiplication and the vector multiplication are issued to distinct ports, so you might be able to get the SSE multiplies for free if you are doing more than one set of multiplies. The adds and shuffles would be an issue though. If you do things that don't need port 5 much, these might also come out for free. – jeteon Aug 09 '17 at 02:09
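
A sketch of the scalar scheme described in Z boson's comment above (assumes only 32x32 => 64-bit multiplies are available; the helper name is illustrative):

#include <stdint.h>

/* Full 64x64 => 128-bit product built from four 32x32 => 64-bit partial products. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;

    uint64_t t1 = al * bl;   /* low  x low  */
    uint64_t t2 = al * bh;   /* low  x high */
    uint64_t t3 = ah * bl;   /* high x low  */
    uint64_t t4 = ah * bh;   /* high x high */

    /* Middle column plus the carry out of t1's high half; cannot overflow 64 bits. */
    uint64_t mid = (t1 >> 32) + (uint32_t)t2 + (uint32_t)t3;

    *lo = (mid << 32) | (uint32_t)t1;
    *hi = t4 + (t2 >> 32) + (t3 >> 32) + (mid >> 32);
}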