15

AVX2 doesn't have any integer multiplications with sources larger than 32-bit. It does offer 32 x 32 -> 32 multiplies, as well as 32 x 32 -> 64 multiplies¹, but nothing with 64-bit sources.

Let's say I need an unsigned multiply with inputs larger than 32 bits, but less than or equal to 52 bits - can I simply use the floating-point DP multiply or FMA instructions, and will the output be bit-exact when the integer inputs and results can be represented in 52 or fewer bits (i.e., in the range [0, 2^52-1])?

How about the more general case where I want all 104 bits of the product? Or the case where the integer product takes more than 52 bits (i.e., the product has non-zero values in bit indexes >= 52) - but I want only the low 52 bits? In this latter case, the MUL is going to give me the higher bits and round away some of the lower bits (perhaps that's what IFMA helps with?).

Edit: in fact, perhaps it could do anything up to 2^53, based on this answer - I had forgotten that the implied leading 1 before the mantissa effectively gives you another bit.
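
A quick scalar sanity check of the premise (just a sketch; assumes round-to-nearest and a compiler that keeps the FP expressions as written):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Products that fit in 53 bits are exactly representable in a double,
    // so the FP multiply matches the integer multiply bit-for-bit.
    uint64_t a = (1ULL << 26) + 12345;            // ~26-bit operands,
    uint64_t b = (1ULL << 26) + 6789;             // product < 2^53
    printf("%d\n", (uint64_t)((double)a * (double)b) == a * b);   // prints 1

    // Once the product needs more than 53 bits, rounding kicks in.
    uint64_t c = (1ULL << 27) + 1, d = (1ULL << 27) + 3;          // product ~ 2^54
    printf("%d\n", (uint64_t)((double)c * (double)d) == c * d);   // prints 0
    return 0;
}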


¹ Interestingly, the 64-bit-product PMULDQ operation has half the latency and twice the throughput of the 32-bit PMULLD version, as Mysticial explains in the comments.

BeeOnRope
  • Good question. You should be good to go, as the FMA instructions are equivalent to IEEE 754 double-precision arithmetic (with only one rounding stage at the end). Anything you can do with a double should be achievable. – Jason R Dec 30 '16 at 23:07
  • Yeah, I probably put too much emphasis on FMA - it could be a plain vectorized `MUL` too. The question is: will the result be bit-exact, equivalent to an integer multiply, for all integers up to `2^52-1`? – BeeOnRope Dec 31 '16 at 00:26
  • Yes, you will get exact integer arithmetic, provided that you never have intermediate results that fall outside that range. I've encountered software that took advantage of this, using `double` types when more-than-32-bit integer operations were needed (said software was written before 64-bit processors became mainstream, so processors with good FPUs could have better double-precision speed versus emulating a 64-bit integer operation in software). – Jason R Dec 31 '16 at 03:12
  • [Prime95 version 28](http://www.mersenne.org/) uses AVX2 + FMA if available, since apparently that's faster than whatever they used before in version 27 (and definitely makes more heat, so it's a tougher stability-test for your hardware). I don't know *how* they use it, but prime-testing is inherently an integer problem, so it's worth looking into, since the source is available with apparently few restrictions on reuse: http://www.mersenne.org/download/#source – Peter Cordes Dec 31 '16 at 16:25
  • Fun fact: [AVX-512 IFMA](https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_IFMA) adds two instructions designed to make this easier: 52-bit low and high integer multiplies. This is obviously designed to take advantage of existing multiply hardware, not because anyone really wants a 52-bit-only multiply. – Peter Cordes Dec 31 '16 at 16:31
  • What about using FMA for 106-bit integers? http://stackoverflow.com/a/31072201/2542702 – Z boson Jan 03 '17 at 09:57
  • @JasonR - and for a straight MUL (rather than FMA) there will never be any intermediate result, right? Same (effectively) for FMA with addend zero? – BeeOnRope Jan 03 '17 at 16:05
  • @Zboson - your link made me realize that (a) maybe I actually want a 106-bit product, or that perhaps (b) I actually wanted the low 52/53 bits of a product that may have some higher bits too, and a 106-bit solution would include that. I added a paragraph to the question starting with "How about the more general..." - and perhaps your answer there can be applicable here too (and I have some questions...). – BeeOnRope Jan 03 '17 at 16:33
  • @PeterCordes No special tricks. They use FFTs to perform the large integer multiplication. And FFTs are entirely floating-point operations. – Mysticial Jan 03 '17 at 17:23
  • @BeeOnRope On recent processors, `pmulld` is actually half the throughput of `pmuldq`. So this cancels out the "half rate" that you're observing. The most plausible reason for this is that the hardware consists of one 52 x 52 -> 104-bit multiplier per 64-bit SIMD lane. By suppressing the correct carry-propagation lanes, it can double as a pair of 23 x 23-bit -> 46-bit multipliers for single precision. – Mysticial Jan 03 '17 at 18:01
  • For `pmuldq`, you have one operation per 64 bits. So you simply zero-pad the 32-bit inputs and truncate the 104-bit output down to 64 bits. For `pmulld`, you have two 32-bit multiplies per 64-bits of data. 32 is too large to fit into the single-precision multiply. So you must run it through the full 52-bit multiplier. But you must do it twice since you only have half as many multipliers as you have operands. Therefore `pmulld` has double the latency and half the throughput. – Mysticial Jan 03 '17 at 18:02
  • @Mysticial - yes, that makes a lot of sense, especially given, for example, that `PMULLD` and friends send 2 uops to either p0 or p1 - that really looks like doing the operation twice, once for half of the entries each time. What's interesting is that the latency (Skylake) is fully 10 cycles - 2x the 5-cycle mul latency, whereas I'd expect it could be 6 (i.e., the two halves are independent, so start the second half in the second cycle). – BeeOnRope Jan 03 '17 at 19:00
  • I guess maybe it was easiest just to do it that way, so you could do the second mul with the first-half results already sitting in the alternate words, since the parallel approach would require some kind of final blend. – BeeOnRope Jan 03 '17 at 19:01
  • I don't think it's been mentioned yet but AVX512 has 64x64 -> 64 with `vpmullq` (`_mm512_mullo_epi64`). @Mysticial suspects this will be slow. In any case I have AVX512 hardware (but KNL, which uses low-power Silvermont-based cores, so perhaps not reflective of high-power cores) to test this now... if I find time. – Z boson Jan 03 '17 at 20:36
  • Are there Skylake Xeons shipping with AVX-512 yet? – BeeOnRope Jan 03 '17 at 20:37
  • Not that I am aware of. I don't think they will appear until after summer. Facebook and Google have them I think though (which has angered a lot of other server users). I think they will show up after AMD Ryzen or whatever they call it now. – Z boson Jan 03 '17 at 20:39
  • @Zboson Without knowing much about Skylake Purley, I'm putting my money on `vpmullq` being a 3-uop instruction with a latency no more than 12 or 15 cycles. – Mysticial Jan 03 '17 at 20:56
  • @Zboson - in case it wasn't clear above, I'm kind of nudging you to post an answer based on your earlier note about using FMA for 106 bit integers. At some point, I'll do it if you don't, but I want to ask some questions and it's weird to ask questions on my question :) – BeeOnRope Jan 03 '17 at 23:03
  • @Zboson [Amazon has indicated that they will be making new Skylake Xeon-based instance types available on their Elastic Compute Cloud virtual machine service in early 2017 that support AVX512](https://aws.amazon.com/about-aws/whats-new/2016/11/coming-soon-amazon-ec2-c5-instances-the-next-generation-of-compute-optimized-instances/). – Jason R Jan 04 '17 at 00:03
  • @BeeOnRope, I already started looking into it. I have not worked with `double-double` in several months so I am rusty. I think the main problem is that with integer multi-word arithmetic multiplication is complex but addition is simple whereas [with double-double it's the other way around](http://stackoverflow.com/q/30573443/2542702). I think this means that you have to move back and forth between integer128 and double-double. This requires a lot of bit manipulation which I have not worked out yet and I'm not even sure it will be efficient. It looks like with AVX512 they do this for you. – Z boson Jan 04 '17 at 08:27
  • @BeeOnRope, you can use 2 vectors, one for bits 0-31 and another for bits 32-51 (obviously adding them back together takes some effort). After that, do the 32 -> 64 multiplication for both vectors and add the results in an appropriate way. You then have a 52 -> 64 multiplication whose source is 2 separate 32-bit pieces: one full 32 bits and one with only 20 bits used. That's it - no need for any FP. Just split them, multiply, and shift the second vector's results left to generate the final 52 -> 64 result, and if you want more, use 2 duplicated vectors to simulate all the multiplications. – ADMS Jan 05 '17 at 22:02
  • @Zboson - at first I thought your answer on the other linked question solved all my problems (efficient computation of the 52/53 low bits of an integer multiply), but after re-reading it I became confused and asked you a question over there. – BeeOnRope Jan 05 '17 at 23:15
  • @BeeOnRope have you tried implementing it yet? You can start with scalar code using `fma` from `math.h`. E.g. `int64_t ai = (1LL<<40)-1, bi = (1LL<<40)-1; double a = ai, b = bi; double p = a*b, e = fma(a,b,-p);`. And then look at `p` and `e`. Try different values of `ai` and `bi`. – Z boson Jan 06 '17 at 09:34
  • @BeeOnRope, well based on Mysticial's comments [here](http://stackoverflow.com/questions/31069291/is-it-really-efficient-to-use-karatsuba-algorithm-in-64-bit-x-64-bit-multiplicat/31072201?noredirect=1#comment70231327_31072201) he has already done it. It's probably mostly useful for education. If I work it out I will post an answer. – Z boson Jan 06 '17 at 19:38
  • @Zboson I'll do it if I have time. But the problem with the approach is that there are too many loose corners. The "optimal" answer will vary drastically depending on what you want the inputs and outputs to be. (integers? doubles? scaled doubles?) The explosion of solutions basically boils down to the large number of ways to merge it with the scaling and the [fast double<->int64 trick](http://stackoverflow.com/a/41148578/922184). Not to mention different scaling values depending on what range of inputs you want to support. And then you have the issue where the lower word may be negative... – Mysticial Jan 06 '17 at 19:56
  • @Zboson And I forgot to mention that some of the solutions may break with `-ffast-math`. So they're pretty flaky to begin with (even if provably correct on paper). At least that was the case under ICC. I know GCC does *some* fast-math optimizations with intrinsics. But I don't know if it does it enough to break these methods. – Mysticial Jan 06 '17 at 20:10
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/132586/discussion-on-question-by-beeonrope-can-i-use-the-avx-fma-units-to-do-bit-exact). – Bhargav Rao Jan 07 '17 at 19:44
  • @PeterCordes The AVX512-IFMA has the same latency/throughput as the FP FMA instructions (4|1 for the single-FMA i3-8121U). Though the verdict is still out on any possible bypass delays due to using an FP unit for an integer operation. (this bypass delay exists for all the existing integer multiply instructions) As far as usefulness outside of bignum arithmetic, `vpmadd52luq` will be useful as a 3x faster alternative to `vpmullq` if the user doesn't need the full 64-bit range. But it's something that would need to be done manually since it's unlikely a compiler would be able to infer this. – Mysticial Jul 18 '18 at 17:53

3 Answers

16

Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.

There's virtually an unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they are easy to generalize once you know how they work.

Disclaimers:

  • All solutions here assume the rounding mode is round-to-even.
  • Use of fast-math optimization flags is not recommended as these solutions rely on strict IEEE.

Signed doubles in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(30423614405477505635920876929024.);    //  3 * 2^103
    const __m256d SCALE = _mm256_set1_pd(1. / 4503599627370496);                //  1 / 2^52

    //  Multiply and add normalization constant. This forces the multiply
    //  to be rounded to the correct number of bits.
    H = _mm256_fmadd_pd(A, B, ROUND);

    //  Undo the normalization.
    H = _mm256_sub_pd(H, ROUND);

    //  Recover the bottom half of the product.
    L = _mm256_fmsub_pd(A, B, H);

    //  Correct the scaling of H.
    H = _mm256_mul_pd(H, SCALE);
}

This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.

It is absolutely critical that both the FMAs stay fused. And this is where fast-math optimizations can break things. If the first FMA is broken up, then L is no longer guaranteed to be in the range [-2^51, 2^51]. If the second FMA is broken up, L will be completely wrong.
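
A minimal way to sanity-check this variant (sketch only; assumes the function above is in scope and the compiler supports `__int128`):

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main(){
    //  Integer-valued doubles in [-2^51, 2^51].
    int64_t ai = (1LL << 50) + 12345;
    int64_t bi = -((1LL << 49) + 999);

    __m256d A = _mm256_set1_pd((double)ai);
    __m256d B = _mm256_set1_pd((double)bi);
    __m256d L, H;
    mul52_signed(L, H, A, B);

    //  With the final scaling applied, A*B == L + H*2^52 exactly.
    int64_t l = (int64_t)_mm_cvtsd_f64(_mm256_castpd256_pd128(L));
    int64_t h = (int64_t)_mm_cvtsd_f64(_mm256_castpd256_pd128(H));
    __int128 rebuilt = (__int128)h * ((__int128)1 << 52) + l;
    printf("%s\n", rebuilt == (__int128)ai * bi ? "exact" : "mismatch");
}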


Signed integers in the range: [-2^51, 2^51]

//  A*B = L + H*2^52
//  Input:  A and B are in the range [-2^51, 2^51]
//  Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(6755399441055744);     //  3*2^51
    const __m256d CONVERT_D = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double. Note that B gets the small constant (1.5), so b comes
    //  out scaled by 2^-52; the high half of the product then lands directly in H.
    A = _mm256_add_epi64(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_add_epi64(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_sub_epi64(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_D);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_D));
}

Building off of the first example, we combine it with a generalized version of the fast double <-> int64 conversion trick.

This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.
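
For readers unfamiliar with that trick, here is the int64 -> double direction in scalar form (an illustrative sketch; valid for |x| <= 2^51 under round-to-nearest):

#include <cstdint>
#include <cstring>
#include <cstdio>

static double int64_to_double_fast(int64_t x){
    const double MAGIC = 6755399441055744.;    //  1.5 * 2^52
    int64_t magic_bits;
    std::memcpy(&magic_bits, &MAGIC, 8);
    int64_t t = x + magic_bits;                //  integer add into the mantissa
    double d;
    std::memcpy(&d, &t, 8);                    //  reinterpret as double
    return d - MAGIC;                          //  undo the offset
}

int main(){
    printf("%.1f %.1f\n",
           int64_to_double_fast((1LL << 51) - 7),
           int64_to_double_fast(-12345));      //  2251799813685241.0 -12345.0
}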


Unsigned integers in the range: [0, 2^52)

//  A*B = L + H*2^52
//  Input:  A and B are in the range [0, 2^52)
//  Output: L and H are in the range [0, 2^52)
void mul52_unsigned(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(4503599627370496);     //  2^52
    const __m256d CONVERT_D = _mm256_set1_pd(1);
    const __m256d CONVERT_S = _mm256_set1_pd(1.5);

    __m256d l, h, a, b;

    //  Convert to double. Note that B gets the small constant (1.0), so b comes
    //  out scaled by 2^-52; the high 52 bits of the product then land directly in H.
    A = _mm256_or_si256(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_or_si256(B, _mm256_castpd_si256(CONVERT_D));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_D);

    //  Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_U);
    H = _mm256_xor_si256(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_U));

    //  Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_U);

    //  Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    //  Convert L to int64
    l = _mm256_add_pd(l, CONVERT_S);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_S));

    //  Make Correction
    H = _mm256_sub_epi64(H, _mm256_srli_epi64(L, 63));
    L = _mm256_and_si256(L, _mm256_set1_epi64x(0x000fffffffffffff));
}

Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.

But at this point, we're at 13 instructions - half of which are high-latency instructions, not counting the numerous FP <-> int bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a 64 x 64 -> 128-bit SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs.)
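
For reference, the low-half building block of that 64 x 64 SIMD multiply is the usual 3-multiply decomposition (sketch only; the full 128-bit version adds the high-half bookkeeping):

#include <immintrin.h>

//  Low 64 bits of a 64 x 64 unsigned multiply per lane (no vpmullq before AVX512DQ):
//  a*b mod 2^64 = lo(a)*lo(b) + ((lo(a)*hi(b) + hi(a)*lo(b)) << 32)
static inline __m256i mul64_lo(__m256i a, __m256i b){
    __m256i a_hi = _mm256_srli_epi64(a, 32);
    __m256i b_hi = _mm256_srli_epi64(b, 32);
    __m256i lo   = _mm256_mul_epu32(a, b);        //  lo(a) * lo(b)
    __m256i m1   = _mm256_mul_epu32(a, b_hi);     //  lo(a) * hi(b)
    __m256i m2   = _mm256_mul_epu32(a_hi, b);     //  hi(a) * lo(b)
    __m256i mid  = _mm256_add_epi64(m1, m2);
    return _mm256_add_epi64(lo, _mm256_slli_epi64(mid, 32));
}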

The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is `h = _mm256_fmadd_pd(a, b, CONVERT_U);`. So on AVX512, you can override the rounding for that instruction and leave the rounding mode alone.
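
With AVX512 that per-instruction override looks roughly like this (a sketch using the embedded-rounding form of the FMA):

#include <immintrin.h>

//  Round the critical FMA toward -inf via embedded rounding,
//  without touching MXCSR. (AVX512F, 512-bit form shown.)
__m512d fmadd_round_down(__m512d a, __m512d b, __m512d convert_u){
    return _mm512_fmadd_round_pd(a, b, convert_u,
                                 _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}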


Final Thoughts:

It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa to use for accumulation in floating-point. This lets you bypass the need to constantly convert back and forth between int64 and double like in the last 2 solutions.

While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.

And of course all of this is moot when AVX512-IFMA comes out. That reduces a full 52 x 52 -> 104-bit product to 2 instructions and gives the accumulation for free.
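
A sketch of what the IFMA version would look like (untested; requires AVX512IFMA):

#include <immintrin.h>

//  A*B = L + H*2^52 with A, B in [0, 2^52). The addend operand is zero here,
//  but it can just as well be a running accumulator.
void mul52_ifma(__m512i& L, __m512i& H, __m512i A, __m512i B){
    const __m512i ZERO = _mm512_setzero_si512();
    L = _mm512_madd52lo_epu64(ZERO, A, B);    //  low  52 bits of the product
    H = _mm512_madd52hi_epu64(ZERO, A, B);    //  high 52 bits of the product
}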

Mysticial
  • I just found a corner case where these approaches will fail. Gonna need to dig a bit deeper. – Mysticial Jan 08 '17 at 03:11
  • Deleting this answer for now. Counter example: `1294983567229061 * 1957786387885495` The double-rounding of `A*B` causes `L` to fall outside of the range `[-2^51, 2^51]`. This screws up the fast `double->int64` conversion trick. – Mysticial Jan 08 '17 at 04:05
  • Found a different (and slightly faster) approach that fixes the double-rounding problem. Answer restored. – Mysticial Jan 08 '17 at 05:14
  • Normalizing and un-normalizing is clever. I did not think of that. I'm still working through your answer and my own solution. – Z boson Jan 08 '17 at 23:18
  • @Zboson Normalization is the same thing that makes the fast `double<->int64` conversion possible. And it's actually the key to an entire field of what I call "FP abuse" - using the FPU to do all the wrong things for all the wrong reasons. The idea existed long ago, but I didn't start researching it until 2011 when Sandy Bridge (with AVX) made the FPU powerful enough to make this sort of crap viable from a performance perspective. – Mysticial Jan 09 '17 at 18:17
  • Some of the "FP abuse" applications don't require strict IEEE and can get away with a rounding instruction instead. I generally prefer those since they are more resistant to fast-math optimizations. The int52 multiply here seems to fall just short of that. Which is why I had to fall back to the normalization approach. – Mysticial Jan 09 '17 at 18:18
  • Do you have more examples of this "FP abuse" (e.g. answers on SO)? I would like to read up more on this. – Z boson Jan 10 '17 at 03:04
  • @Zboson The only one I have that's open-sourced is [here](https://github.com/Mysticial/DigitViewer/blob/master/Source/DigitViewer/DigitConverter/u64d19/u64d19_forward_intrinsics_AVX2.h). It takes 8 64-bit integers and converts them into 8 sets of 19-chars holding the decimal representation of the integers. There are also SSE4.1 and AVX512 implementations in the same folder. – Mysticial Jan 10 '17 at 03:38
  • This conversion operation is fundamentally an integer task, but done entirely using floating-point. The examples there use a combination of rounding and normalization. Unfortunately, they are all pretty unreadable as is usually the case for FP abusive code. – Mysticial Jan 10 '17 at 03:40
  • Okay, the normalization trick is starting to make some sense after working through your `double -> int64_t` answer. – Z boson Jan 10 '17 at 11:14
3

One way to do multi-word integer arithmetic is with double-double arithmetic. Let's start with some double-double multiplication code

#include <math.h>
typedef struct {
  double hi;
  double lo;
} doubledouble;

static doubledouble quick_two_sum(double a, double b) {
  double s = a + b;
  double e = b - (s - a);
  return (doubledouble){s, e};
}

static doubledouble two_prod(double a, double b) {
  double p = a*b;
  double e = fma(a, b, -p);
  return (doubledouble){p, e};
}

doubledouble df64_mul(doubledouble a, doubledouble b) {
  doubledouble p = two_prod(a.hi, b.hi);
  p.lo += a.hi*b.lo;
  p.lo += a.lo*b.hi;
  return quick_two_sum(p.hi, p.lo);
}

The function two_prod can do integer 53bx53b -> 106b in two instructions. The function df64_mul can do integer 106bx106b -> 106b.
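
As a quick sanity check of that claim (a sketch; assumes the functions above are in the same file, the compiler supports `__int128`, and the FMA/contract flags noted below are used):

#include <stdio.h>
#include <stdint.h>

int main(void) {
  int64_t ai = (1LL << 52) - 12345, bi = (1LL << 50) + 6789;
  doubledouble r = two_prod((double)ai, (double)bi);
  __int128 ref = (__int128)ai * bi;
  __int128 got = (__int128)r.hi + (__int128)r.lo;   // both halves are integer-valued
  printf("%s\n", got == ref ? "exact" : "mismatch");
  return 0;
}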

Let's compare this to integer 128bx128b -> 128b with integer hardware.

__int128 mul128(__int128 a, __int128 b) {
  return a*b;
}

The assembly for mul128

imul    rsi, rdx
mov     rax, rdi
imul    rcx, rdi
mul     rdx
add     rcx, rsi
add     rdx, rcx

The assembly for df64_mul (compiled with gcc -O3 -S i128.c -masm=intel -mfma -ffp-contract=off)

vmulsd      xmm4, xmm0, xmm2
vmulsd      xmm3, xmm0, xmm3
vmulsd      xmm1, xmm2, xmm1
vfmsub132sd xmm0, xmm4, xmm2
vaddsd      xmm3, xmm3, xmm0
vaddsd      xmm1, xmm3, xmm1
vaddsd      xmm0, xmm1, xmm4
vsubsd      xmm4, xmm0, xmm4
vsubsd      xmm1, xmm1, xmm4

mul128 does three scalar multiplications and two scalar additions/subtractions whereas df64_mul does 3 SIMD multiplications, 1 SIMD FMA, and 5 SIMD additions/subtractions. I have not profiled these methods but it does not seem unreasonable to me that df64_mul could outperform mul128 using 4-doubles per AVX register (change sd to pd and xmm to ymm).
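
If one wanted the 4-doubles-per-register variant, a direct translation of two_prod and df64_mul to AVX2 intrinsics might look like this (a sketch only, untested; needs -mfma):

#include <immintrin.h>

typedef struct {
  __m256d hi;
  __m256d lo;
} doubledouble4;

static doubledouble4 two_prod4(__m256d a, __m256d b) {
  __m256d p = _mm256_mul_pd(a, b);
  __m256d e = _mm256_fmsub_pd(a, b, p);   // exact error term of the product
  return (doubledouble4){p, e};
}

static doubledouble4 df64_mul4(doubledouble4 a, doubledouble4 b) {
  doubledouble4 p = two_prod4(a.hi, b.hi);
  p.lo = _mm256_add_pd(p.lo, _mm256_mul_pd(a.hi, b.lo));
  p.lo = _mm256_add_pd(p.lo, _mm256_mul_pd(a.lo, b.hi));
  __m256d s = _mm256_add_pd(p.hi, p.lo);                   // quick_two_sum
  __m256d e = _mm256_sub_pd(p.lo, _mm256_sub_pd(s, p.hi));
  return (doubledouble4){s, e};
}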


It's tempting to say that the problem is switching back to the integer domain. But why is this necessary? You can do everything in the floating point domain. Let's look at some examples. I find it easier to unit test with float than with double.

doublefloat two_prod(float a, float b) {
  float p = a*b;
  float e = fma(a, b, -p);
  return (doublefloat){p, e};
}

//3202129 * 4807935 = 15395628093615
doublefloat x = two_prod(3202129, 4807935);
int64_t hi = x.hi, lo = x.lo, s = hi + lo;
//x.hi = 1.53956280e+13, x.lo = 1.02575000e+05
//hi = 15395627991040, lo = 102575, s = 15395628093615

//1450779 * 1501672 = 2178594202488
doublefloat y = two_prod(1450779, 1501672);
int64_t hi2 = y.hi, lo2 = y.lo, s2 = hi2 + lo2;
//y.hi = 2.17859424e+12, y.lo = -4.00720000e+04
//hi2 = 2178594242560, lo2 = -40072, s2 = 2178594202488

So we end up with different ranges and in the second case the error (e) is even negative but the sum is still correct. We could even add the two doublefloat values x and y together (once we know how to do double-double addition - see the code at the end) and get 15395628093615+2178594202488. There is no need to normalize the results.

But addition brings up the main problem with double-double arithmetic. Namely, addition/subtraction is slow, e.g. a 128b+128b -> 128b add needs at least 11 floating-point additions whereas with integers it only needs two instructions (add and adc).

So if an algorithm is heavy on multiplication but light on addition then doing multi-word integer operations with double-double could win.
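
For comparison, the integer side of that count really is minimal:

// Compiles to add + adc (plus ABI moves) on x86-64.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b) {
  return a + b;
}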


As a side note, the C language is flexible enough to allow for an implementation where integers are implemented entirely through floating-point hardware: int could be 24 bits (from single-precision floating point), long could be 54 bits (from double-precision floating point), and long long could be 106 bits (from double-double). C does not even require two's complement, so integers could use sign-magnitude for negative numbers, as is usual with floating point.


Here is working C code with double-double multiplication and addition (I have not implemented division or other operations such as sqrt but there are papers showing how to do this) in case somebody wants to play with it. It would be interesting to see if this could be optimized for integers.

//if compiling with -mfma you must also use -ffp-contract=off
//float-float is easier to debug. If you want double-double replace
//all float words with double and fmaf with fma 
#include <stdio.h>
#include <math.h>
#include <inttypes.h>
#include <x86intrin.h>
#include <stdlib.h>

//#include <float.h>

typedef struct {
  float hi;
  float lo;
} doublefloat;

typedef union {
  float f;
  int i;
  struct {
    unsigned mantisa : 23;
    unsigned exponent: 8;
    unsigned sign: 1;
  };
} float_cast;

void print_float(float_cast a) {
  printf("%.8e, 0x%x, mantisa 0x%x, exponent 0x%x, expondent-127 %d, sign %u\n", a.f, a.i, a.mantisa, a.exponent, a.exponent-127, a.sign);
}

void print_doublefloat(doublefloat a) {
  float_cast hi = {a.hi};
  float_cast lo = {a.lo};
  printf("hi: "); print_float(hi);
  printf("lo: "); print_float(lo);
}

doublefloat quick_two_sum(float a, float b) {
  float s = a + b;
  float e = b - (s - a);
  return (doublefloat){s, e};
  // 3 add
}

doublefloat two_sum(float a, float b) {
  float s = a + b;
  float v = s - a;
  float e = (a - (s - v)) + (b - v);
  return (doublefloat){s, e};
  // 6 add 
}

doublefloat df64_add(doublefloat a, doublefloat b) {
  doublefloat s, t;
  s = two_sum(a.hi, b.hi);
  t = two_sum(a.lo, b.lo);
  s.lo += t.hi;
  s = quick_two_sum(s.hi, s.lo);
  s.lo += t.lo;
  s = quick_two_sum(s.hi, s.lo);
  return s;
  // 2*two_sum, 2 add, 2*quick_two_sum = 2*6 + 2 + 2*3 = 20 add
}

doublefloat split(float a) {
  //#define SPLITTER (1<<27) + 1
#define SPLITTER (1<<12) + 1
  float t = (SPLITTER)*a;
  float hi = t - (t - a);
  float lo = a - hi;
  return (doublefloat){hi, lo};
  // 1 mul, 3 add
}

doublefloat split_sse(float a) {
  __m128 k = _mm_set1_ps(4097.0f);
  __m128 a4 = _mm_set1_ps(a);
  __m128 t = _mm_mul_ps(k,a4);
  __m128 hi4 = _mm_sub_ps(t,_mm_sub_ps(t, a4));
  __m128 lo4 = _mm_sub_ps(a4, hi4);
  float tmp[4];
  _mm_storeu_ps(tmp, hi4);
  float hi = tmp[0];
  _mm_storeu_ps(tmp, lo4);
  float lo = tmp[0];
  return (doublefloat){hi,lo};

}

float mult_sub(float a, float b, float c) {
  doublefloat as = split(a), bs = split(b);
  //print_doublefloat(as);
  //print_doublefloat(bs);
  return ((as.hi*bs.hi - c) + as.hi*bs.lo + as.lo*bs.hi) + as.lo*bs.lo;
  // 4 mul, 4 add, 2 split = 6 mul, 10 add
}

doublefloat two_prod(float a, float b) {
  float p = a*b;
  float e = mult_sub(a, b, p);
  return (doublefloat){p, e};
  // 1 mul, one mult_sub
  // 7 mul, 10 add
}

float mult_sub2(float a, float b, float c) {
  doublefloat as = split(a);
  return ((as.hi*as.hi -c ) + 2*as.hi*as.lo) + as.lo*as.lo;
}

doublefloat two_sqr(float a) {
  float p = a*a;
  float e = mult_sub2(a, a, p);
  return (doublefloat){p, e};
}

doublefloat df64_mul(doublefloat a, doublefloat b) {
  doublefloat p = two_prod(a.hi, b.hi);
  p.lo += a.hi*b.lo;
  p.lo += a.lo*b.hi;
  return quick_two_sum(p.hi, p.lo);
  //two_prod, 2 add, 2mul, 1 quick_two_sum = 9 mul, 15 add 
  //or 1 mul, 1 fma, 2add 2mul, 1 quick_two_sum = 3 mul, 1 fma, 5 add
}

doublefloat df64_sqr(doublefloat a) {
  doublefloat p = two_sqr(a.hi);
  p.lo += 2*a.hi*a.lo;
  return quick_two_sum(p.hi, p.lo);
}

int float2int(float a) {
  int M = 0xc00000; //1100 0000 0000 0000 0000 0000
  a += M;
  float_cast x;
  x.f = a;
  return x.i - 0x4b400000;
}

doublefloat add22(doublefloat a, doublefloat b) {
  float r = a.hi + b.hi;
  float s = fabsf(a.hi) > fabsf(b.hi) ?
    (((a.hi - r) + b.hi) + b.lo ) + a.lo :
    (((b.hi - r) + a.hi) + a.lo ) + b.lo;
  return two_sum(r, s);  
  //11 add 
}

int main(void) {
  //print_float((float_cast){1.0f});
  //print_float((float_cast){-2.0f});
  //print_float((float_cast){0.0f});
  //print_float((float_cast){3.14159f});
  //print_float((float_cast){1.5f});
  //print_float((float_cast){3.0f});
  //print_float((float_cast){7.0f});
  //print_float((float_cast){15.0f});
  //print_float((float_cast){31.0f});

  //uint64_t t = 0xffffff;
  //print_float((float_cast){1.0f*t});
  //printf("%" PRId64 " %" PRIx64 "\n", t*t,t*t);

  /*
    float_cast t1;
    t1.mantisa = 0x7fffff;
    t1.exponent = 0xfe;
    t1.sign = 0;
    print_float(t1);
  */
  //doublefloat z = two_prod(1.0f*t, 1.0f*t);
  //print_doublefloat(z);
  //double z2 = (double)z.hi + (double)z.lo;
  //printf("%.16e\n", z2);
  doublefloat s = {0};
  int64_t si = 0;
  for(int i=0; i<100000; i++) {
    int ai = rand()%0x800, bi = rand()%0x800000;
    float a = ai, b = bi;
    doublefloat z = two_prod(a,b);
    int64_t zi = (int64_t)ai*bi;
    //print_doublefloat(z);
    //s = df64_add(s,z);
    s = add22(s,z);
    si += zi;
    print_doublefloat(z);
    printf("%d %d ", ai,bi);
    int64_t h = z.hi;
    int64_t l = z.lo;
    int64_t t = h+l;
    //if(t != zi) printf("%" PRId64 " %" PRId64 "\n", h, l);

    printf("%" PRId64 " %" PRId64 " %" PRId64 " %" PRId64 "\n", zi, h, l, h+l);

    h = s.hi;
    l = s.lo;
    t = h + l;
    //if(si != t) printf("%" PRId64 " %" PRId64 "\n", h, l);

    if(si > (1LL<<48)) {
      printf("overflow after %d iterations\n", i); break;
    }
  }

  print_doublefloat(s);
  printf("%" PRId64 "\n", si);
  int64_t x = s.hi;
  int64_t y = s.lo;
  int64_t z = x+y;
  //int hi = float2int(s.hi);
  printf("%" PRId64 " %" PRId64 " %" PRId64 "\n", z,x,y);
}
Z boson
  • Division and sqrt are pretty easy. Just call the native instruction for it. Then iterate one round of Newton's Method. That should get you down to something like +/- 2 ulp of double-double. Though I haven't cared enough to attempt a proof of that error bound. – Mysticial Jan 11 '17 at 19:49
2

Well, you certainly can do FP-lane operations on things that are integers, and they will always be exact (as long as everything stays within the 53-bit significand): while there are SSE instructions that do not guarantee proper IEEE-754 precision and rounding, without exception they are the ones which do not have an integer range, so not the ones you're looking at anyway. Bottom line: addition/subtraction/multiplication will always be exact in the integer domain, even if you're doing them on packed floats.

As for quad-precision floats (>52 bit mantissa), no, those aren't supported, and likely won't be in the foreseeable future. Just not much call for them. They showed up in a few SPARC-era workstation architectures, but honestly they were just a bandage over developers' incomplete understanding of how to write numerically stable algorithms, and over time they faded out.

Wide-integer operations turn out to be a really bad fit for SSE. I really tried to leverage it recently when I was implementing a big-integer library, and honestly it did me no good. x86 was designed for multi-word arithmetic; you can see it in operations such as ADC (which produces and consumes a carry bit) and IDIV (which allows the dividend to be twice as wide as the divisor, as long as the quotient is no wider than the divisor - a constraint that makes it useless for anything but multiword division). But multiword arithmetic is by nature sequential, and SSE is by nature parallel. If you're lucky enough that your numbers have just enough bits to fit into a FP mantissa, congratulations. But if you have big integers in general, SSE is probably not going to be your friend.

Sneftel
  • Can you clarify what you meant by "do not have an integer range"? Also I'm not sure I asked about quad-precision floats - but perhaps you mean the "where I want all 104 bits of the product" case? I'm actually interested especially in the next part, where I want _some subset_ of the 104 bits, and per z boson, it may be [possible via FMA](http://stackoverflow.com/a/31072201/2542702), because FMA has 106 bit partial results? – BeeOnRope Jan 03 '17 at 23:01
  • @BeeOnRope The instructions I was thinking of are trigonometry and approximate reciprocals, that sort of thing... the ones which don't produce integer results. – Sneftel Jan 03 '17 at 23:16
  • @BeeOnRope The double-double operations you mentioned are neato, but probably not useful in the context of integer ops, simply because converting them *back* to integers will be more expensive than doing the operations int-side in the first place. – Sneftel Jan 03 '17 at 23:18
  • I see, you were using `range` in a set/function [theoretical way](https://en.wikipedia.org/wiki/Range_(mathematics)). It makes sense now. – BeeOnRope Jan 03 '17 at 23:31
  • If I understand Z Boson's above-linked answer correctly, it seems you could arrange for a fairly trivial conversion - if you set the 52nd bit in the mantissa, can't you assume something about the exponent of the resulting answer, allowing simple masking extraction? – BeeOnRope Jan 03 '17 at 23:35
  • @BeeOnRope No, that's the thing. The less-significant result's mantissa won't necessarily be right below the more-significant result's, so you'd need to do a bunch of shifting. Between the shifts and moving in and out of XMMs, I just don't think it'll be a win compared to pure-integer ops, particularly since if you have FMA (introduced in Haswell), you probably have ADX and MULX (introduced in Broadwell) too. – Sneftel Jan 03 '17 at 23:41
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/132243/discussion-between-beeonrope-and-sneftel). – BeeOnRope Jan 03 '17 at 23:41
  • I agree that it's probably not a win pre-AVX512. You have to beat four MUL (or MULX), each of which does 64x64 -> 128. You can do the double-double in a mul and an FMA but then you need to convert to int. You may also need to convert from int to double-double. These conversions combined with the double-double mult are likely going to be slower than four MUL. – Z boson Jan 05 '17 at 08:46
  • However, your statement "But multiword arithmetic is by nature sequential, and SSE is by nature parallel." shows a lack of understanding of SIMD. Nobody who understands SIMD would pack e.g. the low and hi part of a 128-bit integer into one SSE register. You would use two SSE registers for two 128-bit integers: one for the low 64-bits and one for the high 64-bits. Also, [I suspect that with AVX512 you can already beat ADC/ADCX/ADCO](http://stackoverflow.com/questions/27923192/practical-bignum-avx-sse-possible/27978043#27978043). That, along with IFMA in AVX512 are worth considering. – Z boson Jan 05 '17 at 08:49
  • @Zboson Of course the real wins from SIMD are from using them for, well, SIMD. It looks like the OP isn't interested in multi-lane processing, though -- just the wider variety of operations available through SSE. – Sneftel Jan 05 '17 at 08:54
  • @Zboson and @Sneftel - to be clear, I am interested in multi-lane processing (whatever that means) but not multi-word arithmetic per se. What I want is fast, fixed-size multiplication, with larger sizes being better. For example, I may want a lo(N x N) -> N multiplication but for whatever size N ends up maximizing `f(N)` per unit time, where `f(N)` looks approximately like `N + (constant)`. So long multi-word multiplication doesn't help, because cost is super-linear, but SIMD can definitely help here compared to `MULX`. – BeeOnRope Jan 05 '17 at 22:41
  • @BeeOnRope My experience on Haswell with AVX2 is that it really depends on what the inputs and desired outputs are. If both are scalar values in GPRs, then IMUL or MULX all the way. If both are 256-bit vector registers, then they are tied - but slightly leaning in favor of vector. If you choose the scalar route, you need to extract 4 lanes from each operand, call scalar IMUL or MULX, and repack them. – Mysticial Jan 05 '17 at 23:04
  • In the vector case, you break it down into vector 32 x 32 -> 64-bit multiplies with adds/shifts/shuffles. In the IMUL case (low 64 bits), the SIMD approach is usually faster. If you want the high 64 bits, it's very close. Granted, I don't think I've ever properly benchmarked that case. Either way, it sucks. – Mysticial Jan 05 '17 at 23:05
  • FWIW, I always want either low bits (i.e., N x N -> N) or all the bits (i.e., N x N -> 2N), never just the high bits, and the inputs start in vector regs. – BeeOnRope Jan 05 '17 at 23:14
  • @BeeOnRope, have you worked through this yet http://www.hackersdelight.org/hdcodetxt/muldws.c.txt ? See also my [answer](http://stackoverflow.com/questions/28766755/compute-the-doubleword-product-signed-of-two-words-given-the-lower-word-produc). Once you have the scalar part working [here](http://stackoverflow.com/a/28827013/2542702) is how to do it with SIMD. My solution there could probably be optimized with regards to port pressure which I did not consider or appreciate at the time (I only considered number of instructions) but the idea is the same. – Z boson Jan 06 '17 at 08:40