17

SSE/AVX registers could be viewed as integer or floating-point BigNums. That is, one could ignore that lanes exist at all. Is there an easy way to exploit this point of view and use these registers as BigNums, either singly or combined? I ask because, from what little I've seen of BigNum libraries, they almost universally store and do arithmetic on arrays, not on SSE/AVX registers. Portability?

Example:

Say you store the contents of an SSE register as a key in a std::set; you could then compare these contents as a BigNum.
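
For concreteness, here is a minimal sketch of that idea (not part of the original question): a comparator that orders __m128i keys as if they were unsigned 128-bit integers. It assumes a little-endian x86 target and C++17 so that node allocation respects the 16-byte alignment of __m128i.

#include <emmintrin.h>  // __m128i
#include <cstdint>
#include <cstring>
#include <set>

struct U128Less {
    bool operator()(const __m128i& a, const __m128i& b) const {
        uint64_t al[2], bl[2];
        std::memcpy(al, &a, sizeof al);  // al[0] = low 64 bits, al[1] = high 64 bits
        std::memcpy(bl, &b, sizeof bl);
        return al[1] != bl[1] ? al[1] < bl[1] : al[0] < bl[0];  // compare high limb first
    }
};

std::set<__m128i, U128Less> keys;  // ordered as if each key were an unsigned 128-bit BigNum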

Peter Cordes
user1095108
  • 5
    Of course it's possible, just insanely inconvenient, inefficient and slow. When you do addition with arrays of _limbs_ (32/64-bit words), it's easy to use the x86 Carry flag to propagate the carry bit. The lanes of SSE registers do _not_ have carry flags, which means overflow must be detected in a different way (more computationally intensive), and even if you did detect overflow you then have the problem of complicated SSE/AVX shuffles to move up the carries, and you have to do this `N-1` times for `N`-limb bignums. Then what happens if you need to extend a bignum beyond 128/256 bits...? – Iwillnotexist Idonotexist Jan 13 '15 at 13:28
  • You can join 2 or more of the registers together using gcc/clang/icc [vector extensions](https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). You could write an answer explaining why you think it would be impractical. The thing is, I think gcc maps arrays to SIMD registers badly, but it maps SIMD registers back to arrays readily and without any problems. – user1095108 Jan 13 '15 at 13:35
  • What you've linked to is not a way to dynamically extend a bignum at runtime. What you linked to is a method to declare limited- and fixed-size vectors (with lanes) at compile time. You still have exactly the same problems I've listed above, and you still cannot easily detect and propagate carries from the lower to the upper limbs using those extensions. – Iwillnotexist Idonotexist Jan 13 '15 at 13:41
  • Ok, +1, please write an answer and I'll accept. – user1095108 Jan 13 '15 at 13:43
  • 2
    http://stackoverflow.com/questions/12200698/is-it-possible-to-use-sse-v2-to-make-a-128-bit-wide-integer – phuclv Jan 13 '15 at 13:55
  • [I have determined that there is no efficient way to do bignum multiplication with SIMD with current Intel hardware](https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit/28811226#28811226). – Z boson Mar 02 '15 at 15:31
  • @Zboson: [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) - yes if you redesign your storage format to leave some spare bits so you can delay normalization / carry propagation. – Peter Cordes Mar 18 '21 at 21:00

3 Answers

16

I think it may be possible to implement BigNum arithmetic with SIMD efficiently, but not in the way you suggest.

Instead of implementing a single BigNum using a SIMD register (or with an array of SIMD registers) you should process multiple BigNums at once.

Let's consider 128-bit addition. Let a 128-bit integer be defined by a pair of low and high 64-bit values, and let's assume we want to add a 128-bit integer (y_low, y_high) to a 128-bit integer (x_low, x_high). With the scalar 64-bit registers this requires only two instructions

add rax, rdi // x_low  += y_low;
adc rdx, rsi // x_high += y_high + (x_low < y_low);
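
(As a side note, not from the original answer: GCC and Clang generate exactly this add/adc pair from their unsigned __int128 extension, so the scalar baseline is easy to reproduce from C.)

unsigned __int128 add128(unsigned __int128 x, unsigned __int128 y) {
    return x + y;  // compiles to add + adc on x86-64
}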

With SSE/AVX the problem, as others have explained, is that there are no SIMD carry flags. The carry has to be calculated and then added, and this requires a 64-bit unsigned comparison. The only realistic option for this with SSE is the AMD XOP instruction vpcomgtuq:

vpaddq      xmm2, xmm0, xmm2 // x_low  += y_low;
vpcomgtuq   xmm0, xmm0, xmm2 // carry   = old x_low > new x_low (unsigned)
vpaddq      xmm1, xmm1, xmm3 // x_high += y_high
vpsubq      xmm0, xmm1, xmm0 // x_high -= carry mask (mask is -1 on carry, so this adds 1)

This uses four instructions to add two pairs of 128-bit numbers. With the scalar 64-bit registers this requires four instructions as well (two add and two adc).

With AVX2 we can add four pairs of 128-bit numbers at once. But XOP only operates on 128-bit registers, so there is no 256-bit wide 64-bit unsigned comparison. Instead, we can emulate the unsigned comparison a > b by flipping the sign bits and comparing as signed:

__m256i sign64 = _mm256_set1_epi64x(0x8000000000000000L);
__m256i aflip = _mm256_xor_si256(a, sign64);
__m256i bflip = _mm256_xor_si256(b, sign64);
__m256i cmp = _mm256_cmpgt_epi64(aflip,bflip);

The sign64 register can be precomputed so only three instructions are really necessary. Therefore, adding four pairs of 128-bit numbers with AVX2 can be done with six instructions

vpaddq
vpaddq
vpxor
vpxor
vpcmpgtq 
vpsubq

whereas the scalar registers need eight instructions.
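
Putting the pieces together, here is a hedged intrinsics sketch of that six-instruction AVX2 sequence (the function name and in/out layout are mine, not the answer's). It assumes the AoSoA storage described further below, i.e. one vector of four low limbs and one vector of four high limbs per operand.

#include <immintrin.h>

// Add four pairs of 128-bit integers: (x_lo, x_hi) += (y_lo, y_hi), lane-wise.
static inline void add128x4(__m256i& x_lo, __m256i& x_hi, __m256i y_lo, __m256i y_hi) {
    const __m256i sign64 = _mm256_set1_epi64x((long long)0x8000000000000000ULL);
    __m256i sum_lo = _mm256_add_epi64(x_lo, y_lo);                         // x_low += y_low
    __m256i carry  = _mm256_cmpgt_epi64(_mm256_xor_si256(x_lo, sign64),
                                        _mm256_xor_si256(sum_lo, sign64)); // unsigned: old x_low > sum
    __m256i sum_hi = _mm256_add_epi64(x_hi, y_hi);                         // x_high += y_high
    x_hi = _mm256_sub_epi64(sum_hi, carry);                                // carry mask is -1, so subtracting adds 1
    x_lo = sum_lo;
}

With the sign64 constant hoisted out of a loop, this should compile to the six instructions listed above.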

AVX-512 has a single instruction for the 64-bit unsigned comparison, vpcmpuq, which puts its result in a mask register. Therefore, it should be possible to add eight pairs of 128-bit numbers using only four instructions

vpaddq
vpaddq
vpcmpuq
vpsubq

With the scalar registers it would require 16 instructions to add eight pairs of 128-bit numbers.
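
Again as a hedged sketch (my own naming, and untested since it needs AVX-512 hardware): the carries come back as a __mmask8, which can be applied with a merge-masked subtract of all-ones.

#include <immintrin.h>

// Add eight pairs of 128-bit integers with AVX-512F.
static inline void add128x8(__m512i& x_lo, __m512i& x_hi, __m512i y_lo, __m512i y_hi) {
    const __m512i all_ones = _mm512_set1_epi64(-1);
    __m512i sum_lo = _mm512_add_epi64(x_lo, y_lo);                     // vpaddq
    __mmask8 carry = _mm512_cmpgt_epu64_mask(x_lo, sum_lo);            // vpcmpuq: old x_low > sum
    __m512i sum_hi = _mm512_add_epi64(x_hi, y_hi);                     // vpaddq
    x_hi = _mm512_mask_sub_epi64(sum_hi, carry, sum_hi, all_ones);     // masked vpsubq: +1 where carry
    x_lo = sum_lo;
}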

Here is a table summarizing the number of SIMD instructions (nSIMD) and the number of scalar instructions (nscalar) needed to add a given number of pairs (npairs) of 128-bit numbers:

              nSIMD      nscalar     npairs
SSE2 + XOP        4           4           2
AVX2              6           8           4
AVX2 + XOP2       4           8           4
AVX-512           4          16           8

Note that XOP2 does not exist yet and I am only speculating that it may exist at some point.

Note also that to do this efficiently the BigNum arrays need to be stored in array-of-struct-of-arrays (AoSoA) form. For example, using l to mean the lower 64 bits and h to mean the upper 64 bits, an array of 128-bit integers stored as an array of structs like this

lhlhlhlhlhlhlhlh

should instead be stored using an AoSoA like this

SSE2:   llhhllhhllhhllhh
AVX2:   llllhhhhllllhhhh
AVX512: llllllllhhhhhhhh
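
As an illustration of the AVX2 case (the type name is mine, not from the answer), such a block could be declared so that each half loads as a single __m256i:

#include <cstdint>

struct alignas(32) U128x4 {
    uint64_t lo[4];  // llll - the four low limbs
    uint64_t hi[4];  // hhhh - the four high limbs
};
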
Z boson
  • Just as I thought. As the SIMD registers grow ever larger, there will be ever more space for BigNums (and everything else) in them and ever fewer instructions necessary to do arithmetic on them. Also, register memory is the fastest memory available to a programmer. Can you clarify your table a little? – user1095108 Jan 16 '15 at 08:35
  • 1
    Well seen (+1): do slicing. But: 1) 128-bit ints are an extended (big) integer fixed-precision type, not necessarily an arbitrary precision type. 2) There is additional overhead and luck that is not counted here. How often does one have `n` bignums of the same size and number of limbs and want to apply the same operations to each? How to efficiently load them in registers? The number of loads/(de)interleaves/inserts/extracts/stores required means that the breakeven point is higher than your table suggests. `VGATHER` could do, but AVX2 is rare, as 3) is `XOP` outside of AMD's latest offerings. – Iwillnotexist Idonotexist Jan 16 '15 at 09:36
  • @IwillnotexistIdonotexist, the loads can be done efficiently assuming the 128-bit integers are stored in AoSoA form (e.g. with AVX2 lolololohihihihi). What do you mean by "is XOP outside of AMD's latest offerings"? Was that supposed to be a question? Your answer is better in general. I was mostly trying to find a case where SIMD could be useful for a big-integer fixed-precision type. If the precision is not fixed, one could use the largest precision for each of them in a SIMD-width block. – Z boson Jan 16 '15 at 14:52
  • @user1095108, I clarified the table a bit but I'm not sure what you wanted. Please give Iwillnotexist Idonotexist the accepted answer. His answer is more general than mine. My answer is more for special cases which may exist. – Z boson Jan 16 '15 at 15:07
  • @Zboson Nay; I upvoted your (better) answer because I'm ashamed the slicing approach didn't occur to me earlier; Especially since I myself have used it before. It's a valid approach within its limits. W.r.t. XOP and AVX2: I was merely pointing out that they occur rarely enough that most people don't get to use them; I rarely see codepaths versioned AMD/Intel, more commonly they're versioned generic C/SSE2 and on occasion AVX. – Iwillnotexist Idonotexist Jan 16 '15 at 15:08
  • @IwillnotexistIdonotexist, [I looked into 64b * 64b to 128b multiplication](https://stackoverflow.com/questions/28807341/simd-signed-with-unsigned-multiplication-for-64-bit-64-bit-to-128-bit) with SIMD. I don't think it's possible to beat `mul` so you're probably right that there is no efficient SIMD bignum method even with AVX512. – Z boson Mar 02 '15 at 11:45
  • 8
    You're gonna have a hard time believing this, but just for the record, I found a way to (horizontally) vectorize a bignum add. IOW, add two `__m512i` together as if they were 512-bit integers. It can be done in under 10 instructions with a critical path short enough to be throughput bound. IOW, it's probably gonna beat a chain of 8 add-with-carry instructions. Unfortunately, it does require AVX512-DQ to be efficient. The idea is based on the Kogge-Stone Adder. But for now I'm still working out the details (and a proof) and I want to test-drive it first on y-cruncher with real hardware. – Mysticial Jan 21 '17 at 22:08
  • @Mysticial, do you have access to AVX512 hardware now? – Z boson Jan 23 '17 at 10:12
  • @Zboson No. And I most likely won't until Skylake Purley. Btw, I found a way to do the Kogge-Stone add on AVX512F and AVX2 - albeit less efficiently. On both Haswell and Skylake, AVX2 version falls about 10% short of the chained `adc` when adding a pair of 64000-bit integers. (from L1 cache) Skylake has the shorter `adc` latency, but it also has higher int-SIMD throughput. So the effects cancel out. – Mysticial Jan 23 '17 at 16:32
  • I also tried implementing a horizontal multiply with `__m256i` AVX2 using the same idea. But before I finished, I could tell it would be around 4x slower than MULX + ADCX/ADOX. So I stopped there. That one is not gonna be a win even with AVX512. – Mysticial Jan 23 '17 at 16:34
  • @Mysticial, how are you calculating the performance of AVX512 without hardware. Are you just looking up the latency and throughput numbers or do you emulate this somehow? I know you are emulating the AVX512 intrinsics. Did you make a toy AVX512 emulator? – Z boson Jan 24 '17 at 11:35
  • @Mysticial, could you give an answer with your `__m512i` horizontal add solution describing the Kogge-Stone method? I think others would be interested in this. What about multiplication then? You add one 512b number but then how do you do the mult? I guess that's what you mean by your comment above. Is this 512b add useful then? – Z boson Jan 24 '17 at 11:38
  • "how are you calculating the performance of AVX512 without hardware" - I'm not, but the instructions involved are cheap. For a vector-add with carry-in/out: The AVX2 version needs 13 instructions + LUT. The critical path is 3 cycles. It is throughput-bound and only 10% slower than a chain of 4 `adcs` + 4 loads + 4 stores. The AVX512-DQ version needs 9 instructions all of which are cheap. The critical path is 3 mask instructions. What are the chances that it will beat 8 `adc`s + 8 loads + 8 stores? – Mysticial Jan 24 '17 at 15:04
  • I'm less sure about KNL since it has long bypass delays and less OOE capability. For multiplication, it's just adding up a bunch of partial product rows. So the problem reduces to additions. The problem is that 32 x 32 -> 64 SIMD multiply requires too many instructions to be competitive with MULX. – Mysticial Jan 24 '17 at 15:07
  • 9
    If you're interested, I've written a blog about the approach [here](http://www.numberworld.org/y-cruncher/internals/addition.html#ks_add). I was eventually able to (informally) prove it correct. But the more I investigate it, the less useful it gets. – Mysticial Jan 29 '17 at 22:28
  • @Mysticial, I had not read your blog before (at least not the description of algorithms). That's a great resource, thanks! – Z boson Jan 30 '17 at 09:30
  • That page is new. So you couldn't have read it. But the rest of the site has been there for years. – Mysticial Jan 30 '17 at 21:06
6

Moved from comment above

It is possible to do this but it is not done because it is not particularly convenient to implement bignums in vector registers.

For the simple task of addition, it is trivial to use the Carry flag of the x86 EFLAGS/RFLAGS register to propagate the addition's carries from the lowest "limb" up (to use the GMP terminology), and to loop over an arbitrary number of limbs laid out in an array.
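
A minimal sketch of that scalar loop (not the answer's code; the function name is illustrative), using the _addcarry_u64 intrinsic that GCC, Clang and MSVC map onto adc:

#include <immintrin.h>
#include <cstddef>

// dst = a + b, where each operand is an n-limb little-endian bignum.
void bignum_add(unsigned long long* dst, const unsigned long long* a,
                const unsigned long long* b, std::size_t n) {
    unsigned char carry = 0;
    for (std::size_t i = 0; i < n; ++i)
        carry = _addcarry_u64(carry, a[i], b[i], &dst[i]);  // carry propagates through the flag
}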

Contrariwise, the lanes of SSE/AVX registers do not have carry flags, which means overflow must be detected in a different way, involving comparisons to detect wraparound, which is more computationally intensive. Moreover, if an overflow is detected in one limb, it has to be propagated by an ugly shuffle "upwards" and then added, and this addition may cause another overflow and carry-over, up to N-1 times for an N-limb bignum. Then, once a sum grows a bignum beyond 128/256 bits (or beyond 128 bits x the number of registers), you'd have to move it to an array anyway.

Therefore, much special-case code would be needed, and it would not be any faster (in fact, much slower), just for addition. Imagine what it would take for multiplication? Or, gasp, division?

Iwillnotexist Idonotexist
  • You could do the initial add of words in parallel, calc the carries (compare) in parallel and add them in parallel. Doesn't sound too bad, does it? – user1095108 Jan 07 '23 at 22:17
3

It's possible, but not practical.

As I said in the other answer, there's no carry flag in AVX/SSE so it's impossible to do addition and subtraction efficiently. And to do multiplications you'll need a lot of shuffling to get the widening multiply result in the desired position.

If you are allowed to target the newer Haswell/Broadwell microarchitectures, the solution would be MULX in BMI2 and ADOX/ADCX in ADX. You can read about them here.
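
A hedged illustration of what those instructions buy you (the helper name is mine; _mulx_u64 and _addcarry_u64 are real Intel intrinsics, compiled with BMI2 enabled): MULX gives the full 64x64 -> 128 product without touching the flags, and the carry intrinsics chain the partial products, which is the basic building block of the multiplication alluded to above.

#include <immintrin.h>

// Accumulate a*b into a two-limb (128-bit) accumulator: acc += a * b.
// (Carry out of the top limb is ignored in this sketch.)
inline void mul_accum(unsigned long long a, unsigned long long b, unsigned long long acc[2]) {
    unsigned long long hi;
    unsigned long long lo = _mulx_u64(a, b, &hi);             // MULX: flags untouched
    unsigned char c = _addcarry_u64(0, acc[0], lo, &acc[0]);  // add low limb
    _addcarry_u64(c, acc[1], hi, &acc[1]);                    // propagate into high limb
}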

phuclv
  • Can you please answer my other [question](http://stackoverflow.com/questions/27929402/can-someone-explain-this-sse-bignum-comparison), related to your answer? – user1095108 Jan 13 '15 at 19:18