10

As far as I understand it, BigInts are usually implemented in most programming languages as arrays containing digits, where, when adding two of them, each digit is added one after another like we know it from school, e.g.:

 246
 816
 * *
----
1062

Where * marks that there was an overflow. I learned it this way at school, and all the BigInt adding functions I've implemented work similarly to the example above.
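Here's a minimal sketch of that schoolbook approach in C, just to make it concrete (the function name and the base-256 "digit" size are my own illustrative choices, not any particular library's):

#include <stdint.h>
#include <stddef.h>

/* dst = a + b, where a, b and dst are arrays of 'len' base-256 digits,
   least-significant digit first.  The returned value is the final carry
   (the '*' overflowing out of the last column). */
static uint8_t bigint_add_digits(uint8_t *dst, const uint8_t *a,
                                 const uint8_t *b, size_t len)
{
    unsigned carry = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned sum = a[i] + b[i] + carry;  /* at most 255 + 255 + 1 = 511 */
        dst[i] = (uint8_t)sum;               /* keep the low 8 bits          */
        carry  = sum >> 8;                   /* '*': carry into the next digit */
    }
    return (uint8_t)carry;
}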

So we all know that our processors can only natively manage integers from 0 to 2^32 - 1 or 2^64 - 1, depending on register width.

That means that most scripting languages, in order to be high-level and offer arithmetic with big integers, have to implement/use BigInt libraries that work with integers as arrays like the one above. But of course this means that they'll be far slower than the processor's native arithmetic.

So what I've asked myself is:

  • Why doesn't my processor have a built-in BigInt function?

It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.

Seems like a fine idea to me, so why isn't there something like that?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
ol.
  • 103
  • 5
  • 5
    BigInts aren't implemented with strings, they're implemented with arrays of bytes. If you think of a byte array as a string in base-256 notation, though, then what you said is correct. – JSBձոգչ Apr 12 '10 at 20:02
  • I think I wrote about strings because the last BigInt lib I implemented was in my own stupid language which doesn't even have arrays. Of course I meant that. – ol. Apr 12 '10 at 20:04
  • 9
    Why isn't there a dynamic pony and unicorn drawing routine in the processor! – Paul Tomblin Apr 12 '10 at 20:05
  • @Paul: Some time until the next 1st of April now. Let's hope they don't accept people from meta at Intel until then *shocked* No seriously, why isn't there built-in BigInt support? ^^ – ol. Apr 12 '10 at 20:06
  • In Python this is implemented in the [Decimal](http://docs.python.org/library/decimal.html) module. – smci Jul 14 '11 at 23:51
  • 1
    Interest in software libraries for bigint math was much more common back when the CPU registers were only 8 bits wide. The only way to do significant math then was via software libraries. Now with 64 bit integer registers and hardware floating point almost everywhere, using software libraries for simple math is more of a curiosity, not a critical need. – dthorpe Jan 09 '12 at 17:57
  • Another reason why hardware makers are not in a rush to implement string math operations in hardware (again) is performance. The process you describe (read byte, add, write result) would be at least 32 times slower than reading a 32 bit integer, adding, and writing the 32 bit result back to memory. The reality is the string math support would have to do a lot more than that because you can't assume the CPU can store the entire intermediate result in a CPU register. And then there are things like division that are very difficult to do one byte at a time. – dthorpe Jan 09 '12 at 18:02
  • 1
    Once you get to larger non-fixed sized operations (especially multiplication/division), there are many possible implementation choices each with their own tradeoff (and the differences are large). Just hardcoding this into the processor is like using a certain version of GMP without an option for upgrade or change. Also, libraries like GMP are quite large, while chip manufacturers like their operations to be relatively simple, and verifiable. – Realz Slaw Oct 07 '12 at 04:27
  • 1
    @dthorpe New instructions are being introduced on Intel® Architecture Processors to enable fast implementations of large integer arithmetic. – Charles Okwuagwu Aug 25 '15 at 00:29

8 Answers

9

There are simply too many issues: the processor would have to deal with a ton of stuff which isn't its job.

Suppose that the processor DID have that feature. We can work out a system where we know how many bytes are used by a given BigInt - just use the same principle as most string libraries and record the length.
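In other words, a representation roughly like this (purely a hypothetical sketch; the field names are made up):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    size_t    len;    /* how many chunks are in use - the recorded length */
    uint64_t *chunks; /* the digits/chunks, least-significant first       */
} bigint;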

But what would happen if the result of a BigInt operation exceeded the amount of space reserved?

There are two options:

  1. It'll wrap around inside the space it does have or
  2. It'll use more memory.

The thing is, if it did 1), then it's useless - you'd have to know how much space was required beforehand, and that's part of the reason you'd want to use a BigInt - so you're not limited by those things.

If it did 2), then it'll have to allocate that memory somehow. Memory allocation is not done in the same way across OSes, but even if it were, it would still have to update all pointers to the old value. How would it know what were pointers to the value, and what were simply integer values containing the same value as the memory address in question?

zx485
  • 28,498
  • 28
  • 50
  • 59
Michael Madsen
  • 54,231
  • 8
  • 72
  • 83
  • 1
    Processors *do* support add-with-carry, and multiply that produces the full low and high result (64b * 64b => 128b). Software has to use these in a loop, but that doesn't make it slower than a microcode loop (like if `rep adcq` existed to do src += dst, taking 2 pointers and a length as input). I added an answer with more details. – Peter Cordes Jun 17 '16 at 20:57
8

Binary Coded Decimal is a form of string math. The Intel x86 processors have opcodes for direct BCD arithmetic operations.
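As a rough illustration of what that means (plain C, not what the CPU does internally; the function name is made up), adding two packed-BCD bytes - each holding two decimal digits - looks like this; on x86 the same effect comes from an ADD followed by the DAA adjust instruction:

#include <stdint.h>

/* Add two packed-BCD bytes (two decimal digits each).  *carry receives the
   decimal carry out of the high digit. */
static uint8_t bcd_add_byte(uint8_t a, uint8_t b, unsigned *carry)
{
    unsigned lo = (a & 0x0F) + (b & 0x0F);          /* add the low digits         */
    unsigned hi = (a >> 4) + (b >> 4) + (lo > 9);   /* add high digits plus carry */
    *carry = (hi > 9);                              /* decimal carry out          */
    return (uint8_t)(((hi % 10) << 4) | (lo % 10)); /* repack as two BCD digits   */
}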

dthorpe
  • 35,318
  • 5
  • 75
  • 119
  • Which hardly anyone uses anyway. – Michael Myers Apr 12 '10 at 20:18
  • @mmyers - wasn't there a spate of computers showing the year as 2016 this year because somebody in the Windows team screwed up the BCD to decimal conversion? – Paul Tomblin Apr 12 '10 at 20:40
  • 2
    http://teck.in/year-2016-y2k16-computer-bug-hits-texting-in-us-and-banking-in-australia.html – Paul Tomblin Apr 12 '10 at 20:42
  • 3
    @mmyers: OP wasn't asking if it was popular. :P BCD is/was used in financial applications for fixed-point math to avoid loss of precision / roundoff errors. – dthorpe Apr 12 '10 at 21:09
  • @dthorpe: Yes, but if BCD isn't used very much, why would processor makers implement another alternative? At least that's what I would think. – Michael Myers Apr 12 '10 at 21:24
  • @mmyers: True, BCD was not as popular as Intel had hoped, and as you suggest probably casts a shadow over anybody else heading down that path. – dthorpe Apr 12 '10 at 21:47
  • 2
    +1 for the mention. Also, interestingly (or not), IEEE-754 defines both Binary and Decimal radix variants. http://en.wikipedia.org/wiki/IEEE_754-2008. –  Apr 12 '10 at 22:37
4

It would work like any other BigInt library, only (a lot) faster and at a lower level: Processor fetches one digit from the cache/RAM, adds it, and writes the result back again.

Almost all CPUs do have this built-in. You have to use a software loop around the relevant instructions, but that doesn't make it slower if the loop is efficient. (That's non-trivial on x86, due to partial-flag stalls, see below)

e.g. if x86 provided rep adc to do src += dst, taking 2 pointers and a length as input (like rep movsd to memcpy), it would still be implemented as a loop in microcode.

It would be possible for a 32bit x86 CPU to have an internal implementation of rep adc that used 64bit adds internally, since 32bit CPUs probably still have a 64bit adder. However, 64bit CPUs probably don't have a single-cycle latency 128b adder. So I don't expect that having a special instruction for this would give a speedup over what you can do with software, at least on a 64bit CPU.

Maybe a special wide-add instruction would be useful on a low-power low-clock-speed CPU where a really wide adder with single-cycle latency is possible.


The x86 instructions you're looking for are:

  • adc: add with carry (dst += src + CF, setting CF on carry-out)
  • sbb: subtract with borrow, the matching instruction for subtraction
  • mul: full unsigned multiply, producing the double-width low and high halves of the result
  • div: divide a double-width dividend by a single-width divisor

Of course, adc works on binary integers, not single decimal digits. x86 can adc in 8, 16, 32, or 64bit chunks, unlike RISC CPUs which typically only adc at full register width. (GMP calls each chunk a "limb"). (x86 has some instructions for working with BCD or ASCII, but those instructions were dropped for x86-64.)

imul / idiv are the signed equivalents. Add works the same for signed 2's complement as for unsigned, so there's no separate instruction; just look at the relevant flags to detect signed vs. unsigned overflow. But for adc, remember that only the most-significant chunk has the sign bit; the rest are essentially unsigned.

ADX and BMI/BMI2 add some instructions like mulx: full-multiply without touching flags, so it can be interleaved with an adc chain to create more instruction-level parallelism for superscalar CPUs to exploit.

In x86, adc is even available with a memory destination, so it performs exactly like you describe: one instruction triggers the whole read / modify / write of a chunk of the BigInteger. See example below.


Most high-level languages (including C/C++) don't expose a "carry" flag

Usually there aren't add-with-carry intrinsics available directly in C. BigInteger libraries usually have to be written in asm for good performance.

However, Intel actually has defined intrinsics for adc (and adcx / adox).

unsigned char _addcarry_u64 (unsigned char c_in, unsigned __int64 a, \
                             unsigned __int64 b, unsigned __int64 * out);

So the carry result is handled as an unsigned char in C. For the _addcarryx_u64 intrinsic, it's up to the compiler to analyse the dependency chains and decide which adds to do with adcx and which to do with adox, and how to string them together to implement the C source.

IDK what the point of the _addcarryx intrinsics is, instead of just having the compiler use adcx/adox for the existing _addcarry_u64 intrinsic when there are parallel dep chains that can take advantage of it. Maybe some compilers aren't smart enough for that.
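For what it's worth, a limb-by-limb add using that intrinsic can be sketched like this (the helper name and the little-endian limb layout are my own assumptions; it compiles with GCC/Clang/MSVC targeting x86-64, though some GCC versions want <x86intrin.h> instead):

#include <immintrin.h>   /* _addcarry_u64 */
#include <stddef.h>

/* dst += src, both numbers stored as 'len' 64-bit limbs, least-significant
   limb first.  Returns the carry out of the top limb. */
static unsigned char bigint_add_limbs(unsigned long long *dst,
                                      const unsigned long long *src, size_t len)
{
    unsigned char carry = 0;
    for (size_t i = 0; i < len; i++)
        carry = _addcarry_u64(carry, dst[i], src[i], &dst[i]);
    return carry;
}

Whether a given compiler turns that into a tight adc loop or keeps materializing the carry in a register varies, which is part of why hand-written asm like the example below still tends to win.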


Here's an example of a BigInteger add function, in NASM syntax:

;;;;;;;;;;;; UNTESTED ;;;;;;;;;;;;
; C prototype:
; void bigint_add(uint64_t *dst, uint64_t *src, size_t len);
;   len is an element-count, not byte-count

global bigint_add
bigint_add:   ; AMD64 SysV ABI: dst=rdi, src=rsi, len=rdx

                              ; set up for using dst as an index for src
    sub    rsi, rdi           ;  rsi -= dst.  So orig_src = rsi + rdi

    clc                           ;  CF=0 to set up for the first adc
           ; alternative: peel the first iteration and use add instead of adc

.loop:
    mov    rax, [rsi + rdi]   ; load from src
    adc    rax, [rdi]         ;  <================= ADC with dst
    mov    [rdi], rax         ; store back into dst.  This appears to be cheaper than  adc  [rdi], rax  since we're using a non-indexed addressing mode that can micro-fuse

    lea    rdi,  [rdi + 8]    ; pointer-increment without clobbering CF
    dec    rdx                ; preserves CF
    jnz    .loop              ; loop while(--len)

    ret

On older CPUs, especially pre-Sandybridge, adc will cause a partial-flag stall when reading CF after dec writes other flags. Looping with a different instruction will help for old CPUs which stall while merging partial-flag writes, but not be worth it on SnB-family.

Loop unrolling is also very important for adc loops. adc decodes to multiple uops on Intel, so loop overhead is a problem, esp if you have extra loop overhead from avoiding partial-flag stalls. If len is a small known constant, a fully-unrolled loop is usually good. (e.g. compilers just use add/adc to do a uint128_t on x86-64.)
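As a concrete illustration of that last point, GCC and Clang compile this 128-bit add to just an add / adc pair:

unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;   /* lowers to add + adc on x86-64 */
}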

adc with a memory destination appears not to be the most efficient way, since the pointer-difference trick lets us use a single-register addressing mode for dst. (Without that trick, memory-operands wouldn't micro-fuse).

According to Agner Fog's instruction tables for Haswell and Skylake, adc r,m is 2 uops (fused-domain) with one per 1 clock throughput, while adc m, r/i is 4 uops (fused-domain), with one per 2 clocks throughput. Apparently it doesn't help that Broadwell/Skylake run adc r,r/i as a single-uop instruction (taking advantage of ability to have uops with 3 input dependencies, introduced with Haswell for FMA). I'm also not 100% sure Agner's results are right here, since he didn't realize that SnB-family CPUs only micro-fuse indexed addressing modes in the decoders / uop-cache, not in the out-of-order core.

Anyway, this simple not-unrolled-at-all loop is 6 uops, and should run at one iteration per 2 cycles on Intel SnB-family CPUs. Even if it takes an extra uop for partial-flag merging, that's still easily less than the 8 fused-domain uops that can be issued in 2 cycles.

Some minor unrolling could get this close to 1 adc per cycle, since that part is only 4 uops. However, 2 loads and one store per cycle isn't quite sustainable.


Extended-precision multiply and divide are also possible, taking advantage of the widening / narrowing multiply and divide instructions. It's much more complicated, of course, due to the nature of multiplication.
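For a flavour of how the widening multiply gets used, here's a sketch in the style of GMP's mpn_mul_1 (multiply a limb vector by a single limb), written with the GCC/Clang __int128 extension instead of asm; the function name is just illustrative:

#include <stdint.h>
#include <stddef.h>

/* dst = src * k, where src holds 'len' 64-bit limbs, least-significant first.
   Returns the final carry-out limb.  Each 64x64 -> 128-bit product maps to a
   single mul (or mulx) instruction on x86-64. */
static uint64_t bigint_mul_1(uint64_t *dst, const uint64_t *src,
                             size_t len, uint64_t k)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * k + carry;
        dst[i] = (uint64_t)p;           /* low half becomes this limb        */
        carry  = (uint64_t)(p >> 64);   /* high half carries to the next one */
    }
    return carry;
}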


It's not really helpful to use SSE for add-with-carry, or AFAIK any other BigInteger operations.

If you're designing a new instruction-set, you can do BigInteger adds in vector registers if you have the right instructions to efficiently generate and propagate carry. There's been some back-and-forth discussion on the costs and benefits of supporting carry flags in hardware, vs. having software generate carry-out like MIPS does: compare to detect unsigned wraparound, putting the result in another integer register.
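That compare-to-detect-wraparound approach boils down to something like this (a minimal C sketch, not actual MIPS code):

#include <stdint.h>

/* Add two limbs on an ISA with no carry flag: the wrapped sum is smaller than
   either input exactly when an unsigned carry-out occurred (what MIPS would
   compute with an sltu compare into another register). */
static uint64_t add_carry_out(uint64_t a, uint64_t b, uint64_t *carry_out)
{
    uint64_t sum = a + b;     /* wraps modulo 2^64 */
    *carry_out = (sum < a);   /* 1 iff carry out   */
    return sum;
}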

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Update: [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973) - yes if you arrange your data format to leave some spare bits, allowing you to delay normalization. – Peter Cordes Apr 06 '21 at 15:49
3

Suppose the result of the multiplication needed 3 times the space (memory) to be stored - where would the processor store that result? How would users of that result, including all pointers to it, know that its size suddenly changed? Changing the size might require relocating it in memory, because extending the current location would clash with another variable.

This would create a lot of interaction between the processor, OS memory management, and the compiler that would be hard to make both general and efficient.

Managing the memory of application types is not something the processor should do.

leeeroy
  • 11,216
  • 17
  • 52
  • 54
  • 2
    IIRC, a multiplication result will only ever require a number of bits equal to the sum of the widths of the operands. So assuming we're multiplying two equal-sized variables, you'll never need more than 2X the space for the result. – rmeador Apr 12 '10 at 21:33
  • Sure, but you don't know that size until runtime, which is the hard issue. – leeeroy Apr 13 '10 at 17:06
  • @leeeroy you don't know the size of pretty much anything until runtime. What's the hard issue about it? Just count the bytes and allocate enough memory to hold the result of such an operation. Because if what you say had any meaning, we shouldn't use computers, because at some point they may not be able to hold our data. Nowadays with many GBs of memory you will hardly hit the limit with BCD. – Pablo Ariel Oct 22 '18 at 20:28
1

As I see it, the main reason for not including BigInt support in modern processors is the desire to keep the ISA small, leaving only instructions that are fetched, decoded and executed at full throttle. By the way, the x86 family of processors has a set of instructions that makes writing a BigInt library a single day's work. Another reason, I think, is cost: it's much more efficient to save some space on the wafer by dropping operations that can easily be implemented at a higher level.

Maksim Skurydzin
  • 10,301
  • 8
  • 40
  • 53
  • 1
    I think a processor that handles x86/x87/MMX/SSE/SSE2/SSE3/SSSE3 wasn't necessarily designed with the goal of minimizing the instruction set. – Jeffrey L Whitledge Apr 12 '10 at 20:20
1

It seems Intel is adding (or has added, as of the time of this post in 2015) new instruction support for large integer arithmetic.

New instructions are being introduced on Intel® Architecture Processors to enable fast implementations of large integer arithmetic. Large Integer Arithmetic is widely used in multi-precision libraries for high-performance technical computing, as well as for public key cryptography (e.g., RSA). In this paper, we describe the critical operations required in large integer arithmetic and their efficient implementations using the new instructions.

http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html

Charles Okwuagwu
  • 10,538
  • 16
  • 87
  • 157
  • That's the ADX instruction-set extensions ([`mulx`](http://www.felixcloutier.com/x86/MULX.html) / [`adcx`](http://www.felixcloutier.com/x86/ADCX.html) / [`adox`](http://www.felixcloutier.com/x86/ADOX.html)), added in Broadwell (indicated by the ADX feature bit in CPUID). Regular `adc` and `mul` already existed; these new instructions simply increase parallelism if you can keep two add-with-carry instruction-streams in flight at once, with multiplies interleaved. These instructions only use a single bit in EFLAGS for their carry in/out, without clobbering others. – Peter Cordes Jun 17 '16 at 19:20
0

There are so many instructions and functionalities jockeying for area on a CPU chip that in the end, those that are used more often / deemed more useful push out those that aren't. The instructions necessary for implementing BigInt functionality are there, and the math is straightforward.

Olof Forshell
  • 3,169
  • 22
  • 28
-1

BigInt: the fundamental function required is unsigned integer multiplication with the previous high-order word added in. I wrote one in Intel 16-bit assembler, then 32-bit... C code is usually fast enough, i.e. for BigInt you use a software library. CPUs (and GPUs) are not designed with unsigned integer arithmetic as a top priority.

If you want to write your own BigInt...

Division is done via Knuth's Vol. 2 algorithm (it's a bunch of multiplies and subtracts, with some tricky add-backs).

Add-with-carry and subtract are easier, etc. etc.

I just posted this on Intel's forums ("SSE4: is there a BigInt library?"):

i5 2410M processor: I suppose it can NOT use AVX [AVX is only on very recent Intel CPUs] but it can use SSE4.2.

Is there a BigInt library for SSE? I guess I am looking for something that implements unsigned integer

PMULUDQ (with 128-bit operands): PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)

and does the carries.

It's a laptop, so I can't buy an NVIDIA GTX 550, which isn't so grand on unsigned ints, I hear.