
Rust has 128-bit integers, denoted by the data type i128 (and u128 for unsigned integers):

let a: i128 = 170141183460469231731687303715884105727;

How does Rust make these i128 values work on a 64-bit system? For example, how does it do arithmetic on them?

Since, as far as I know, the value cannot fit in one register of an x86-64 CPU, does the compiler somehow use two registers for one i128 value? Or does it instead use some kind of big-integer struct to represent them?

ruohola
    exactly the same way 64-bit types are stored in 32-bit computers or 32-bit types are stored in 16-bit computers [Is it ok to use 64bit integers in a 32bit application?](https://stackoverflow.com/q/28977587/995714), [How is 64-bit math accomplished on a 32-bit machine?](https://stackoverflow.com/q/3190143/995714), [Do I need to have 64 bit Processor to use 64 bit data type](https://stackoverflow.com/q/5530906/995714), [128 bit integer with c](https://stackoverflow.com/q/6414605/995714), [How does a 32 bit processor support 64 bit integers?](https://stackoverflow.com/q/23038451/995714) – phuclv Aug 04 '19 at 03:13
    How does a two-digit integer work when you only have 10 fingers? – Jörg W Mittag Aug 04 '19 at 07:00
    @JorgWMittag: Ah - the old "two digit number with only ten fingers" ploy. Heh-heh. Thought you could fool me with that old one, eh? Well, my friend, as any second-grader could tell you - THAT is what the toes are for! ([With abject apologies to Peter Sellers...and Lady Lytton :-)](https://www.youtube.com/watch?v=O-LGISfFS4Y) – Bob Jarvis - Слава Україні Aug 04 '19 at 16:01
  • Just to state the obvious, if you talk 64Bit systems you mean address bus width, there might be bigger registers than that. Like the 512bit register extensions on x64. – eckes Aug 05 '19 at 03:34
  • If you think this is confusing, I wonder what you'll think of infinite precision number types... – Wassinger Aug 05 '19 at 03:48
    FWIW most x86 machines have some special 128-bit or larger registers for SIMD operations. See https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions Edit: I somehow missed @eckes's comment – Ryan1729 Aug 05 '19 at 06:27
    @phuclv You missed how "16-bit types are stored in 8-bit computers" :-) – Martin Bonner supports Monica Aug 05 '19 at 12:46
    @eckes Although "8-bit computer" referred to the data bus - the address bus was usually larger, and also "16-bit computer" (although a 16-bit computer often had a 16-bit address bus too). – Martin Bonner supports Monica Aug 05 '19 at 12:48
    @JörgWMittag Nah, computer scientists count in binary by lowering or extending individual fingers. And now, 132 y'all, I'm going home ;-D – Marco13 Aug 05 '19 at 13:30
    @Ryan1729 SIMD is not supposed for 128-bit arithmetic. They can only operate on multiple smaller integers/floating-point values. See [practical BigNum AVX/SSE possible?](https://stackoverflow.com/q/27923192/995714), [Is there hardware support for 128bit integers in modern processors?](https://stackoverflow.com/q/34234407/995714) – phuclv Aug 06 '19 at 16:26
    [How did 8 bit processors perform 16 bit arithmetic?](https://retrocomputing.stackexchange.com/questions/6640/how-did-8-bit-processors-perform-16-bit-arithmetic) – Reversed Engineer Aug 06 '19 at 16:39
    @phuclv There are a limited amount of operations for 128 ints on many x86-64 machines, including addition and subtraction, as well as bitwise operations. This is mentioned in a comment on one of the answers you mentioned https://stackoverflow.com/questions/34234407#comment56228798_34239917 But admittedly, 128-bit multiplication, or anything more complicated than that does not seem to be supported even on newer processors. At least I couldn’t find a 128-bit mul on this page: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ – Ryan1729 Aug 07 '19 at 06:32
    @Ryan1729 those are integer instructions and **not** SIMD instructions. Currently no architecture I know supports integer registers wider than 64 bits. SIMD is "Single instruction, **multiple data**" so they're used for processing multiple data at once and not quite useful for big int operations – phuclv Aug 07 '19 at 06:43
  • @phuclv It is true that SSE etc. are not designed for big int operations and I am struggling to find the actual instructions mentioned in the comment I linked to, unless they meant that it would not be that difficult to do the carry yourself after adding a pair of 64 bit ints with an SSE2 `_mm_add_epi64` instruction. But, technically, the SSE XMM registers are 128-bits wide, they just don't have the same set of operations available on them as "real" registers, and as you said, those operations are not particularly useful for operations on numbers larger than 64 bits. – Ryan1729 Aug 07 '19 at 07:18

4 Answers

201

All Rust's integer types are compiled to LLVM integers. The LLVM abstract machine allows integers of any bit width from 1 to 2^23 - 1.* LLVM instructions typically work on integers of any size.

Obviously, there aren't many 8388607-bit architectures out there, so when the code is compiled to native machine code, LLVM has to decide how to implement it. The semantics of an abstract instruction like add are defined by LLVM itself. Typically, abstract instructions that have a single-instruction equivalent in native code will be compiled to that native instruction, while those that don't will be emulated, possibly with multiple native instructions. mcarton's answer demonstrates how LLVM compiles both native and emulated instructions.

(This doesn't only apply to integers that are larger than the native machine can support, but also to those that are smaller. For example, modern architectures might not support native 8-bit arithmetic, so an add instruction on two i8s may be emulated with a wider instruction, the extra bits discarded.)
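None of this emulation is visible at the Rust source level: a u128 addition that carries across the 64-bit boundary behaves like any other addition. A minimal demonstration (my own example, not from the answer):

```rust
fn main() {
    // 2^64 - 1 is the largest value that fits in one 64-bit register;
    // adding 1 forces a carry into the emulated high half.
    let a: u128 = u64::MAX as u128;
    let b = a + 1;
    assert_eq!(b, 1u128 << 64);
    assert_eq!(b as u64, 0);         // the low 64 bits wrapped to zero
    assert_eq!((b >> 64) as u64, 1); // the carry landed in the high 64 bits
}
```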

Does the compiler somehow use 2 registers for one i128 value? Or are they using some kind of big integer struct to represent them?

At the level of LLVM IR, the answer is neither: i128 fits in a single register, just like every other single-valued type. On the other hand, once translated to machine code, there isn't really a difference between the two, because structs may be decomposed into registers just like integers. When doing arithmetic, though, it's a pretty safe bet that LLVM will just load the whole thing into two registers.


* However, not all LLVM backends are created equal. This answer relates to x86-64. I understand that backend support for sizes larger than 128 bits and for non-power-of-two sizes is spotty (which may partly explain why Rust only exposes 8-, 16-, 32-, 64-, and 128-bit integers). According to est31 on Reddit, rustc implements 128-bit integers in software when targeting a backend that doesn't support them natively.

trent
    Huh, I wonder why it's 2^23 instead of the more typical 2^32 (well, speaking broadly in terms of how often those numbers appear, not in terms of maximum bit widths of integers supported by compiler backends...) – Nic Aug 04 '19 at 14:17
    @NicHartley Some of LLVM's baseclasses have a field where subclasses can store data. For the `Type` class this means there are 8 bits to store what kind of type it is (function, block, integer, ...) and 24 bits for subclass data. The `IntegerType` class then uses those 24 bits to store the size, allowing instances to neatly fit in 32 bits! – Todd Sewell Aug 04 '19 at 21:01
74

The compiler will store these in multiple registers and use multiple instructions to do arithmetic on those values if needed. Most ISAs have an add-with-carry instruction like x86's adc which makes it fairly efficient to do extended-precision integer add/sub.

For example, given

fn main() {
    let a = 42u128;
    let b = a + 1337;
}

the compiler generates the following when compiling for x86-64 without optimization:
(comments added by @PeterCordes)

playground::main:
    sub rsp, 56
    mov qword ptr [rsp + 32], 0
    mov qword ptr [rsp + 24], 42         # store 128-bit 0:42 on the stack
                                         # little-endian = low half at lower address

    mov rax, qword ptr [rsp + 24]
    mov rcx, qword ptr [rsp + 32]        # reload it to registers

    add rax, 1337                        # add 1337 to the low half
    adc rcx, 0                           # propagate carry to the high half. 1337u128 >> 64 = 0

    setb    dl                           # save carry-out (setb is an alias for setc)
    mov rsi, rax
    test    dl, 1                        # check carry-out (to detect overflow)
    mov qword ptr [rsp + 16], rax        # store the low half result
    mov qword ptr [rsp + 8], rsi         # store another copy of the low half
    mov qword ptr [rsp], rcx             # store the high half
                             # These are temporary copies of the halves; probably the high half at lower address isn't intentional
    jne .LBB8_2                       # jump if 128-bit add overflowed (to another not-shown block of code after the ret, I think)

    mov rax, qword ptr [rsp + 16]
    mov qword ptr [rsp + 40], rax     # copy low half to RSP+40
    mov rcx, qword ptr [rsp]
    mov qword ptr [rsp + 48], rcx     # copy high half to RSP+48
                  # This is the actual b, in normal little-endian order, forming a u128 at RSP+40
    add rsp, 56
    ret                               # with retval in EAX/RAX = low half result

where you can see that the value 42 is stored in rax and rcx.

(editor's note: x86-64 C calling conventions return 128-bit integers in RDX:RAX. But this main doesn't return a value at all. All the redundant copying is purely from disabling optimization, and that Rust actually checks for overflow in debug mode.)
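The add/adc pair can be mimicked in Rust source to make the carry propagation explicit. This is only a sketch with a hypothetical helper name, not what the compiler actually emits:

```rust
// Sketch: 128-bit addition from 64-bit halves, mirroring the
// x86-64 add + adc instruction pair shown above.
fn add128(a_lo: u64, a_hi: u64, b_lo: u64, b_hi: u64) -> (u64, u64) {
    let (lo, carry) = a_lo.overflowing_add(b_lo); // like `add`: may produce a carry-out
    let hi = a_hi.wrapping_add(b_hi).wrapping_add(carry as u64); // like `adc`: adds the carry
    (lo, hi)
}

fn main() {
    // (2^64 - 1) + 1 = 2^64: the low half wraps to 0 and the carry
    // lands in the high half.
    let (lo, hi) = add128(u64::MAX, 0, 1, 0);
    assert_eq!((lo, hi), (0, 1));
}
```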

For comparison, here is the asm for Rust 64-bit integers on x86-64 where no add-with-carry is needed, just a single register or stack-slot for each value.

playground::main:
    sub rsp, 24
    mov qword ptr [rsp + 8], 42           # store
    mov rax, qword ptr [rsp + 8]          # reload
    add rax, 1337                         # add
    setb    cl
    test    cl, 1                         # check for carry-out (overflow)
    mov qword ptr [rsp], rax              # store the result
    jne .LBB8_2                           # branch on non-zero carry-out

    mov rax, qword ptr [rsp]              # reload the result
    mov qword ptr [rsp + 16], rax         # and copy it (to b)
    add rsp, 24
    ret

.LBB8_2:
    call panic function because of integer overflow

The setb / test is still totally redundant: jc (jump if CF=1) would work just fine.

With optimization enabled, the Rust compiler doesn't check for overflow so + works like .wrapping_add().
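The standard library also lets you choose an overflow behavior explicitly, independent of build mode:

```rust
fn main() {
    let max = i128::MAX;
    // `max + 1` would panic in a debug build; the named methods are unambiguous:
    assert_eq!(max.wrapping_add(1), i128::MIN);   // two's-complement wrap
    assert_eq!(max.checked_add(1), None);         // overflow reported as None
    assert_eq!(max.saturating_add(1), i128::MAX); // clamp at the maximum
}
```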

trent
mcarton
  • Is the top example 128 bit addition on a 32 bit machine? – Simd Aug 04 '19 at 07:17
    @Anush No, rax/rsp/... are 64-bit registers. Each 128-bit number is stored in two registers/memory locations, which results in the two 64-bit additions. – ManfP Aug 04 '19 at 13:49
    @Anush: no, it's just using so many instructions because it's compiled with optimization disabled. You'd see *much* simpler code (like just the add/adc) if you compiled a function that took two `u128` args and returned a value (like this https://godbolt.org/z/6JBza0), instead of disabling optimization to stop the compiler from doing constant-propagation on compile-time-constant args. – Peter Cordes Aug 04 '19 at 20:12
  • Cheers :) I'd read that the LLVM-IR Rustcc sends to the optimizer is sometimes over-complicated and in need of a lot of optimization. I think this is an example, and the extra copying in the un-optimized x86-64 asm are probably a direct result of that. :/ (specifically, storing the high/low half to tmp locations before copying them to `b`, and the extra copy of the low half to another register and then to another stack slot.) – Peter Cordes Aug 04 '19 at 20:25
  • Actually, I'm 99% sure that optimizations in release mode have to allow for wrapping. It'd be UB to overflow elsewise, which is not allowed in safe code. With use of iterators over C style loops, though, the main reason for overflow UB in C/C++ is no longer present, anyway. – CAD97 Aug 05 '19 at 03:00
    @CAD97 Release mode *uses* wrapping arithmetic but does not check for overflow and panic like debug mode does. This behavior was defined by [RFC 560](https://github.com/rust-lang/rfcs/pull/560). It's not UB. – trent Aug 05 '19 at 11:47
  • @trentcl: thanks; I'm casually interested in Rust but don't use it regularly. I was assuming that optimized Rust math was like C/C++ signed math, so that wrong claim in this answer is from my edit. I guess it should just be removed if there's no need to go into any detail about it here? (Am I understanding correctly that making regular `+` detect overflow in debug mode is only useful for catching unexpected integer-wrapping bugs, not for allowing any optimization based on overflow not happening?) – Peter Cordes Aug 05 '19 at 22:02
  • @PeterCordes What wrong claim? ;-) Yes, as far as I understand it, Rust doesn't allow that particular class of integer optimizations (precisely because it would lead to UB). – trent Aug 05 '19 at 22:14
    @PeterCordes: Specifically, Rust the language specifies that overflow is unspecified, and rustc (the only compiler) specifies two behaviors to choose from: Panic or Wrap. Ideally, Panic would be used by default. In practice, due to sub-optimal code-generation, in Release mode the default is Wrap, and a long-term goal is to move to Panic when (if ever) code-generation is "good enough" for mainstream use. Also, all Rust integral types support named operations to pick a behavior: checked, wrapping, saturating, ... so you can override the selected behavior on a per operation basis. – Matthieu M. Aug 06 '19 at 10:49
    @MatthieuM.: Yes, I love the wrapping vs. checked vs. saturating add/sub/shift/whatever methods on primitive types. So much better than C's wrapping unsigned, UB signed forcing you to pick based on that. Anyway, some ISAs could provide efficient support for Panic, e.g. a sticky flag you can check after a whole sequence of operations. (Unlike x86's OF or CF which are overwritten with 0 or 1.) e.g. Agner Fog's proposed ForwardCom ISA (https://www.agner.org/optimize/blog/read.php?i=421#478) But that still constrains optimization to never do any calculation the Rust source didn't do. :/ – Peter Cordes Aug 06 '19 at 11:03
  • @MatthieuM.: So I think long term you're always going to want an option to not actually enforce Panic because for most operations on most ISAs there's no efficient way to check. (Exceptions include MIPS's faulting-`add` instruction, `add` instead of `addu`, that faults on signed overflow). Some future x86 extension could add a "sticky overflow" flag in one of the otherwise-reserved bits of EFLAGS, and an instruction to clear it or branch on it being set. (Or just to read it or something.) Otherwise on x86 using `jo` or `into` after every add/sub/shift is going to suck. And no SIMD. – Peter Cordes Aug 06 '19 at 11:09
    @PeterCordes: The best ISA I've seen for this is the "vaporware" Mill ISA which poisons the value and carries on, both in scalar and SIMD. This means you just perform the whole chain of operations on your vector, and at the end you can check each element to see if any is poisoned. And of course, it also allows speculative computations. At the moment, though, I'm afraid the first hurdle is at the compiler level; panic on under/overflow prevents reordering: `5 + x - 3` does not have the same domain as `x - 3 + 5`, you lose commutativity! – Matthieu M. Aug 06 '19 at 11:27
  • @MatthieuM.: I wonder if there's some way to define integer overflow that sidesteps that, like only if the result of an entire expression differs from the non-wrapped result. Otherwise programmers are going to have to learn that integer operations are non-commutative, just like FP operations, so putting constants together is important. (And I guess first if they have different signs, otherwise you still have to do `x - 3 + 5` in 2 steps instead of `x + 2` if you insist on panic on x<3.) If overflow is only checked per-expression, you could choose `tmp=x-3 ; tmp + 5` to request both checks. – Peter Cordes Aug 06 '19 at 11:38
  • @PeterCordes: It seems like you'd need some sign of "overflow counter" which is incremented on overflow, decremented on underflow, and should be 0 when you wish to use the number... but even it's weird... – Matthieu M. Aug 06 '19 at 12:40
34

Yes, just the same way as 64-bit integers on 32-bit machines were handled, or 32-bit integers on 16-bit machines, or even 16- and 32-bit integers on 8-bit machines (still applicable to microcontrollers!). Yes, you store the number in two registers, or memory locations, or whatever (it doesn't really matter). Addition and subtraction are trivial, taking two instructions and using the carry flag. Multiplication requires three multiplies and some additions (it's common for 64-bit chips to already have a 64x64->128 multiply operation that outputs to two registers). Division... requires a subroutine and is quite slow (except in some cases where division by a constant can be transformed into a shift or a multiply), but it still works. Bitwise and/or/xor merely have to be done on the top and bottom halves separately. Shifts can be accomplished with rotation and masking. And that pretty much covers things.
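The shift case can be sketched in Rust on a (lo, hi) pair of 64-bit words. The helper name is mine, and a real implementation would also handle shifts of 0 and of 64 or more bits:

```rust
// Sketch: left-shifting a 128-bit value stored as two 64-bit words
// by n bits (0 < n < 64), moving the bits that cross the word boundary.
fn shl128(lo: u64, hi: u64, n: u32) -> (u64, u64) {
    assert!(n > 0 && n < 64);
    let new_hi = (hi << n) | (lo >> (64 - n)); // bits carried out of the low word
    let new_lo = lo << n;
    (new_lo, new_hi)
}

fn main() {
    let (lo, hi) = shl128(1 << 63, 0, 1); // 2^63 << 1 = 2^64
    assert_eq!((lo, hi), (0, 1));
    // Cross-check against Rust's native u128 shift.
    let x: u128 = 1u128 << 63;
    assert_eq!(((x << 1) as u64, ((x << 1) >> 64) as u64), (lo, hi));
}
```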

hobbs
31

To provide perhaps a clearer example, on x86_64, compiled with the -O flag, the function

pub fn leet(a : i128) -> i128 {
    a + 1337
}

compiles to

example::leet:
  mov rdx, rsi
  mov rax, rdi
  add rax, 1337
  adc rdx, 0
  ret

(My original post had u128 rather than the i128 you asked about. The function compiles to the same code either way, a good demonstration that signed and unsigned addition are the same on a modern CPU.)

The other listing produced unoptimized code. It’s safe to step through in a debugger, because it makes sure you can put a breakpoint anywhere and inspect the state of any variable at any line of the program. It’s slower and harder to read. The optimized version is much closer to the code that will actually run in production.

The parameter a of this function is passed in a pair of 64-bit registers, rsi:rdi. The result is returned in another pair of registers, rdx:rax. The first two lines of code initialize the sum to a.

The third line adds 1337 to the low word of the input. If this overflows, it carries the 1 in the CPU’s carry flag. The fourth line adds zero to the high word of the input—plus the 1 if it got carried.

You can think of this as simple addition of a one-digit number to a two-digit number

  a  b
+ 0  7
______
 

but in base 18,446,744,073,709,551,616. You’re still adding the lowest “digit” first, possibly carrying a 1 to the next column, then adding the next digit plus the carry. Subtraction is very similar.

Multiplication must use the identity (2⁶⁴a + b)(2⁶⁴c + d) = 2¹²⁸ac + 2⁶⁴(ad+bc) + bd, where each of these multiplications returns the upper half of the product in one register and the lower half of the product in another. Some of those terms will be dropped, because bits above the 128th don’t fit into a u128 and are discarded. Even so, this takes a number of machine instructions. Division also takes several steps. For a signed value, multiplication and division would additionally need to convert the signs of the operands and the result. Those operations are not very efficient at all.
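That identity can be sketched in Rust, treating a u128 as two 64-bit halves. The helper name is mine, and the compiler's actual lowering may differ:

```rust
// Sketch of 128x128 -> 128 multiplication via the identity
// (2^64 a + b)(2^64 c + d) = 2^128 ac + 2^64 (ad + bc) + bd.
// The 2^128 ac term lies entirely above bit 127 and is discarded.
fn mul128(a: u128, b: u128) -> u128 {
    let (a_hi, a_lo) = ((a >> 64) as u64, a as u64);
    let (b_hi, b_lo) = ((b >> 64) as u64, b as u64);
    // One full 64x64 -> 128 product for the low halves...
    let lo_product = (a_lo as u128) * (b_lo as u128);
    // ...and the cross terms, of which only the low 64 bits survive
    // once shifted up by 2^64.
    let cross = a_lo.wrapping_mul(b_hi).wrapping_add(a_hi.wrapping_mul(b_lo));
    lo_product.wrapping_add((cross as u128) << 64)
}

fn main() {
    let a = 0x1234_5678_9abc_def0_1111_2222_3333_4444u128;
    let b = 0xfedc_ba98_7654_3210_5555_6666_7777_8888u128;
    assert_eq!(mul128(a, b), a.wrapping_mul(b));
}
```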

On other architectures, it gets easier or harder. RISC-V defines a 128-bit instruction-set extension, although to my knowledge no one has implemented it in silicon. Without this extension, the RISC-V architecture manual recommends a conditional branch: addi t0, t1, +imm; blt t0, t1, overflow

SPARC has condition codes like the flags of x86, but you have to use a special instruction, add,cc, to set them. MIPS, on the other hand, requires you to check whether the sum of two unsigned integers is strictly less than one of the operands. If so, the addition overflowed. At least you're able to set another register to the value of the carry bit without a conditional branch.
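The MIPS-style check can be expressed in portable Rust: after a wrapping add, the sum is strictly less than an operand exactly when a carry occurred (the helper name is mine):

```rust
// Sketch of a flag-free carry check, as on MIPS: if the wrapped sum is
// less than one of the operands, the unsigned addition overflowed.
fn add_with_carry_flagless(a: u64, b: u64) -> (u64, u64) {
    let sum = a.wrapping_add(b);
    let carry = (sum < a) as u64; // 1 if the addition wrapped, else 0
    (sum, carry)
}

fn main() {
    assert_eq!(add_with_carry_flagless(u64::MAX, 1), (0, 1)); // wrapped
    assert_eq!(add_with_carry_flagless(2, 3), (5, 0));        // no carry
}
```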

Davislor
    last paragraph: To detect which of two *unsigned* numbers is greater by looking at the high bit of a `sub` result, you need an `n+1` bit sub result for `n` bit inputs. i.e. you need to look at the carry-out, not the sign bit of the same-width result. That's why x86 unsigned branch conditions are based on CF (bit 64 or 32 of the full logical result), not SF (bit 63 or 31). – Peter Cordes Aug 27 '19 at 16:01
  • @PeterCordes Thanks again for noticing the mistake so promptly! I’m grateful. Corrected. – Davislor Aug 27 '19 at 21:59
  • I recently wrote an answer about designing a toy CPU and using its ALU to do signed comparison. So I noticed it easily when looking at the diff of your edit. (I tend to look at most edits in the `[assembly]` and `[x86]` and some other tags I watch since they're low-traffic enough for me to not go insane doing that.) But I can't find that answer anymore with search; possibly it got deleted :( – Peter Cordes Aug 28 '19 at 06:20
  • @PeterCordes I was actually thinking about the same problem myself recently, which is why that slip-up was so embarrassing. (The 128-bit integers are signed, but the lower 64 bits must be treated as unsigned here.) My understanding is that adding state like a carry flag would complicate the pipeline of an out-of-order implementation, and RISC ISAs have been moving away from adding dependencies on flags, but I'd been wondering whether a requirement that the instructions generating and using the carry be consecutive would ameliorate that. I’m not aware of any ISAs that use such a solution. – Davislor Aug 28 '19 at 06:42
  • @PeterCordes Some modern architectures do fuse consecutive div and mod instructions in a similar way, rather than use a visible register such as `%hi` on MIPS or `%y` on SPARC. – Davislor Aug 28 '19 at 06:45
  • Yes exactly. In an extended-precision integer, only the top chunk contains the sign bit. For 2's complement, the rest all have standard power-of-2 place-values. 2's complement addition/subtraction is the same binary operation as for unsigned, including software extended-precision implementations. – Peter Cordes Aug 28 '19 at 06:48
  • @PeterCordes Yes, exactly. – Davislor Aug 28 '19 at 06:49
    re: divmod: AArch64's approach is to provide division and an instruction that does integer `x - (a*b)`, computing the remainder from the dividend, quotient, and divisor. (That is useful even for constant divisors using a multiplicative inverse for the division part). I hadn't read about ISAs that fuse div+mod instructions into a single divmod operation; that's neat. – Peter Cordes Aug 28 '19 at 06:51
    re: flags: yes, a flag output is a 2nd output that OoO exec + register-renaming has to handle somehow. x86 CPUs handle it by keeping a few extra bits with the integer result that FLAGS value is based on, so probably ZF, SF, and PF are generated on the fly when needed. I think there's an Intel patent about this. So that reduces the number of outputs that have to be tracked separately back to 1. (In Intel CPUs, no uop can ever write more than 1 integer register; e.g. `mul r64` is 2 uops, with the 2nd one writing the RDX high half). – Peter Cordes Aug 28 '19 at 06:55
    But for efficient extended-precision, flags are very good. The main problem is *without* register renaming for superscalar in-order execution. flags are a WAW hazard (write after write). Of course, add-with-carry instructions are 3-input, and that's also a significant problem to track. Intel before Broadwell decoded `adc`, `sbb`, and `cmov` to 2 uops each. (Haswell introduced 3-input uops for FMA, Broadwell extended that to integer.) – Peter Cordes Aug 28 '19 at 07:00
    RISC ISAs with flags usually make flag-setting optional, controlled by an extra bit. e.g. ARM and SPARC are like this. PowerPC as usual makes everything more complicated: it has 8 condition-code registers (packed together into one 32-bit register for save/restore) so you can compare into cc0 or into cc7 or whatever. And then AND or OR condition-codes together! Branch and cmov instructions can choose which CR register to read. So this gives you the ability to have multiple flags dep chains in flight at once, like x86 ADCX / ADOX. http://alanclements.org/power%20pc.html – Peter Cordes Aug 28 '19 at 07:03