24

Do we still need to emulate 128-bit integers in software, or is there hardware support for them in your average desktop processor these days?

Z boson
  • 32,619
  • 11
  • 123
  • 226
Filip Haglund
  • 13,919
  • 13
  • 64
  • 113
  • I added the x86 tag since that's the most common desktop instruction set. Your question would have broader appeal if you included tablets and phones as well, which mostly use ARM. I doubt ARM has 64-bit * 64-bit to 128-bit instructions. – Z boson Dec 12 '15 at 12:43

5 Answers

19

The x86-64 instruction set can do a 64-bit * 64-bit to 128-bit multiply with one instruction (`mul` for unsigned, `imul` for signed, each with a one-operand form), so I would argue that to some degree the x86 instruction set does include some support for 128-bit integers.

If your instruction set does not have an instruction to do 64-bit*64-bit to 128-bit then you need several instructions to emulate this.

This is why 128-bit * 128-bit to lower 128-bit multiplication can be done with only a few instructions on x86-64. For example, with GCC

__int128 mul(__int128 a, __int128 b) {
    return a*b;
}

produces this assembly

imulq   %rdx, %rsi
movq    %rdi, %rax
imulq   %rdi, %rcx
mulq    %rdx
addq    %rsi, %rcx
addq    %rcx, %rdx

which uses one 64-bit * 64-bit to 128-bit instruction, two 64-bit * 64-bit to lower 64-bit instructions, and two 64-bit additions.
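The same decomposition can be sketched in C (a hypothetical helper, not from the original answer; GCC's `unsigned __int128` is used only for the single widening step that `mulq` performs):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Lower 128 bits of a 128x128 product, mirroring the assembly above:
   one widening 64x64 => 128 multiply plus two low-half 64x64 => 64
   multiplies for the cross products. */
u128 mul128_lo(u128 a, u128 b) {
    unsigned __int128 p = (unsigned __int128)a.lo * b.lo; /* the mulq */
    u128 r;
    r.lo = (uint64_t)p;
    /* the cross products only affect the high half (the two imulq) */
    r.hi = (uint64_t)(p >> 64) + a.lo * b.hi + a.hi * b.lo;
    return r;
}
```

The cross products can use truncating multiplies because their upper halves would only contribute to bits 128 and above, which we discard.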

Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 2
    @Filip: There's also `adc` / `sbb` (add with carry, subtract with borrow), and div/idiv (128b/64b -> 64b quotient and 64b remainder). So adding/subtracting 128b integers only takes two instructions (not counting data movement). Bitwise boolean stuff can be done in SSE vectors. – Peter Cordes Dec 12 '15 at 18:43
  • @PeterCordes, good points. I did not think of `adc` as 128-bit support but I see your point. Some instruction sets do not have `adc` (e.g. SSE/AVX) and so they require more instructions to do multi-word additions. – Z boson Dec 12 '15 at 19:04
17

I'm going to explain it by comparing desktop processors to simple microcontrollers, since their arithmetic logic units (ALUs, the calculators in the CPU) operate similarly, and by comparing the Microsoft x64 calling convention with the System V calling convention. For the short answer, scroll to the end; for the long answer, it's easiest to see the difference by comparing x86/x64 to ARM and AVR:

Long Answer

Native Double Word Integer Multiply Architecture Support Comparison

CPU                 word x word => dword               dword x dword => dword
M0                  No (only 32x32 => 32)              No
AVR                 8x8 => 16 (some versions only)     No
M3/M4/A             Yes (32x32 => 64)                  No
x86/x64             Yes (up to 64x64 => 128)           Yes (up to 64x64 => 64 for x64)
SSE/SSE2/AVX/AVX2   Yes (32x32 => 64 SIMD elements)    No (at most 32x32 => 32 SIMD elements)

If you understand this chart, skip to Short Answer

CPUs in smartphones, PCs, and servers have multiple ALUs that perform calculations on registers of various widths. Microcontrollers, on the other hand, usually have only one ALU. The word size of the CPU is not necessarily the same as the word size of the ALU, though they can match, the Cortex-M0 being a prime example.

ARM Architecture

The Cortex-M0 is a Thumb-2 processor, which is a compact (mostly 16-bit) instruction encoding for a 32-bit architecture (register and ALU width). The Cortex-M3/M4 have some more instructions, including smull / umull, 32x32 => 64-bit widening multiplies that are helpful for extended precision. Despite these differences, all ARM CPUs share the same set of architectural registers, which makes it easy to upgrade from the M0 to the M3/M4 and to faster Cortex-A series smartphone processors with NEON SIMD extensions.

(figure: ARM architectural registers)

When performing a binary operation, it is common for the value to overflow a register (i.e. get too large to fit in the register). ALUs have n-bit inputs and an n-bit output with a carry-out (i.e. overflow) flag.

(figure: n-bit ALU with carry-out flag)

Double-word addition cannot be performed in one instruction, but it requires relatively few. Multiplication, however, needs double the word size to fit the result, and the ALU has only n-bit inputs and an n-bit output when you need 2n output bits, so that alone won't work. For example, multiplying two 32-bit integers yields a 64-bit result, and two 64-bit integers yield up to a 128-bit result needing 4 word-sized registers; 2 is not bad, but 4 gets complicated and you run out of registers. How the CPU handles this differs. The Cortex-M0 has no instructions for that, but the Cortex-M3/M4 have an instruction for a 32x32 => 64-bit register multiply that takes 3 clock cycles.

(You can use Cortex-M0's 32x32 => 32-bit muls as a 16x16=>32-bit building block for larger multiplies; this is obviously inefficient but probably still better than manually shifting and conditionally adding.)
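That building-block approach can be sketched in C (hypothetical helper names), using only 16x16 => 32 products, none of which can overflow a 32-bit register:

```c
#include <stdint.h>

/* 32x32 => 64 multiply built from four 16x16 => 32 partial products:
   the technique a core without a widening multiply (like the M0)
   would use as a building block. */
uint64_t mul32x32_64(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    uint64_t lo = (uint64_t)(al * bl);        /* bits  0..31 */
    uint64_t m1 = (uint64_t)(al * bh) << 16;  /* bits 16..47 */
    uint64_t m2 = (uint64_t)(ah * bl) << 16;  /* bits 16..47 */
    uint64_t hi = (uint64_t)(ah * bh) << 32;  /* bits 32..63 */
    return lo + m1 + m2 + hi;
}
```

On real hardware the 64-bit accumulation itself would also be done in 32-bit halves with carries; it is written with uint64_t here only to keep the sketch short.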

AVR Architecture

The AVR microcontroller has 131 instructions that work on 32 8-bit registers. It is classified as an 8-bit processor by its register width, but it has both an 8-bit and a 16-bit ALU. The AVR processor cannot do 16x16 => 32-bit calculations with two 16-bit register pairs or 64-bit integer math without a software hack. This is the opposite of the x86/x64 design in both the organization of registers and the ALU overflow operation, which is why AVR is classified as an 8/16-bit CPU. Why do you care? It affects performance and interrupt behavior.

AVR "tiny" and other devices without the "enhanced" instruction set don't have hardware multiply at all. Where it is supported, the mul instruction is an 8x8 => 16-bit hardware multiply; https://godbolt.org/z/7bbqKn7Go shows how GCC uses it.

(figure: AVR architectural registers)

x86 Architecture

On x86, multiplying two 32-bit integers with the MUL instruction produces an unsigned 64-bit result in the EDX:EAX pair; likewise, the 64-bit form produces a 128-bit result in the RDX:RAX pair.

Adding 64-bit integers on x86 requires only two instructions (add/adc thanks to the carry flag), same for 128-bit on x86-64. But multiplying two-register integers takes more work.
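The add/adc pattern looks like this in C (a hypothetical helper; GCC and Clang compile the comparison down to the carry flag):

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* 128-bit add from two 64-bit halves: one `add`, then one `adc`.
   The carry into the high half is exactly the unsigned wraparound
   of the low half. */
u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
    return r;
}
```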

On 32-bit x86, for example, 64x64 => 64-bit multiplication (long long) requires a lot of instructions, including 3 multiplies (the low x low one widening, the cross products not, because we don't need the high 64 bits of the full result). Here is an example of 32x64 => 64-bit signed multiply assembly for x86:

 movl 16(%ebp), %esi    ; get y_l
 movl 12(%ebp), %eax    ; get x_l
 movl %eax, %edx
 sarl $31, %edx         ; get x_h, (x >>a 31), higher 32 bits of sign-extension of x
 movl 20(%ebp), %ecx    ; get y_h
 imull %eax, %ecx       ; compute s: x_l*y_h
 movl %edx, %ebx
 imull %esi, %ebx       ; compute t: x_h*y_l
 addl %ebx, %ecx        ; compute s + t
 mull %esi              ; compute u: x_l*y_l
 leal (%ecx,%edx), %edx ; u_h += (s + t), result is u
 movl 8(%ebp), %ecx
 movl %eax, (%ecx)
 movl %edx, 4(%ecx)
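For reference, C along these lines (a hypothetical function, not from the original post) compiles to roughly that sequence on 32-bit x86:

```c
#include <stdint.h>

/* 32x64 => 64-bit signed multiply: the 32-bit factor is sign-extended
   (the `sarl $31` above), then three 32-bit multiplies build the
   64-bit product, which is stored through the result pointer. */
void mul32x64(int64_t *result, int32_t x, int64_t y) {
    *result = (int64_t)x * y;
}
```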

x86 supports pairing two registers to store the full multiply result (including the high half), but you can't use the two registers to perform the task of a 64-bit ALU. This is the primary reason why x64 software runs faster than x86 software for 64-bit or wider integer math: you can do the work in a single instruction! You might imagine that 128-bit multiplication in x86 mode would be very computationally expensive; it is. x64 is very similar to x86, except with twice the number of bits.

(figure: x86 architectural registers)

(figure: x64 architectural registers)

When a CPU pairs 2 word-sized registers to create a single double-word value, the resulting double word is aligned to a word boundary when stored in RAM. Beyond the two-register pair, four-word math is a software hack. This means that on x64 two 64-bit registers may be combined into a 128-bit register pair whose value gets stored to a 64-bit-aligned boundary in RAM, but 128x128 => 128-bit math is a software hack.

The x86/x64, however, is a superscalar CPU, and the registers you know of are merely the architectural registers. Behind the scenes, there are a lot more registers that help optimize the CPU pipeline to perform out of order instructions using multiple ALUs.

SSE/SSE2 introduced 128-bit SIMD registers, but no instructions treat them as a single wide integer. There's paddq that does two 64-bit additions in parallel, but no hardware support for 128-bit addition, or even support for manually propagating carry across elements. The widest multiply is two 32x32=>64 operations in parallel, half the width of what you can do with x86-64 scalar mul. See Can long integer routines benefit from SSE? for the state of the art, and the hoops you have to jump through to get any benefit from SSE/AVX for very big integers.

Even with AVX-512 (for 512-bit registers), the widest add / mul instructions still operate on 64-bit elements. AVX-512 did introduce a 64x64 => 64-bit multiply (vpmullq) in SIMD elements.

Short Answer

The way C++ applications handle 128-bit integers differs based on the operating system's (or bare-metal) calling convention. Microsoft has their own convention in which, much to my own dismay, a 128-bit result CAN NOT be returned from a function as a single value. The Microsoft x64 calling convention dictates that when returning a value, you may return one 64-bit integer or two 32-bit integers. For example, you can do word * word = dword, but in Visual C++ you must use _umul128 to return the HighProduct, regardless of it being in the RDX:RAX pair. I cried, it was sad. :-(

The System V calling convention, however, does allow returning 128-bit types in RDX:RAX. https://godbolt.org/z/vdd8rK38e. (And GCC / Clang have __int128 to get the compiler to emit the necessary instructions for 2-register add/sub/mul, and a helper function for div - Is there a 128 bit integer in gcc?)
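With GCC or Clang, `__int128` then behaves like any other integer type. For example (a hypothetical helper), packing two 64-bit halves:

```c
#include <stdint.h>

/* Build a 128-bit value from two 64-bit halves. With GCC/Clang,
   add/sub/mul on __int128 compile to short inline sequences, while
   division calls a libgcc helper (__udivti3 and friends). */
unsigned __int128 make_u128(uint64_t hi, uint64_t lo) {
    return ((unsigned __int128)hi << 64) | lo;
}
```

Note that `__int128` is only available on 64-bit targets, where the two halves fit in a register pair.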

As for whether you should count on 128-bit integer support: it's extremely rare to come across a user on a 32-bit x86 CPU because they are too slow, so it is not best practice to design software to run on 32-bit x86 CPUs; it increases development costs and may lead to a degraded user experience. Expect an Athlon 64 or Core 2 Duo as the minimum spec. You can expect the code not to perform as well on Microsoft OSes as on Unix OS(s).

The Intel architectural registers are set in stone, but Intel and AMD are constantly rolling out new architecture extensions; compilers and apps take a long time to update, so you can't count on them for cross-platform code. You'll want to read the Intel 64 and IA-32 Architectures Software Developer's Manual and the AMD64 Architecture Programmer's Manual.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • You're looking at the Windows x64 convention. x86-64 System V returns 128-bit integer values (including structs) in the `rdx:rax` register pair. https://godbolt.org/g/vhJyKE. – Peter Cordes Jun 09 '18 at 20:44
  • 1
    Thumb2 is a compact encoding for the same instructions you can use in ARM mode, so it's a 32-bit ISA. Most instructions can only use 8 registers, and only have 2 operands, not a separate destination; that's how they save space in the instruction encoding. e.g. `adds r0, r1` only takes 2 bytes of machine code, but operates on 32-bit registers. Your suggestion that it takes two 16-bit instructions to deal with 32-bit registers is not how Thumb mode works. You're mixing up the operand-size with the instruction width. – Peter Cordes Jun 09 '18 at 20:49
  • I've been programming Thumb2 code for 8 years at this point. Trust me, when it compiles it is 16-bit instructions. They don't have pure 32-bit instructions. The Cortex M3 does have real 32-bit instructions. –  Jun 09 '18 at 20:57
  • The x86 and x64 calling conventions are from AMD, not Microsoft. You can make whatever calling convention you want to support 128 or 256 bits, but the convention is set up the way it is for a reason. My answer is correct. –  Jun 09 '18 at 21:04
  • 2
    Microsoft clearly designed their own x64 calling convention. GCC developers (especially Jan Hubicka) designed the x86-64 System V convention which Linux, OS X, *BSD and other non-Windows OSes use. [Why does Windows64 use a different calling convention from all other OSes on x86-64?](https://stackoverflow.com/q/4429398). AMD itself didn't design either 64-bit calling convention. – Peter Cordes Jun 09 '18 at 21:15
  • The ARM calling convention was designed by ARM, the x86 calling convention was designed by Intel, and the x64 calling convention was designed by AMD. This is why Stack Overflow pisses me off. People don't know what they are talking about and mark you down when they are in fact wrong. Microsoft didn't invent the CPU architecture calling a convention. Y'all are working with the SSE/SSE2 instructions, not the x64 calling convention. You don't know what you're talking about. –  Jun 09 '18 at 21:21
  • 1
    My answer on [Why does Windows64 use a different calling convention from all other OSes on x86-64?](https://stackoverflow.com/q/4429398) has links to the mailing list archive threads where the x86-64 System V ABI's calling-convention details were designed. The GCC developer who took the lead on that was/is not an AMD employee (http://www.ucw.cz/~hubicka/). Do you have a source for your claim that AMD designed the Windows x64 calling convention? – Peter Cordes Jun 09 '18 at 21:26
  • I don't know the history on i386 calling conventions, but there is no single 32-bit calling convention. Different compilers invented their own. The claim that Intel designed *the* x86 calling convention is just completely bogus for multiple reasons. – Peter Cordes Jun 09 '18 at 21:28
  • Sorry I was not correct. On SysV you can have 128-bit return values in RAX and RDX. I'm on Windows too much. I will update my answer. –  Jun 10 '18 at 00:45
  • I added some assembly to prove my claims. I did speak though about AVR. AVR cannot do 16x16=>32-bit multiplication. x64 can do 64x64=>128-bit though, but not 128x128=>128-bit. –  Jun 13 '18 at 06:36
  • The opening paragraphs about instruction width are a total red herring. The fact that ARM Thumb-2 encodes 32x32=>64 multiplication as a 32-bit instruction is totally irrelevant to how many ALU operations it takes, or any kind of big-picture understanding of 128-bit integer support. The things that matter are register width and what operations are supported in a single machine instruction. (i.e. hardware support, whether it breaks down into multiple internal operations or not, like 64x64 integer multiply on Intel Atom.) – Peter Cordes Jun 13 '18 at 08:20
  • No, it is not. They all operate the same way. It's the method the ALU uses to handle the creation of hashes. I'm a computer engineer and I design this stuff. The end story is that 128-bit integer support is not fully supported. I just laid out the engineering for you in detail, if you don't understand it you don't know what you're talking about. –  Jun 13 '18 at 17:42
  • Do you want me to sit here and write you a book about it? We're up to the better part of a chapter already. –  Jun 13 '18 at 17:42
  • Actually, come to think about it, the reason why the Microsoft C++ Calling Convention does support 128-bit return types is more than likely for the exact reasons that I detailed. You CPU fully supports 128-bit integers use CPU, but this is not an x64 architecture feature. Where you will see this happen is with Python and other cross-platform tools. You won't have AVX acceleration because it's not a standard CPU architecture feature on AMD. It's all compiler dependent. I've been wrestling with this problem for a number of years with my own software-defined networking protocol Script. –  Jun 13 '18 at 17:52
  • I added a comparison chart that clearly explains the entire answer by comparing the integer ALU operation in a matrix. –  Jun 13 '18 at 18:23
  • 1
    If the `x` in your chart is supposed to be multiply, then x86-64 with SSE/AVX/AVX2/AVX512 do not have 64x64 => 128-bit multiply. AVX512 has 64x64 => 64 packed multiply ([`vpmullq`](http://felixcloutier.com/x86/PMULLD:PMULLQ.html)), but AVX2's widest integer multiply is 32x32 => 64 elements. (Or you can use hacks like doing 52-bit integer multiplies using the `double`-precision FMA instructions: [Can I use the AVX FMA units to do bit-exact 52 bit integer multiplications?](https://stackoverflow.com/q/41403718). I think Prime95 does something like that.) – Peter Cordes Jun 14 '18 at 01:03
  • Sorry I was rude, I have PTSD and didn't know it. –  Jul 31 '18 at 17:54
  • My previous comment about your chart still applies. x86-64 can do 64x64=>64-bit scalar, or 64x64=>128-bit scalar (with about the same throughput), but with SIMD up to AVX2 can only do 32x32=>64-bit. AVX512 introduces 64x64=>64-bit integer multiply, but it's still narrower than what you can do scalar. If "dword" is supposed to mean twice the register width, then only 32-bit x86 with AVX512 has dword x dword => dword. Or if a word is always 32 bits for x86, then x86-64 has scalar dword x dword => dword thanks to having double-width integer regs. – Peter Cordes Jul 31 '18 at 23:13
  • Now that the original author of this answer has deleted their account, I guess it's time to fix all the wrong stuff here, like the wrong claims that SSE2 gives you 128-bit integer math! And to put numbers in the table because the definition of "word" varies by ISA. Also the claim that Cortex-M0 is 16-bit. It doesn't have widening multiply, so for BigInteger work you'd need to use 16-bit chunks (or give up on hardware `muls` entirely), but that doesn't make it a 16-bit CPU! – Peter Cordes May 13 '21 at 00:16
4

Short answer is: NO!

To elaborate, the SSE registers are 128 bits wide, but no instructions exist to treat them as 128-bit integers. At best, these registers are treated as two 64-bit (un)signed integers. Operations like addition can be constructed by adding the two 64-bit halves in parallel and handling the carry manually, but not with a single instruction. Implementing this can get quite complicated and "ugly", look here:

How can I add together two SSE registers

This would have to be done for every basic operation, with probably questionable advantages compared to an implementation using 64-bit general-purpose registers ("emulation" in software). On the other hand, an advantage of this SSE approach is that once it is implemented, it will also work for 256-bit integers (AVX2) and 512-bit integers (AVX-512) with only minor modifications.

zx485
  • 28,498
  • 28
  • 50
  • 59
  • Arguably all the bitwise operations allow treating them as 128-bit integers (or really any combination of whatever width of integers you'd like). – BeeOnRope Jun 09 '18 at 21:28
  • [Can long integer routines benefit from SSE?](https://stackoverflow.com/q/8866973/995714), [practical BigNum AVX/SSE possible?](https://stackoverflow.com/q/27923192/995714) – phuclv Jul 19 '23 at 05:09
  • Thanks for adding the reference. Some time has passed since 2015. – zx485 Jul 24 '23 at 03:05
4

After 5 years, the answer to this question is still "No".

To be specific, let's break it into various operations for 80x86:

Integer Addition

No support for 128-bit. There has always been support for "bigger than supported natively" integer operations (e.g. add then adc). A few years ago support for "bigger than supported natively" integer operations was improved by the introduction of Intel ADX (Multi-Precision Add-Carry Instruction Extensions) so that it can be done while preserving other flags (which could be important in loops - e.g. where other flags control an exit condition).
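The kind of multi-limb carry chain that add/adc (and ADX) implement can be sketched portably (a hypothetical 4-limb, i.e. 256-bit, add; GCC and Clang lower `__builtin_add_overflow` chains to add plus adc):

```c
#include <stdint.h>
#include <stdbool.h>

/* 256-bit add as four 64-bit limbs with carry propagation:
   one add followed by a chain of adc. */
void add256(uint64_t a[4], const uint64_t b[4]) {
    bool carry = false;
    for (int i = 0; i < 4; i++) {
        bool c1 = __builtin_add_overflow(a[i], b[i], &a[i]);
        bool c2 = __builtin_add_overflow(a[i], (uint64_t)carry, &a[i]);
        carry = c1 || c2;  /* at most one of the two adds can carry */
    }
}
```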

Integer Subtraction

No support for 128-bit. There has always been support for "bigger than supported natively" integer operations (e.g. sub then sbb). This hasn't changed (Intel's ADX extension doesn't cover subtraction).

Integer Multiplication

No support for 128-bit (multiplying a 128-bit integer with a 128-bit integer). There has always been support for "bigger than supported natively" integer operations (e.g. multiplying 64-bit integers and getting a 128 bit result).

Integer Division

No support for 128-bit (dividing a 128-bit integer by a 128-bit integer). There has always been partial support for "bigger than supported natively" integer operations (e.g. dividing a 128-bit integer by a 64-bit integer and getting 64 bit results) which doesn't help when the divisor is 128-bit.

Integer Shifts

No support for 128-bit. There has always been support for "bigger than supported natively" integer operations (e.g. shld and shrd, and rcr and rcl).
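The double-word shift that shld covers can be sketched as (a hypothetical helper, valid for 0 < n < 128):

```c
#include <stdint.h>

/* 128-bit left shift from two 64-bit halves, for 0 < n < 128.
   The n < 64 case is what a single shld handles on x86; the
   n >= 64 case degenerates to a plain shift plus a move. */
void shl128(uint64_t *hi, uint64_t *lo, unsigned n) {
    if (n < 64) {
        *hi = (*hi << n) | (*lo >> (64 - n));
        *lo <<= n;
    } else {
        *hi = *lo << (n - 64);
        *lo = 0;
    }
}
```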

Atomics

Mostly no support for 128-bit. There is a single lock cmpxchg16b instruction (introduced soon after the introduction of long mode) that can be used to emulate a 128-bit "atomic load" in a single instruction or a 128-bit "atomic compare if equal" in a single instruction ("atomic store" and "atomic exchange" would require a retry loop). Note: Aligned SSE and AVX loads/stores aren't guaranteed to be atomic (in practice, they might or might not be "pseudo atomic" for simple cases where all reads and writes are the same size).

Bitwise Operations (AND, OR, XOR)

For general purpose registers, no. For SIMD, 128-bit has been supported since SSE2 was introduced in 2000 (since before long mode and 64-bit general purpose registers were supported); but it'd be a rare scenario where that's useful (e.g. where you're not doing a mixture of 128-bit operations and can avoid the cost of moving the values to/from SSE registers).

Bitfield/Bit String Operations (setting, clearing and testing individual bits in a bitfield)

"Partial support sort of". If the bitfield is in register/s, 128-bit bitfields aren't supported. If the bitfield is in memory then 80x86 has supported significantly larger sizes for ages (up to 65536-bit bitfields in 16-bit code, up to 4294967296-bit bitfields in 32-bit code, etc). This includes atomic operations on bitfields (lock bts .., etc).
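In C, a bit string of arbitrary length in memory is just index arithmetic (a hypothetical helper); on x86 the body maps to a single memory-destination `bts` instruction (with `lock` for the atomic variant):

```c
#include <stdint.h>

/* Test-and-set bit n of an arbitrarily long bit string in memory,
   returning the old bit value: what `bts` does in one instruction. */
int bit_test_set(uint64_t *bits, uint64_t n) {
    uint64_t mask = 1ULL << (n % 64);
    int old = (bits[n / 64] & mask) != 0;
    bits[n / 64] |= mask;
    return old;
}
```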

Addressing

No support (for physical addresses or virtual addresses). We still don't even have full 64-bit addresses for either. There is a "5 level paging" extension to increase virtual addresses from 48-bit to 57-bit, but it's hard to be enthusiastic (due to the "usefulness vs. overhead" compromise).

Brendan
  • 35,656
  • 2
  • 39
  • 66
  • BTS requires 386 so `bts [di], eax` with 32-bit operand-size is always possible, even in 16-bit mode, on any CPU that can run it at all. Depending on flavour of 16-bit mode, you might have a segment limit of 65536 *bytes*, but that's a bit-field limit of 524288 (512 Kib). In practice memory-destination BTS with a register bit-index is slower on modern CPUs that doing the addressing manually to load / BTS / store, so it's a single x86 instruction but the hardware only handles it by ~10 uops of microcode. – Peter Cordes May 13 '21 at 03:19
3

RISC-V has a candidate ISA for 128b, RV128I.

https://github.com/brucehoult/riscv-meta/blob/master/doc/src/rv128.md

However, at this point it is only an architecture and it isn't frozen.

Olsonist
  • 2,051
  • 1
  • 20
  • 35