Is __int128_t arithmetic emulated by GCC, even with SSE?

Question

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.

The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.

So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?

There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations). — Paul R, May 15 '13 at 15:29
There's not even a 128 bit *add* in SSE/AVX. You could emulate it with bitwise operations and shifts, but given that you already have proper 64 bit scalar arithmetic instructions in x86-64 which can easily be combined for 128 bit operations there would seem to be nothing to be gained from this. — Paul R, May 15 '13 at 16:07
Thanks @Paul. I made an answer out of this, I hope you don't mind. — Douglas B. Staple, May 15 '13 at 17:19
Also see [128-bit integer - nonsensical documentation](https://gcc.gnu.org/ml/gcc-help/2015-08/msg00176.html) on the GCC mailing list. Its an interesting discussion in that the devs talk about various implementation details to explain their reasoning for the wording of the docs. — jww, Sep 07 '15 at 00:01
Possible duplicate of [Is it possible to use SSE and SSE2 to make a 128-bit wide integer?](https://stackoverflow.com/questions/12200698/is-it-possible-to-use-sse-and-sse2-to-make-a-128-bit-wide-integer) — phuclv, Aug 03 '19 at 01:20
[Can XMM registers be used to do any 128 bit integer math?](https://stackoverflow.com/q/6738283/995714), — phuclv, Aug 03 '19 at 01:24

score 12 · Accepted Answer · answered May 15 '13 at 17:18

I was confusing two different things in my question.

Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.

The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.

score 5 · Answer 2 · answered May 15 '13 at 13:48

5

SSE2-AVX instructions are available for 8,16,32,64-bit integer data types. They are mostly intended to treat packed data together, for example, 128-bit register may contain four 32-bit integers and so on.

answered May 15 '13 at 13:48

MBo

77,366
5
53
86

3

This is actually explained very nicely [on Wikipedia](https://en.wikipedia.org/wiki/128-bit): "Most modern CPUs feature SIMD instruction sets (SSE, AltiVec etc.) where 128-bit vector registers are used to store several smaller numbers, such as four 32-bit floating-point numbers. A single instruction can then operate on all these values in parallel. However, these processors do not operate on individual numbers that are 128 binary digits in length, only their registers have the size of 128-bits." – Douglas B. Staple May 15 '13 at 18:31

score 5 · Answer 3 · edited Aug 03 '19 at 01:19

5

Although SSE/AVX/AVX-512/etc. have no 128-bit mode (their vector elements are strictly 64-bit max, and operations will simply overflow), as Paul R has implied, the main CPU does support limited 128-bit operations, by using a pair of registers.

When multiplying two regular 64-bit number, MUL/IMUL can outputs its 128-bit result in the RAX/RDX register pair.
Inversely, when dividing DIV/IDIV can take its input from then RAX/RDX pair to divide a 128-bit number by a 64-bit divisor (and outputs 64-bit quotient + 64-bit modulo)

Of course the CPU's ALU is 64-bit, thus - as implied Intel docs - these higher extra 64-bit come at the cost of extra micro-ops in the microcode. This is more dramatic for divisions (> 3x more) which already require lots of micro-ops to be processed.

Still that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instruction and not care to do any 128-bit emulation by itself.

This has been available for a long time:

since 80386, 32-bit CPU could do 64-bit multiplication/division using EAX:EDX pair
since 8086/88, 16-bit CPU could do 32-bit multiplication/division using AX:DX pair

(As for additions and subtraction: thank to the support for carry, it's completely trivial to do additions/subtractions of numbers of any arbitrary length that can fill your storage).

edited Aug 03 '19 at 01:19

phuclv

37,963
15
156
475

answered Mar 26 '15 at 18:16

DrYak

1,086
1
10
15

1

“Of course the CPU lacks a proper 128bit ALU, so these are probably partially support by the microcode.” Really, you think that 64x64 multiplication, the **only** kind of multiplication available for 64-bit GPR operands in the x86-64 ISA, is implemented by microcode because “the CPU lacks a proper 128-bit ALU”? – Pascal Cuoq Mar 26 '15 at 18:39
@PascalCuoq: I've badly formulated my thoughs. (Sorry not native speaker). (For the record, back with the 80386, internally the multiplication was implemented internally with a loop of additions. The bigger the output of the multiplication -> the longer it takes to get the result.) Technology has moved on with recent processors, but I *suspect* that, because intel officially advertise their ALU as 64bits, that If you want to get big int results (64x64=128 instead of 64x64=64), it's *possible* that these extra higher 64bits will cost you a few extra micro-ops. – DrYak Mar 26 '15 at 19:14
again, there is only one multiplication of 64-bit GPR operands in the ISA. Do you think that Intel and AMD do not make it as fast as their can **knowing that it is the only one that programmers can call even if they only need the lowest 64 bits of the result**? “64-bit ALU” and “128-bit ALU” are only oversimplifications. When you know that the only multiplication instruction is a 64x64->128, you make an ALU that supports 64x64->128 multiplication, period. – Pascal Cuoq Mar 26 '15 at 19:32
2

@PascalCuoq: I happen to have looked up timings at intel. - 64 bits MUL instruction on the CPU doesn't take the same amount of time depending on result you read back 64bits x 64bits = 64bits or 64bits x 64bits = 128bits. those extra bits take extra time (well more precisely, the high value in RDX only becomes available later than the first low 64bits in RAX). (For 128bits results only. 64bits results of 32bits don't do this. So extra cycles cause by extra bits, not by spanning across 2 registers). – DrYak May 20 '15 at 17:45
1

(Splitting because of comment limit): To me, looks like indeed ALU only gives out 64bits at a time: If you want the extra high 64bits of your 128bits ISA "MUL", you wait for an extra micro-ops to be executed. That's only specific for 128bits output. (the 128bits looks indeed done by microcode: an extra micro-ops, that other wise isn't necessary) If your code uses 128bits results, that next instruction will need to wait a bit longer. (but given the lenght of these instructions its possible to hiden the latency). – DrYak May 20 '15 at 17:46
2

"there is only one multiplication of 64-bit GPR operands in the ISA" Well actually no. There are several multiplication instruction in the ISA. Only one of them is 64x64=128 (the one that implicitely use RDX:RAX pair as result), all the others are 64x64=64 (The on that explicitely specifiy the target register). The 128bit one is the only one that take one cycle longer for the upper 64bits. Same situation with divisions. – DrYak May 21 '15 at 00:17
You are right, I forgot about the 3-operand multiplication instruction that multiplies by an immediate value and specifies a target register. Thanks for all the information in the comments above. – Pascal Cuoq May 21 '15 at 08:16

Is __int128_t arithmetic emulated by GCC, even with SSE?

3 Answers3

Linked