If I were to write assembly code for large integer calculations (e.g. prime factoring, modulo calculations, etc.) with a focus on speed, which architecture would be best suited for this: x86(-64), ARM, PowerPC, MIPS or others?
-
What language do you plan to use for this ? – Paul R Sep 09 '11 at 16:07
-
2Sorry but this question unanswerable in its present form: you're mixing architectures that are at this moment not even targetting the same market segments (mobile vs desktop vs server - so, best suited in what type of computer?). You could restate the question as best integer performance per Hertz or per Watt, that might be answerable... – fvu Sep 09 '11 at 16:11
2 Answers
IMO nothing beats x86-64, because no one else cares about higher-precision arithmetic
Many RISC architectures like MIPS, DEC Alpha, or RISC-V don't have a flag register so you'll need a separate instruction to get the carry. Therefore they're bad choices and eliminated right away. For example to do a += b
in MIPS you need
addu aLow, aLow, bLow # aLow += bLow
sltu tmp, aLow, bLow # carry: tmp = (aLow < bLow)
addu aHigh, aHigh, bHigh # aHigh += bHigh
addu aHigh, aHigh, tmp # aHigh += carry
With a carry flag you need only 2 instructions: add aLow, bLow; adc aHigh, bHigh
The MIPS designers could have done it better, but they didn't
Higher clock helps as Marco van de Voort said, but those architectures don't have 50%-100% faster clock than the equivalent x86. The remaining things he said are fairly incorrect. It's important to note that arbitrary-precision math isn't trivially parallelized, therefore
- Superscalar (more ALUs) doesn't help much because of the dependency chain, which prevents instructions to be run in parallel
- Multiple cores and/or multiple sockets are also relatively useless for the same reason.
- SIMD also won't generally help‡, because it's meant for processing multiple separate pieces of data and not a single big large integer. You don't have any way to propagate the carry from the lower limb to the next one. See
In short: You really want to calculate the carries in parallel which is very difficult
In the x86 world you already have the carry flag from the beginning. But later Intel introduced the ADX instruction set with the new instructions ADOX, ADCX and MULX for accelerating big integer arithmetic even further. How they helps is explained in Intel's paper New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors
But it's not only ADX that makes x86 fast. As I mentioned before (‡) SIMD doesn't really help, but nowadays on x86 things may be different. We have very long vectors in x86 (256 bits with AVX2, 512 bits with AVX512 and maybe longer in the future) so if you use various tricks like using a partial-word arithmetic to delay the carry propagation, or arrange the words in strange ways (like llhhllhhllhhllhh
) instead of linear like in normal big integer arithmetic (llllllllhhhhhhhh
) then SIMD may be faster than scalar operations. For more information you should read
- Can long integer routines benefit from SSE?
- practical BigNum AVX/SSE possible?
- Kogge-Stone Vector Addition
Of course AVX512 will only help if you have very large numbers. Otherwise for a 512-bit number you may have better results with scalar code
No other architectures currently have SIMD registers longer than 128 bits, so even if you can make use of SIMD on them, the cost of calculating the carry would far outweighs the cost of parallel addition. Again this is the reason x86 beats them all

- 37,963
- 15
- 156
- 475
-
1I think some AArch64 CPUs with SVE (Scalable Vector Extensions) have 256-bit vectors. (SVE intro: https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture). I'm not sure how relevant that is, because IDK if SVE has the necessary shuffles to do anything other than pure vertical operations. – Peter Cordes May 01 '19 at 06:54
-
1PowerPC has multiple FLAGS registers that I think solves the same problem as ADCX/ADOX. PowerPC64 should be pretty reasonable for BigInteger stuff on an instruction-count basis, similar to x86-64 unless there are any major gaps I'm not aware of. ARM64 aka AArch64 may be ok, too; it has 64x64 => high or low half multiply. (32-bit ARM has some 32x32 => 64-bit 2-output instructions, but AArch64 only has 32x32=>64 widening or 64x64 -> high half). But it has integer MAC (multiply-accumulate), unlike x86. `__int128 * __int128` is fewer instructions than on x86-64: https://godbolt.org/z/K1M_ZA – Peter Cordes May 01 '19 at 07:13
-
But yes, ISAs without FLAGS are definitely at a disadvantage for BigInteger: MIPS, RISC-V, and the obsolete Alpha. But that's far from universal among RISC ISAs; many RISCs care more about the real world than about RISC purity, e.g. POWER and ARM being the most obvious examples of architectures for which modern high-performance implementations exist. Of course I think x86-64 is probably still the best bet, but AFAIK it's not fundamentally better than POWER for this, unless x86 `mulx r64` (2 uops) is significantly cheaper than POWER64 `mulld` (64x64=>64) + `mulhdu` (64x64=>high 64). – Peter Cordes May 01 '19 at 07:21
-
The `pmull - Polynomial Multiply Long instructions (PMULL/PMULL2)` from ARM sounds intriguing https://en.wikichip.org/wiki/arm/armv8. I just got an ARM Cavium ThunderX2 server today with pmull so I'm starting to look into ARM tech. – Z boson May 02 '19 at 11:17
-
@Zboson yeah, the output is quite similar to PPC. I know that they have flags but I didn't know that they also have integer multiply-add – phuclv May 02 '19 at 11:17
If you work with a small number of variable-size numbers, I think POWER 6 would suit your needs best (although I didn't work with this architecture) since it provides high IPC and very high frequency (up to 5GHz).
It you work with a large number of fixed-size numbers, x86-64 would be the best choice as it has SIMD operations which work on 64-bit numbers, and you can use those operations to speed up long arithmetic operations on multiple numbers. You will likely need an SSE 4.2-capable CPU (Intel Nehalem/Westmere/Sandy Bridge, or the upcoming AMD Bulldozer) since the 64-bit compare instruction PCMPGTQ was only added in SSE 4.2
Also, these GMP benchmark results might be interesting for you

- 11,993
- 4
- 27
- 41
-
-
1Yes, but they do not support 64-bit SIMD operations, and using 32-bit SIMD operations is not worth it - simple (non-SIMD) 64-bit ALU operations will deliver better performance – Marat Dukhan Apr 23 '12 at 22:50