How to implement addition and multiplication for 256-bit unsigned integer using Intel's ADX intructions?

Question

I need to do arithmetic operations on 256-bit unsigned integer and I need a fast implementation. AFAIK, SIMD instructions do not help because of carry propagation (Can long integer routines benefit from SSE?) and Intel's ADX instructions can help.

How can I utilize Intel's ADX instructions to make addition and multiplication faster?

`adox` / `adcx` don't help for addition; there's only one dependency chain. (And it's short enough that if you have multiple independent 256-bit adds to do, out-of-order exec can overlap the dep chains of regular `adc`.) For multiply, see Intel's whitepaper on using mulx + adcx/adox; it has examples bigint mul. https://web.archive.org/web/20220901040851/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf — Peter Cordes, Apr 05 '23 at 06:17
See also [Can long integer routines benefit from SSE?](https://stackoverflow.com/a/8867025) - yes, if you're willing to use only some of the bits per SIMD element, so carry propagation can be deferred. — Peter Cordes, Apr 05 '23 at 06:22
https://godbolt.org/z/PoWTvnobo shows clang code-gen for C23 `unsigned _BitInt(256)` multiply, but unfortunately even with `-O3 -march=skylake` it only uses `mulx`+`adc`, not adox+adcx. It involves a lot of `mov` operations, but only one `setb` (aka `setc` to materialize CF into an integer to be added later). — Peter Cordes, Apr 05 '23 at 07:08
@PeterCordes I recently tried 256x256->512 mul in asm, one version [using `adc`](https://bbs.emath.ac.cn/forum.php?mod=viewthread&tid=5898&page=1#pid95535) and one using `adx`. Both have similar throughput(as it's easier tested) and should also have similar latency since each carry chain is short enough. `adc` take some extra registers though — l4m2, May 04 '23 at 19:06
@l4m2: Interesting. I guess the key difference (vs. 512x512) is that you don't run out of registers when keeping around groups of 4 partial products to add with separate chains. With a 512-bit number in 8 registers, there wouldn't be room for another 8 partial products without spill/reload. — Peter Cordes, May 04 '23 at 19:17
@PeterCordes Maybe there should be extra split doing 512x512 without adx — l4m2, May 04 '23 at 20:12
[This](https://pastebin.com/b6WgH5dM) gets slow, likely because of insufficant adders or decoder, but I'm not sure which — l4m2, May 06 '23 at 00:40

How to implement addition and multiplication for 256-bit unsigned integer using Intel's ADX intructions?

0 Answers0