0

I need to do arithmetic operations on 256-bit unsigned integer and I need a fast implementation. AFAIK, SIMD instructions do not help because of carry propagation (Can long integer routines benefit from SSE?) and Intel's ADX instructions can help.

How can I utilize Intel's ADX instructions to make addition and multiplication faster?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
phqb
  • 351
  • 3
  • 14
  • 1
    `adox` / `adcx` don't help for addition; there's only one dependency chain. (And it's short enough that if you have multiple independent 256-bit adds to do, out-of-order exec can overlap the dep chains of regular `adc`.) For multiply, see Intel's whitepaper on using mulx + adcx/adox; it has examples bigint mul. https://web.archive.org/web/20220901040851/https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf – Peter Cordes Apr 05 '23 at 06:17
  • 1
    See also [Can long integer routines benefit from SSE?](https://stackoverflow.com/a/8867025) - yes, if you're willing to use only some of the bits per SIMD element, so carry propagation can be deferred. – Peter Cordes Apr 05 '23 at 06:22
  • https://godbolt.org/z/PoWTvnobo shows clang code-gen for C23 `unsigned _BitInt(256)` multiply, but unfortunately even with `-O3 -march=skylake` it only uses `mulx`+`adc`, not adox+adcx. It involves a lot of `mov` operations, but only one `setb` (aka `setc` to materialize CF into an integer to be added later). – Peter Cordes Apr 05 '23 at 07:08
  • 1
    @PeterCordes I recently tried 256x256->512 mul in asm, one version [using `adc`](https://bbs.emath.ac.cn/forum.php?mod=viewthread&tid=5898&page=1#pid95535) and one using `adx`. Both have similar throughput(as it's easier tested) and should also have similar latency since each carry chain is short enough. `adc` take some extra registers though – l4m2 May 04 '23 at 19:06
  • @l4m2: Interesting. I guess the key difference (vs. 512x512) is that you don't run out of registers when keeping around groups of 4 partial products to add with separate chains. With a 512-bit number in 8 registers, there wouldn't be room for another 8 partial products without spill/reload. – Peter Cordes May 04 '23 at 19:17
  • @PeterCordes Maybe there should be extra split doing 512x512 without adx – l4m2 May 04 '23 at 20:12
  • [This](https://pastebin.com/b6WgH5dM) gets slow, likely because of insufficant adders or decoder, but I'm not sure which – l4m2 May 06 '23 at 00:40

0 Answers0