1

If I have two large integers with thousands of digits, which instructions would be useful to multiply them? I'm assuming I will have to loop over the digits, multiply and add, like you do manual multiplication.

I don't expect any solutions here, just point me to the right instructions that would best accomplish the job. It can be any instruction from the MMX, SSE, x87 fpu etc...

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Roland
  • 7,525
  • 13
  • 61
  • 124
  • 1
    `mul` should do the job just fine. It is likely useful to vectorise this in which case `pmuldq` will be useful. MMX and the x87 FPU will likely not be useful. – fuz May 02 '20 at 09:38
  • 1
    How are they stored? Like fully packed binary with 64 value bits per 8-byte chunk? Or unpacked decimal digits with one digit per byte? Or Python-internals style with 30 bits per 32-bit integer to allow software carry techniques for languages with `adc`? See also [BigInt class in C++](https://codereview.stackexchange.com/a/237764) for a survey of some ways of doing BigInteger stuff. – Peter Cordes May 02 '20 at 09:51
  • @PeterCordes I don't know yet how I should store them. I'm grateful for suggestions. I'm doing this as a self-imposed exercise to get started(again) with assembly. – Roland May 02 '20 at 09:53
  • 2
    Generally even AVX2 is not useful because it doesn't provide carry across digits, and more importantly only has 32x32 => 64-bit multiply. [SIMD signed with unsigned multiplication for 64-bit \* 64-bit to 128-bit](https://stackoverflow.com/a/28811226). But there is stuff you can do with SIMD: [Can long integer routines benefit from SSE?](https://stackoverflow.com/a/8867025) - for large-enough numbers, FFT-based multiplication is apparently worth doing. – Peter Cordes May 02 '20 at 09:54
  • Ok, to get started you definitely need to understand the simple way first, starting with building a 128x128-bit multiply out of 64x64 => 128-bit multiplies. Look at GCC code-gen for `unsigned __in128` multiply on x86-64: https://godbolt.org/z/bCf4a2, doing low*low, and cross products. Pretty sure there's a duplicate on SO somewhere the coverse this. See also https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf which covers the standard way and shows speeding it up with `mulx` and `adc`. – Peter Cordes May 02 '20 at 09:58
  • Maybe a duplicate of [can multiprecision signed multiply be performed with imul instruction?](https://stackoverflow.com/q/20772280) which has diagrams. plain `mul` produces the high half of the full result in RDX to support extend-precision stuff. – Peter Cordes May 02 '20 at 10:02
  • 4
    Also, if you just need the job done efficiently, consider using something like `libgmp`. If you want to learn how to do it yourself, still read `libgmp` documentation. It discusses the algorithms used. That's a good starting point. – Margaret Bloom May 02 '20 at 10:56
  • 1
    You can do this more efficiently [using FFTs](https://en.wikipedia.org/wiki/Schönhage–Strassen_algorithm) – Paul R May 02 '20 at 11:13
  • 1
    A further benefit for libgmp is that it already has assembler code for many CPU architectures for the lowest level operations, and automatically switches to FFTs for large enough operands. – President James K. Polk May 02 '20 at 11:21
  • 2
    @PaulR Not really. Schönhage-Strassen really only gives an advantage over simpler methods like Karatsuba's once you have many ten-thousand digits. And getting Karatsuba faster than a straight implementation of elementary school multiplication is not trivial either, given that you can easily vectorise the latter. – fuz May 02 '20 at 12:04
  • 1
    @fuz thank you - I stand corrected. – Paul R May 02 '20 at 13:06

0 Answers0