5

I'm working with AVX2 and need to compute a 64-bit x 64-bit -> 128-bit widening multiplication, keeping the 64-bit high part, as fast as possible. Since AVX2 has no such instruction, is it reasonable for me to use the Karatsuba algorithm for efficiency and gaining speed?

Z boson
  • 32,619
  • 11
  • 123
  • 226
Yigit Demirag
  • 344
  • 1
  • 12
  • This is very much dependent on the architecture. 25 years ago, I did use Karatsuba for 64x64->128 bit multiplication on 32-bit SPARC processors for some incremental performance gain. I have not looked at AVX2 yet, lacking a Haswell-class CPU in my machine here. Have you searched the literature (or the internet in general) to see what other people have found? How fast is your AVX2-based 64x64->128 bit multiply using the standard method? – njuffa Jun 26 '15 at 09:57
  • Do you really need 64bx64b to 128b? Or could you use 56bx56b to 106b? – Z boson Jun 26 '15 at 10:59
  • Actually I need 64bx64b -> 64b high bits such that http://codepaste.net/29m5qm – Yigit Demirag Jun 26 '15 at 11:07
  • You should change your tags and add AVX2 to your tags – Z boson Jun 26 '15 at 11:14
  • 1
    Also I think you should state in your question that you want 64bx64b to high 64b. – Z boson Jun 26 '15 at 11:15
  • 1
    Your question would have been more interesting I think if you asked if there was a fast way to do `high(64bx64b)` using AVX2. Your question could be considered a [xy problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) because you really want to find a faster way to do `high(64bx64b)`. – Z boson Jun 26 '15 at 13:22
  • @Zboson Does your answer at http://stackoverflow.com/a/28827013/1979163 on AVX2 make 4 high(64bx64b) calculations same time using SIMD? It was a bit hard to read for me, sorry. – Yigit Demirag Jul 01 '15 at 07:41
  • @user1979163, yes. I suggest you go through the scalar code at http://www.hackersdelight.org/hdcodetxt/muldws.c.txt before converting it to SIMD. I think my code is a bit different because the hypothetical RISC CPU they based their algorithms on has `word*word to lo(word*word)` (e.g. 32x32 to 32), whereas AVX2 only has `half-word*half-word to word` (e.g. 32x32 to 64). – Z boson Jul 01 '15 at 08:56
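For reference, the scalar schoolbook pattern from the Hacker's Delight code linked above looks roughly like this (a sketch of my own, using four 32x32->64 partial products; variable names are mine, not from the linked code):

```c
#include <stdint.h>

/* Sketch of the Hacker's Delight muldwu idea: the high 64 bits of a
   64x64-bit unsigned product built from four 32x32->64 partial
   products. This is the scalar pattern the linked SIMD answer
   vectorizes with _mm256_mul_epu32. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* low  * low  */
    uint64_t p1 = a_lo * b_hi;   /* low  * high */
    uint64_t p2 = a_hi * b_lo;   /* high * low  */
    uint64_t p3 = a_hi * b_hi;   /* high * high */

    /* carry out of the middle 32-bit column */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```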

3 Answers

8

No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes, the threshold is at the smaller end of that range, and the new ADCX/ADOX instructions likely bring it in somewhat further for scalar code, but 64x64 is still too small to benefit from Karatsuba.
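For context, Karatsuba trades four half-size multiplies for three plus extra additions. A toy illustration at 32x32->64 (my own sketch, not from this answer) shows the bookkeeping that appears even at this scale, which is why the trick only pays off for much wider operands:

```c
#include <stdint.h>

/* Karatsuba at toy scale: a 32x32->64 multiply from three 16x16
   multiplies instead of the schoolbook four. The extra additions and
   the wider middle term (a_hi+a_lo can be 17 bits) are the overhead
   that makes Karatsuba a loss at small operand sizes. */
static uint64_t karatsuba32(uint32_t a, uint32_t b) {
    uint64_t a_hi = a >> 16, a_lo = a & 0xFFFF;
    uint64_t b_hi = b >> 16, b_lo = b & 0xFFFF;

    uint64_t z2 = a_hi * b_hi;                  /* multiply 1 */
    uint64_t z0 = a_lo * b_lo;                  /* multiply 2 */
    uint64_t z1 = (a_hi + a_lo) * (b_hi + b_lo) /* multiply 3 */
                - z2 - z0;

    return (z2 << 32) + (z1 << 16) + z0;
}
```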

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
  • 1
    This answers the OPs question but I think it's more interesting if the OP asked if there was a way to beat `mulx` for `high(64bx64b)` using AVX2 (which is how I chose to interpret the OPs question; i.e. as xy problem). I think it might be possible in the floating point domain using double-double arithmetic (assuming the OP could stay in the float domain for most of the algorithm). – Z boson Jun 26 '15 at 13:17
  • If the questioner is really doing integer arithmetic, and the multiply is a nontrivial part of the code in question, it isn't possible to beat MULX. Using FMA is cute but suffers from two pain points if task is fundamentally integer in nature: (a) rounding of the operands and the high product and (b) if the inputs aren't always full width, the products end up having different exponents, which makes keeping subsequent operations exact painful. – Stephen Canon Jun 26 '15 at 13:41
  • You're probably right. I have not tried to implement this yet; it's an idea I have been considering. The rounding in some sense could be an up-front cost, which is what I meant by staying in the float domain for most of the algorithm. The different exponents are a problem. Perhaps a bigger problem is if addition is needed as well: double-double addition is slow, moving back to the integer domain to do addition is another overhead, and AVX2 compare-and-add would be lucky to break even with ADCX anyway. – Z boson Jun 26 '15 at 13:54
  • From my experience, the "useful" range for Karatsuba is shrinking very rapidly on modern processors. Stuff like MULX and ADX are making the basecase algorithm faster and faster thus pushing up the basecase/karatsuba threshold. On the other end, Floating-Point FFT is getting the full benefit of SIMD which is eating away at the Karatsuba/FFT threshold and pushing it down. – Mysticial Jun 26 '15 at 14:06
  • @Zboson: In a 32b process, it's a *much* more attractive option, because the widest scalar multiplier you have is 32x32 -> 64. But the real solution there is to just compile in 64b already =). – Stephen Canon Jun 26 '15 at 14:07
  • At the rate it's going, these thresholds will cross sometime in the next few generations of processors - probably even with AVX512-IFMA. IOW, Karatsuba is probably headed for the history books. – Mysticial Jun 26 '15 at 14:08
  • @Mysticial: I'm not sure I buy that (not yet, anyway). Last time I looked, Karatsuba still had the advantage for mainstream RSA keysizes (2k and 4k), which is a pretty important use case (leaving aside the discussion of whether or not it *should* be). I don't see IFMA being enough to change that (at least for 4k). – Stephen Canon Jun 26 '15 at 14:10
  • @StephenCanon 2k is almost at the point where Floating-Point FFT is faster than Karatsuba (at least for my implementations). – Mysticial Jun 26 '15 at 14:12
  • I should also point out that the general distrust of floating-point and SIMD amongst a cross-section of kernel and security folks (some reasonable and some misplaced) will undoubtedly keep Karatsuba in use for quite a while after it's superseded in performance. – Stephen Canon Jun 26 '15 at 14:17
  • @StephenCanon That's a good point. Some people avoid it even with provably correct bounds. – Mysticial Jun 26 '15 at 14:19
  • So @StephenCanon thanks for your answer. As Z boson pointed out, what would be your solution for getting high(64bx64b) using SIMD on AVX2? – Yigit Demirag Jun 26 '15 at 15:23
  • @user1979163: If you want to stay in AVX2, look at Zbozon's answer to another question: http://stackoverflow.com/a/28827013/142434 – Stephen Canon Jun 26 '15 at 15:39
  • @StephenCanon His answer calculates 4 high(64bx64b) multiplications at the same time using SIMD on AVX2, right? (I modified it to AVX2 as explained but did not understand the algorithm clearly.) – Yigit Demirag Jul 01 '15 at 07:47
  • @user1979163: Correct. – Stephen Canon Jul 01 '15 at 10:46
4

It's highly unlikely that AVX2 will beat the mulx instruction, which does 64bx64b to 128b in one instruction. The one exception I'm aware of is large multiplications using floating-point FFT.

However, if you don't need exactly 64bx64b to 128b you could consider 53bx53b to 106b using double-double arithmetic.

To multiply four pairs of 53-bit numbers a and b to get four 106-bit products takes only two instructions:

__m256d p = _mm256_mul_pd(a,b);     // p = a*b rounded to 53 bits
__m256d e = _mm256_fmsub_pd(a,b,p); // e = a*b - p computed exactly: the rounding error of p

This gives four 106-bit products, each represented as a `p`/`e` pair, in two instructions, compared to one 128-bit product in one instruction using mulx.

Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • I see. I will try 53x53 multiplication. Do you think that Karatsuba algorithm can provide faster solution? – Yigit Demirag Jun 26 '15 at 12:03
  • @user1979163, I know little about the Karatsuba algorithm (which is why I did not say anything about it), but if I understand it correctly it's useful for very large integer multiplication, much larger than what the instruction set can do (64x64 to 128 with mulx). You only asked about high(64bx64b). – Z boson Jun 26 '15 at 12:20
  • @ElderBug, please reread my answer. I show how to do integer multiplication using the floating point hardware. – Z boson Jun 26 '15 at 12:21
  • Indeed. Doesn't that mean that your second code is for high(53bx53b)=53b + 11b, and not high(64bx64b) ? Also, are you sure it's 53b ? The mantissa in `double` is 52b ... – ElderBug Jun 26 '15 at 13:00
  • One bit is implicit in floating point, so it's 52+1. It's high(64bx64b) because the integer (e.g. with `double ad = a`) is converted to the high 53 bits, not the low 53 bits. – Z boson Jun 26 '15 at 13:13
  • @ElderBug, I suggest you read the wiki link for double-double I linked to in my answer. – Z boson Jun 26 '15 at 13:26
  • Okay but, I mean, if the operands are larger than 53b, you will lose precision in the double form. Which means the integer part in the double form can't be more than 53b. In other words, as soon as you convert a 64b to double, you can't convert it back to 64b if it was larger than 53b, which means you can't retain 64bx64b in double form. – ElderBug Jun 26 '15 at 13:37
  • 1
    I'm pretty sure all the 64+64 bits enter the high(64bx64b). Just a quick example (with 32bx32b): A = 0xE567 89AB, B = 0xC123 4567, high(AxB) = 0xAD12 AA22. Now if I remove just 1 bit from A and B: A = 0xE567 89AA, B = 0xC123 4566, high(AxB) = 0xAD12 AA21. The 2x1b in the 2x32b entered the high(32bx32b). Or is there something wrong in my reasoning? – ElderBug Jun 26 '15 at 13:56
  • @ElderBug, you're correct, I don't know how to fix it right now. If I have time I'll look into it later. If you upvoted me please remove it. – Z boson Jun 26 '15 at 14:20
  • 1
    @Zboson: You can't fix it, sadly (or at least, not with reasonable performance). You really do need all the input bits to compute a correct multiply high (this is what I was referring to when I mentioned "rounding of the operands"). – Stephen Canon Jun 26 '15 at 14:23
  • AVX-512IFMA52 has instructions for doing 52bitx52bit integer multiplies. (presumably using the FPU hardware vector multipliers, hence the 52bits.) See `vpmadd52huq` (`_mm512_madd52hi_epu64`) in Intel's intrinsics guide, for example. There's also a mullo of packed 64bit ints, but I don't see a mulhi. – Peter Cordes Jul 03 '15 at 17:29
  • Also, `mul` and `imul` are 3 cycle latency (for the 64bit one-operand form that produces a 128bit result in `RDX:RAX`), but `mulx` has 4 cycle latency. And `mul` doesn't require any CPUID flags beyond 64bit mode. – Peter Cordes Jul 03 '15 at 17:39
  • 1
    @Zboson - I'm not following the snippet given in the answer. `fmsub(x,y,z)` is `x * y - z`, right? So as far as I can tell the snippet is `p = a * b; e = a * b - (p) = a * b - (a * b) = ~0`, right? – BeeOnRope Jan 05 '17 at 23:06
  • @BeeOnRope [fma has (virtually) infinite precision of the intermediate result](http://stackoverflow.com/a/18239795/2542702). So `fma(a,b,-a*b)` is `(a*b)_fullprecision - (a*b)_limitedprecision`. My guess is `p` means product and `e` means error. So `fma(a,b,-a*b)` gives you the error in calculating `(a*b)_limitedprecision`. – Z boson Jan 06 '17 at 08:28
  • [See the comment by @StephenCanon](http://stackoverflow.com/questions/31069291/is-it-really-efficient-to-use-karatsuba-algorithm-in-64-bit-x-64-bit-multiplicat/31072201?noredirect=1#comment50168016_31073696). The idea in my answer is interesting but maybe not practical except possibly in some unusual cases. When I wrote this answer I had spent a lot of time with `double-double` arithmetic, and the idea occurred to me that it could also be used for higher precision integer multiplication. – Z boson Jan 06 '17 at 08:34
  • 1
    @Zboson There is a way to make it work. But from my tests, it's not quite enough to overcome `MULX/ADCX/ADOX`. So I look forward to AVX512. But it will be short-lived since AVX512-IFMA will supersede it. – Mysticial Jan 06 '17 at 19:13
  • The problem with the mul+fma method by itself is that `p` and `e` are saturating all 53 bits of precision. This isn't useful because you can't do anything to them. If you add them to an accumulator they'll overflow and lose precision. Even worse, they are beyond the range of the [fast double->int64 conversion trick](http://stackoverflow.com/a/41148578/922184). – Mysticial Jan 06 '17 at 19:13
  • 1
    The trick is to restrict your inputs to < 53 bits, then round `p = a*b` before the fma that produces `e`. But to make this work, you need to scale your inputs properly so that `a*b` is split down the middle by the decimal place. The sequence is: `p = a*b; p = round(p); e = fma(-a,b,p)` But then your outputs `p` and `e` are also scaled differently. Ultimately, the overhead of setting and correcting the scaling is what makes it slower than the usual scalar approach. There are tricks to eliminate some of the scaling, but it's hard to eliminate enough of it to make the approach viable. – Mysticial Jan 06 '17 at 19:30
  • @Mysticial if you have time could you provide an answer [here](http://stackoverflow.com/q/41403718/2542702). It would be interesting to see how to go from double-double to lo and high int64. I may try in work it out myself in the next two weeks. – Z boson Jan 06 '17 at 19:41
3

It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which supports 64x64=128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that you need to move the operands to general-purpose registers if they were in YMM registers, so a single 64x64=128 would look like LOAD + MUL + STORE.

If you can vectorize Karatsuba with AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, a single MUL will probably be faster. If you can avoid the loads and stores to general-purpose registers, a single MUL will definitely be faster.

Both MUL and AVX2 instructions can take a memory operand with the same throughput, which may save one load for MUL.
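In C, the single-MUL path this answer describes falls out of the `unsigned __int128` extension (GCC/Clang; this is the same construct the questioner's linked snippet uses):

```c
#include <stdint.h>

/* The scalar route described above: GCC and Clang lower this to a
   single MUL (or MULX with -mbmi2) that leaves the 128-bit product
   in a register pair, then a shift is just reading the high half. */
static uint64_t mulhi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```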

ElderBug
  • 5,926
  • 16
  • 25
  • Most of the time is spent in the part: `__uint128_t product = ((__uint128_t)a)*((__uint128_t)b);` What can you say about this answer then? http://stackoverflow.com/a/24575626/1979163 – Yigit Demirag Jun 26 '15 at 12:51
  • 1
    @user1979163 Did you enable all optimizations? With GCC, the whole function translates to only 4 assembly instructions (`mov`,`mov`,`mul`,`mov`), and that's with inlining disabled. The 4 instructions should have good parallel throughput inside the CPU. I'm not sure you can do better. – ElderBug Jun 26 '15 at 13:11
  • Yes I enabled all optimizations and compiled as g++ -mavx2 -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 mulhiExample.cpp -std=c++11 -fabi-version=0 . Can you provide the code you compiled and flags you used? – Yigit Demirag Jun 26 '15 at 14:01
  • @user1979163 I used the code you provided, with `static inline` removed. I got the same assembly with your options, so you probably have the same result. Is this function only used in the same .cpp file? If it is used elsewhere, it won't be inlined, even with `inline`, and that can cost a lot in this case. To enable inlining, put the function definition in the .h, or compile with -flto. If you were only using this function in the same .cpp file, you're probably out of luck. – ElderBug Jun 26 '15 at 14:09
  • @user1979163 Note that if you use -flto, you need to pass -O3 to the linker command as well, or it will be useless. – ElderBug Jun 26 '15 at 14:12