
I have found that the BigInteger.ModPow function in C# is very slow compared to the BigInteger.modPow function in Java. This makes me reluctant to use C# to implement functions that perform modular exponentiation.

I have written a test program to prove it.

C#

static void Main(string[] args)
{
    BigInteger num = BigInteger.Parse("444266014606582911577255360081280172978907874637194279031281180366057");
    BigInteger m = 2;
    Console.WriteLine("Start multiply.");
    Stopwatch stopwatch = Stopwatch.StartNew();
    for (int i = 3; i <= 200000; i++)
        m *= i;
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
    stopwatch.Reset();
    Console.WriteLine("Start mod pow.");
    stopwatch.Start();
    for (int i = 0; i < 10; i++)
        BigInteger.ModPow(3, m, num);
    stopwatch.Stop();
    Console.WriteLine(stopwatch.ElapsedMilliseconds);
}

An equivalent program in Java

public static void main(String[] args) {
    BigInteger num = new BigInteger("444266014606582911577255360081280172978907874637194279031281180366057");
    BigInteger m = BigInteger.TWO;
    System.out.println("Start multiply.");
    long startTime = System.currentTimeMillis();
    for (int i = 3; i <= 200000; i++)
        m = m.multiply(BigInteger.valueOf(i));
    System.out.println(System.currentTimeMillis() - startTime);
    System.out.println("Start mod pow.");
    startTime = System.currentTimeMillis();
    for (int i = 0; i < 10; i++)
        BigInteger.valueOf(3).modPow(m, num);
    System.out.println(System.currentTimeMillis() - startTime);
}

The program consists of 2 parts:

  1. Calculate 200000! to produce a very large number m.
  2. Calculate 3 ^ m mod num 10 times.

You may change the numbers or the loop count to see whether you get different results.

Here is an execution result on my computer.

Specifications

  • CPU: Intel(R) Core(TM) i3-8100 CPU @ 3.60GHz
  • .Net Version: .NET 6.0.102
  • Java Version: 17.0.1

C#

Start multiply.
19443
Start mod pow.
35292

Java

Start multiply.
14668
Start mod pow.
3462

It shows that the BigInteger.ModPow function in C# is about 10 times slower than that in Java. Does anyone know the reason? Is that a bug?

Theodor Zoulias
Quicksilver
  • Which .NET did you use? Are you in Release or Debug mode? Related: https://github.com/mono/mono/issues/19908? – Orace Feb 12 '22 at 09:15
  • I have tried both release build and debug build. The results are almost the same. – Quicksilver Feb 12 '22 at 09:22
  • Here is where I download .NET SDK https://dotnet.microsoft.com/en-us/download/dotnet/6.0 – Quicksilver Feb 12 '22 at 09:25
  • I modified `BigInteger.valueOf(3).modPow(m, num)` to `BigInteger.valueOf(3 + i).modPow(m.add(BIvOf(i)), num.add(BIvOf(i)))` in case there was a cache. Java still outperforms .NET. – Orace Feb 12 '22 at 09:33
  • See https://medium.com/swlh/a-performance-comparison-between-c-java-and-python-df3890545f6d, but note that the C measurement used GCC, which is not the best for this purpose (ideal would be the Intel Compiler on an Intel CPU), though bear in mind I am extremely prejudiced against GCC... Also, why would you think **.Net** or **C#** should be fast? – Spektre Feb 12 '22 at 09:34
  • A release build should do optimization for me, so I think this is a fair test. I don't think C# should be fast, but I'm shocked that it's slower than Java by that much. They are both high-level languages with automatic garbage collection, so I'd expect their performance to be similar. – Quicksilver Feb 12 '22 at 09:53
  • I would recommend that you run these benchmarks with BenchmarkDotNet and JMH, which provide much more accurate results. See also https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – Sweeper Feb 12 '22 at 10:05
  • It seems .NET 6 (C#, VB, F#) is more than twice as fast as Java in general, but that is not the issue here at all. This is simply about the implementation of the math functions. A library might be much faster. – Bent Tranberg Feb 12 '22 at 10:06
  • There is an open issue about the performance of BigInteger here: https://github.com/dotnet/runtime/issues/41495 but no PR. As I understand it, the author @erik-nyström provides code and perf numbers but no comparison with the current code... – Orace Feb 12 '22 at 10:13
  • @Sweeper I'm not doing a benchmark. The elapsed times printed are just for reference. I just want to say that the function is terribly slow. You may remove all the timer code and simply feel the performance difference yourself. There is an obvious time difference! – Quicksilver Feb 12 '22 at 10:15
  • TBH I don't entirely trust that loop with 10 iterations at the end, especially given the performance difference is also about a factor of 10. I'm not saying that that's the cause, just that it might be, so please check it. Aside from that, Java's and .NET's BigInteger have plenty of algorithmic differences between them as well. – harold Feb 12 '22 at 10:25
  • I've tried removing the loop and letting it run `ModPow` just once. The results are almost the same: the time elapsed running `ModPow` 10 times is about 10 times that of running it once, for both languages. – Quicksilver Feb 12 '22 at 10:28

2 Answers


You can take a look at the .NET implementation here and the Java one here.
It appears that the Java one was more heavily studied.

The bottom line is that the .NET source shows a plain-vanilla binary exponentiation powermod algorithm, but Java's is quite sophisticated, using sliding windows and Montgomery multiplication. The Java code also carefully "leans into" its intrinsic functions and, in turn, some of those intrinsics are written specifically for big-integer arithmetic.
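For reference, plain binary exponentiation (the approach the .NET source takes) can be sketched as follows. This is a simplified illustration of the technique, not the actual library code:

```java
import java.math.BigInteger;

class BinaryModPow {
    // Simplified square-and-multiply modpow, the "plain vanilla" approach.
    // It does one modular squaring per exponent bit, plus one modular
    // multiplication per set bit. Sliding windows reduce the number of
    // multiplications, and Montgomery multiplication makes each modular
    // reduction cheaper, which is where Java's modPow gains its edge.
    static BigInteger binaryModPow(BigInteger base, BigInteger exp, BigInteger mod) {
        BigInteger result = BigInteger.ONE;
        base = base.mod(mod);
        for (int i = exp.bitLength() - 1; i >= 0; i--) {
            result = result.multiply(result).mod(mod);   // square
            if (exp.testBit(i))
                result = result.multiply(base).mod(mod); // multiply if bit set
        }
        return result;
    }
}
```

With an exponent the size of 200000! (a few million bits), the per-multiplication savings of the sophisticated approach compound over the entire loop, which is consistent with a 10x difference.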


As an experiment, I tried to port the Java code to C#. The goal was to pull apart how much of the performance difference comes from the code (algorithms and implementations), and how much comes from the difference in JIT compiler optimizations.

In order to have something to fairly compare against, this is how the Java version did on my PC:

Start multiply.
7473
Start mod pow.
1406

Here are the times for the two C# versions (running on .NET 6.0): the builtin BigInteger, and the version ported from Java. I also printed a result to verify that the ported code is valid.

Builtin:
Start multiply.
8059
Start mod pow.
15696
09F59D6D54CE55B44FDF4F4D70E81DBFC8034ECE19339BC7B922F94EA5
Ported from Java:
Start multiply.
8695
Start mod pow.
4971
00000009F59D6D54CE55B44FDF4F4D70E81DBFC8034ECE19339BC7B922F94EA5

I also replicated the experiment I suggested of doing the modpow only once, but that didn't reveal anything interesting; it just cut the time of the "modpow phase" by approximately a factor of 10.

Some observations just based on the time:

  • That is approximately the same performance ratio between the Java code and the C#-with-builtin-BigInteger code as you observed: multiply a little faster in Java, modPow more than 10x faster in Java.
  • The C# and Java versions of the modpow code differ only in minor porting details, so they are algorithmically identical, yet they performed wildly differently.
  • The ModPow ported from Java, despite being a lot slower in C# than it was in Java, still had a significant edge over the builtin System.Numerics.BigInteger version in C#.
  • The ported multiply is not only slower than it was in Java, it's also slower than what C# has built in, though not by a whole lot.

But why? Unfortunately, I can only make observations about the C# version and speculate about the Java version: I did not manage to get the assembly code for the relevant functions out of the JVM. I tried various commands such as those explained here and got nothing. Obviously, comparing two things when I cannot even look at the second one is less than ideal. I'd be happy to actually compare the two side by side if someone manages to extract the assembly of the Java methods.

  • I saw a lot of array bounds checks. Array bounds checks in the "default loop" (counting up from 0 to length - 1 and accessing the array directly with the loop counter) are often optimized out, but the Java code has a lot of backwards loops (due to using a big-endian limb order), and in the C# port of that code these bounds checks were not optimized out (hopefully that will be improved in future versions of .NET). It's possible, but I don't know for sure, that array accesses in backwards loops are optimized by e.g. Oracle HotSpot, which would give it a (usually small) edge in nearly every significant function of this code (most of them contain a backwards loop). That may be enough to explain the performance difference (between the regular C# version and the version ported from Java) in the "multiply phase". That still leaves the question of how the Java code is faster when running as actual Java code.

  • As expected, mulAdd is the single most time consuming function during the "modpow phase". My ported version of it looks like this:

     static int mulAdd(int[] _out, int[] _in, int offset, int len, int k)
     {
         ulong kLong = (uint)k;
         ulong carry = 0;
         offset = _out.Length - offset - 1;
         for (long j = len - 1; j >= 0; j--)
         {
             ulong product = (uint)_in[j] * kLong +
                           (uint)_out[offset] + carry;
             _out[offset--] = (int)product;
             carry = (product >> 32);
         }
         return (int)carry;
     }
    

    I think that's a reasonable port, staying close to the original while not using many "Java-isms" (using unsigned integers instead of the & LONG_MASK that Java uses), and the associated assembly code doesn't even look too bad... not great, it has a bunch of array bounds checks and useless movsxd instructions, but should that really cost a 3x slowdown? mulAdd has a self-time of about 2.4 seconds, so even if it's the other code that is unusually slow compared to what happens in Java, that wouldn't explain the difference: the time due to just mulAdd in C# is already more than the total time that Java spent on the entire "modpow phase".
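To make the bounds-check observation concrete, here are the two loop shapes side by side. This is shown in Java for brevity; the elimination behavior described above is about the .NET 6 JIT compiling the C# port:

```java
class LoopShapes {
    // Forward "default loop": the JIT can prove 0 <= i < a.length for
    // every access and drop the per-access bounds check.
    static long sumForward(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++)
            s += a[i];
        return s;
    }

    // Backward loop, the shape used throughout the big-endian limb code:
    // the same accesses, but (per the observation above) .NET 6 kept a
    // bounds check on every iteration.
    static long sumBackward(int[] a) {
        long s = 0;
        for (int i = a.length - 1; i >= 0; i--)
            s += a[i];
        return s;
    }
}
```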

All in all, this really isn't a full explanation; maybe it raises more questions than it answers, but at least it's another data point.


The ported code is not included in this answer because 1) it's too big, and 2) the source that I ported it from is licensed as GPLv2, which is incompatible with Stack Overflow posts. It wouldn't be a "snippet", which is the exemption that's usually used to justify such inclusions.

Orace
harold
  • Did you measure the C# version compiled in Release mode? – Theodor Zoulias Feb 16 '22 at 10:14
  • @TheodorZoulias of course, but since you asked, I went ahead and ran it in debug mode as well. The ported-from-java version gets about a factor of 3 slower in debug mode. – harold Feb 16 '22 at 10:32
  • Great work! You may try to use `Span` instead of arrays, where bounds checks are optimised. Also, why are the `in` and `out` arrays `int` and not `uint`? Don't the casts cost anything? – Orace Feb 16 '22 at 17:17
  • @Orace Well, I tried to keep it close to the Java version after all. The casts from `int` to `uint` are not a problem; for example, an excerpt from the asm: `mov edi,dword ptr [rdx+r9*4+10h] \ imul rdi,r10` (this is the load of `_in[j]` followed by the multiplication by `kLong`) shows the cast just disappeared. There is nothing to actually *do* for a cast from `int` to `uint` anyway; the bits stay the same, only their interpretation changes. – harold Feb 17 '22 at 03:19

You can take a look at the .NET implementation here and the Java one here.
It appears that the Java one was more heavily studied.

Orace
  • The bottom line is that the .NET source shows a plain vanilla binary exponentiation powermod algorithm, but Java's is quite sophisticated, using sliding windows and Montgomery multiplication. The Java code also carefully "leans into" its intrinsic functions and, in turn, some of those intrinsics are written specifically for big integer arithmetic. – President James K. Polk Feb 12 '22 at 16:48
  • Yep, the **.Net** source you linked looks like naive power by squaring (which is fine for normal numbers)... the multiplication looks slightly fishy to me, but I'm too lazy to analyze it further. – Spektre Feb 13 '22 at 06:10
  • The linked source is specifically for *small* powers; the fully general function is further down. It does use Barrett reduction (`FastReducer`), but that's not as efficient as Montgomery multiplication, and there's no sliding-window trick in the .NET version, so it ends up using not only less efficient modular multiplications but also more of them. – harold Feb 13 '22 at 10:01
  • I'll just leave a comment here for someone who wants to trace the method in .NET. The bitwise calculation entry method is [here](https://github.com/dotnet/runtime/blob/bda8c9163afdd46de4f5e5459e16d948fbda467a/src/libraries/System.Runtime.Numerics/src/System/Numerics/BigIntegerCalculator.PowMod.cs#L251). Then it goes to lines 282>298, 344>470, 497>500. For FastReducer, please see [this](https://github.com/dotnet/runtime/blob/bda8c9163afdd46de4f5e5459e16d948fbda467a/src/libraries/System.Runtime.Numerics/src/System/Numerics/BigIntegerCalculator.FastReducer.cs). – Quicksilver Feb 13 '22 at 10:20
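To illustrate the Barrett-vs-Montgomery point from the comments above: Montgomery reduction replaces the per-multiplication division by the modulus with masks, shifts, and multiplies. Here is a toy sketch using BigInteger for clarity; real implementations such as Java's work limb-by-limb on machine words, and `rBits`/`nPrime` are parameters of this illustration, not of any library API:

```java
import java.math.BigInteger;

class Montgomery {
    // Toy Montgomery reduction (REDC): given t < R*n, computes
    // t * R^-1 mod n, where R = 2^rBits > n, n is odd, and
    // nPrime = -n^-1 mod R. No division by n is needed, which is
    // why it beats Barrett reduction inside a tight modpow loop.
    static BigInteger reduce(BigInteger t, BigInteger n, int rBits, BigInteger nPrime) {
        BigInteger rMask = BigInteger.ONE.shiftLeft(rBits).subtract(BigInteger.ONE);
        BigInteger m = t.and(rMask).multiply(nPrime).and(rMask); // m = (t mod R) * n' mod R
        BigInteger u = t.add(m.multiply(n)).shiftRight(rBits);   // (t + m*n) / R, exact division
        return u.compareTo(n) >= 0 ? u.subtract(n) : u;          // u < 2n, so one subtraction suffices
    }
}
```

In a real modPow, operands are kept in "Montgomery form" (a*R mod n), so every modular multiplication in the exponentiation loop becomes a plain multiply followed by one cheap reduction like the above.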