
I have a few different solutions to Project Euler problem 5, but the execution-time difference between the two languages/platforms in this particular implementation intrigues me. I didn't do any optimization with compiler flags, just plain javac (via the command line) and csc (via Visual Studio).

Here's the Java code. It finishes in 55ms.

public class Problem005b
{
    public static void main(String[] args)
    {
        long begin = System.currentTimeMillis();
        int i = 20;
        while (true)
        {
            if (
                    (i % 19 == 0) &&
                    (i % 18 == 0) &&
                    (i % 17 == 0) &&
                    (i % 16 == 0) &&
                    (i % 15 == 0) &&
                    (i % 14 == 0) &&
                    (i % 13 == 0) &&
                    (i % 12 == 0) &&
                    (i % 11 == 0)
                )
            {
                break;
            }
            i += 20;
        }
        long end = System.currentTimeMillis();
        System.out.println(i);
        System.out.println(end-begin + "ms");
    }   
}

Here is the equivalent C# code. It finishes in 320ms.

using System;

namespace ProjectEuler05
{
    class Problem005
    {
        static void Main(String[] args)
        {
            DateTime begin = DateTime.Now;
            int i = 20;
            while (true)
            {
                if (
                        (i % 19 == 0) &&
                        (i % 18 == 0) &&
                        (i % 17 == 0) &&
                        (i % 16 == 0) &&
                        (i % 15 == 0) &&
                        (i % 14 == 0) &&
                        (i % 13 == 0) &&
                        (i % 12 == 0) &&
                        (i % 11 == 0)
                    )
                    {
                        break;
                    }
                i += 20;
            }
            DateTime end = DateTime.Now;
            TimeSpan elapsed = end - begin;
            Console.WriteLine(i);
            Console.WriteLine(elapsed.TotalMilliseconds + "ms");
        }
    }
}
khelwood
rianjs
  • Well... one's a `long`, the other is a `DateTime` struct. – BoltClock May 10 '11 at 15:14
  • @Bolt, those aren't being used in the loop at all. – jjnguy May 10 '11 at 15:15
  • If C# doesn't do lazy evaluation of conditionals, that might cause it. – Sam Dufel May 10 '11 at 15:16
  • I can't imagine allocating two DateTime objects on the heap is THAT much slower. – rianjs May 10 '11 at 15:16
  • @Sam, do you mean short-circuit evaluation? – jjnguy May 10 '11 at 15:17
  • By the way, [least common multiple](http://en.wikipedia.org/wiki/Least_common_multiple). – hammar May 10 '11 at 15:17
  • Is your Visual Studio solution building in Debug or Release? If it is Debug then is your javac build also a debug build? I think you are better off with a Release build in Visual Studio but turn off optimizations for a better compare. – Mike Two May 10 '11 at 15:18
  • @jjnguy: That's the proper term. But it's moot as C# (just like about every other programming language) uses short-circuit evaluation @Sam. – May 10 '11 at 15:18
  • @Both-of-you It doesn't matter if it is being used in the loop at all. Differences in the types and how they are handled by the runtime environment can impact the performance. Also the DateTime struct is a value object and is little more than a long anyway. It should make little to no noticeable difference between using a c# long and a c# DateTime. – maple_shaft May 10 '11 at 15:18
  • Time calculation and printing is *after* the end time is recorded - so that can't be the difference. – Hans Kesting May 10 '11 at 15:18
  • I got 65ms for your java code and 73ms for c# – Mike Two May 10 '11 at 15:20
  • @Mike Two, I changed over to the release build type and it's now around 280ms across ~5 runs. Did you do anything to get the C# that much quicker? I just did a vanilla build. – rianjs May 10 '11 at 15:26
  • @rianjs - No, I just cut and pasted your code into a new console application project. I admit I'm on one heck of a machine, but still. – Mike Two May 10 '11 at 15:35
  • @Mike Two, Interesting. I'm in the middle of a 10,000 run loop in the Java implementation to get an average to minimize the impact of VM setup, etc costs. C# is next using StopWatch. – rianjs May 10 '11 at 15:59
  • @Mike Two, the C# run finished, and I was unable to get the results anywhere near as comparable as you. I've updated my OP to include the results, and breakdown as I saw it. – rianjs May 10 '11 at 17:16
  • Quick question. Are you running on a 32 or 64 bit platform? – ShuggyCoUk May 10 '11 at 20:07
  • I ask because if the people like me seeing comparable performance are running in 64 bit mode, it may be that what makes the difference is efficient enregistration and that, on x86, the Java JIT is better at this. – ShuggyCoUk May 10 '11 at 20:08
  • I'm running Windows 7, 64-bit on a dual-core AMD Athlon 64 X2 3800+. The C# binary was 32-bit. The Java binary is 64 bit. I could try recompiling as a 64-bit binary in C# and see if things are any faster. – rianjs May 10 '11 at 20:43
  • OK, wow, compiling/running it as a 64-bit binary (anycpu) results in an average runtime of 84ms across 1,000 runs. – rianjs May 10 '11 at 21:09
  • Have you run your C# tests without the debugger and in release mode? (Release -> Ctrl-F5) – Pop Catalin May 10 '11 at 23:04
  • I get 188-190 ms consistently for the C# code compiled for x86, and 50-64 ms consistently for the C# code compiled as "Any CPU" running on a 64-bit platform. And this is *not* a new machine by any stretch of the imagination. As others have said, please make sure that *all* benchmarking is done with optimizations turned *on*! In Visual Studio, that means compiling in "Release" mode, and starting the executable from outside of the IDE without the debugger attached. By default, JIT optimizations are disabled when the debugger is attached; not sure if that's the case for Java or not. – Cody Gray May 10 '11 at 23:27
  • DateTime isn't that accurate for benchmarking (as stated [here](http://msdn.microsoft.com/en-us/library/system.datetime.now.aspx); read the remarks); take a look at the [Stopwatch class](http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.aspx). – George Duckett May 10 '11 at 15:20

7 Answers

  1. To time code execution, you should use the Stopwatch class.
  2. Also, you have to account for the JIT, the runtime, etc., so let the test run a sufficient number of times (like 10,000 or 100,000 times) and get some sort of average. It is important to run the code multiple times, not the program. So write a method, and loop in the main method to get your measurements (see the benchmarking sketch after this list).
  3. Remove all debugging artifacts from the assemblies and let the code run stand-alone in a Release build.
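
A minimal sketch of such a harness on the Java side (the class name, warm-up count, and iteration count are illustrative, not from the answer); on the C# side the same pattern would wrap the inner loop with System.Diagnostics.Stopwatch:

public class Benchmark005
{
    // The computation under test, pulled out of main() so it can be timed in isolation.
    static int solve()
    {
        int i = 20;
        while (true)
        {
            if ((i % 19 == 0) && (i % 18 == 0) && (i % 17 == 0) &&
                (i % 16 == 0) && (i % 15 == 0) && (i % 14 == 0) &&
                (i % 13 == 0) && (i % 12 == 0) && (i % 11 == 0))
            {
                break;
            }
            i += 20;
        }
        return i;
    }

    public static void main(String[] args)
    {
        final int runs = 10000;   // illustrative iteration count
        int result = 0;

        // Warm-up pass so the JIT has compiled solve() before timing starts.
        for (int n = 0; n < 100; n++)
        {
            result = solve();
        }

        long begin = System.nanoTime();
        for (int n = 0; n < runs; n++)
        {
            result = solve();
        }
        long end = System.nanoTime();

        System.out.println(result);
        System.out.println(((end - begin) / 1e6 / runs) + " ms per run (average)");
    }
}
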
Femaref
  • I did ~10 runs of both. The ms ticks on both are approximate averages. They don't really vary much between runs, maybe +- 10ms at most. Either way, I can't imagine it accounts for a ~6x difference. – rianjs May 10 '11 at 15:20
  • Using the stopwatch or DateTime is irrelevant, however running the test multiple times is important, therefore +1. – Nick May 10 '11 at 15:21
  • It is quite relevant, DateTime is less accurate – MattDavey May 10 '11 at 15:22
  • `DateTime` is inaccurate for this kind of measurement in the millisecond area. – Femaref May 10 '11 at 15:23
  • OK, I'll do some runs using StopWatch. I still can't imagine it's going to be that significant. I'll update the OP in a few minutes. Edit: the MSDN article says its resolution is ~10ms. That's not going to account for the difference by a long shot. – rianjs May 10 '11 at 15:24
  • @MattDavey: not if, as is suggested, the number of times the loop is run is 10,000 or 100,000. – Nick May 10 '11 at 15:25
  • Femaref definitely has a point! Especially with point 2. Also run only the computational part in the loop and not the whole main(). – Jan Zyka May 10 '11 at 15:30
  • "I did ~10 runs of both" - the key is running the code being benchmarked N times, rather than running the program N times, to remove any startup costs of the virtual machine and platform. – matt b May 10 '11 at 15:31
  • @Nick it depends if he uses DateTime to measure the entire execution time as a whole, or measures each test iteration and calculates an average. – MattDavey May 10 '11 at 15:45
  • @MattDavey: Indeed. I'd suggest to him to time the whole execution rather than each individual run otherwise no matter what timing mechanism used, cumulative errors will be introduced. – Nick May 10 '11 at 15:53
  • Just a preliminary update: I've addressed all three issues you've raised (thanks!), and while I'm still waiting for the C# code to finish, I can say for sure that it's safe to rule out debugging stuff, StopWatch inaccuracy, and setting up runtime/JIT stuff. I'm at 20 minutes and counting so far, whereas the Java solution took ~7 minutes. – rianjs May 10 '11 at 16:40

There are a few optimizations possible. Maybe the Java JIT is performing them and the CLR is not.

Optimization #1:

(x % a == 0) && (x % b == 0) && ... && (x % z == 0)

is equivalent to

(x % lcm(a, b, ... z) == 0)

So in your example the comparison chain could be replaced by

if (i % 232792560 == 0) break;

(but of course if you've already calculated the LCM, there's little point in running the program in the first place!)
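
As an aside (not part of the original answer), the 232792560 constant can be derived rather than hard-coded; a minimal Java sketch using the usual gcd reduction, lcm(a, b) = a / gcd(a, b) * b, over 11..20:

public class LcmCheck
{
    static long gcd(long a, long b)
    {
        return b == 0 ? a : gcd(b, a % b);
    }

    public static void main(String[] args)
    {
        long lcm = 1;
        for (int n = 11; n <= 20; n++)
        {
            lcm = lcm / gcd(lcm, n) * n;   // lcm(a, b) = a / gcd(a, b) * b
        }
        System.out.println(lcm);           // prints 232792560
    }
}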

Optimization #2:

This is also equivalent:

if (i % (14549535 * 16) == 0) break;

or

if ((i % 16 == 0) && (i % 14549535 == 0)) break;

The first division can be replaced with a mask and compare against zero:

if (((i & 15) == 0) && (i % 14549535 == 0)) break;

The second division can be replaced by a multiplication by the modular inverse:

final long LCM = 14549535;
final long INV_LCM = 8384559098224769503L; // == 14549535**-1 mod 2**64
final long MAX_QUOTIENT = Long.MAX_VALUE / LCM;
// ...
if (((i & 15) == 0) &&
    (0 <= (i>>4) * INV_LCM) &&
    ((i>>4) * INV_LCM < MAX_QUOTIENT)) {
    break;
}

It is somewhat unlikely that the JIT is employing this, but it is not as far-fetched as you might think - some C compilers implement pointer subtraction this way.

finnw

The key to bringing these two closer together is ensuring that the comparison is fair.

First of all, eliminate the costs associated with running Debug builds and loading pdb symbols, as you did.

Next you need to ensure that no initialization costs are being counted. Obviously these are real costs, and may matter to some people, but in this instance we are interested in the loop itself.

Next you need to deal with platform-specific behaviour. If you are on a 64-bit Windows machine, you may be running in either 32-bit or 64-bit mode. In 64-bit mode the JIT is different in many respects, often altering the resulting code considerably. Specifically, and I would guess pertinently, you get access to twice as many general-purpose registers.

In this case the inner section of the loop, when naively translated into machine code, would need to load the constants used in the modulo tests into registers. If there are not enough registers to hold everything needed in the loop, it must pull them in from memory. Even coming from the level 1 cache, this would be a significant hit compared to keeping everything in registers.

In VS 2010, MS changed the default target from anycpu to x86. I have nothing like the resources or customer-facing knowledge of MSFT, so I won't try to second-guess that. However, anyone doing the kind of performance analysis you are doing should certainly try both.

Once those disparities are ironed out, the numbers seem far more reasonable. Any further differences likely require better than educated guesses; they would need investigation into the actual differences in the generated machine code.

There are several things about this that I think would be interesting for an optimising compiler.

  • The ones finnw already mentioned:
    • The LCM option is interesting, but I can't see a compiler writer bothering.
    • The reduction of division to multiplication and masking.
      • I don't know enough about this, but others who have tried it note that the divider on the more recent Intel chips is significantly better.
      • Perhaps you could even arrange something complex with SSE2.
      • Certainly the modulo 16 operation is ripe for conversion into a mask or shift.
  • A compiler could spot that none of the tests have side effects.
    • It could speculatively try to evaluate several of them at once; on a superscalar processor this could pump things along quite a bit faster, but it would depend heavily on how well the compiler's layout interacted with the out-of-order execution engine.
  • If register pressure was tight, you could implement the constants as a single variable, set at the start of each loop and then incremented as you go along.

These are all utter guesses and should be viewed as idle meanderings. If you want to know for sure, disassemble it.

ShuggyCoUk

(Moved from OP)

Changing the target from x86 to anycpu has lowered the average execution time to 84ms per run, down from 282ms. Maybe I should split this off into a second thread?

UPDATE:
Thanks to Femaref below, who pointed out some testing problems; indeed, after following his suggestions, the times are lower, indicating that VM setup time was significant in Java, but probably not in C#. In C#, it was the debug symbols that were significant.

I updated my code to run each loop 10,000 times, and only output the average in ms at the end. The only significant change I made was to the C# version, where I switched to the Stopwatch class for greater resolution. I stuck with milliseconds because it's good enough.

Results:
The testing changes don't explain why Java is (still) so much faster than C#. C# performance was better, but this can be explained entirely by removing the debug symbols. If you read Mike Two's and my exchange in the comments attached to this OP, you'll see I got a ~280ms average in five runs of the C# code just by switching from Debug to Release.

Numbers:

  • A 10,000 count loop of the unmodified Java code gave me an average of 45ms (down from 55ms)
  • A 10,000 count loop of the C# code using the StopWatch class gave me an average of 282ms (down from 320ms)

All of this leaves the difference unexplained. In fact, the differential got worse. Java went from being ~5.8x faster to ~6.2x faster.

Robert Harvey

This is too short a task to do proper timing for. You need to run both at least 1000 times and see what happens. It kind of looks like you're running these from a command line, in which case you're possibly comparing the JIT compilers for both. Try putting both behind buttons in a simple GUI and have that button loop over this a few hundred times at least before returning the elapsed time. Even ignoring JIT compiling, the timing could be thrown off by the granularity of the OS scheduler.

Oh, and because of JIT... only count the SECOND result of a button press. :)

darron

Perhaps because the construction of the `DateTime` objects is much more expensive than the call to `System.currentTimeMillis()`.

BertNase
  • I highly doubt it. DateTime is a value object and is little more than a long value with multiple calculated fields. – maple_shaft May 10 '11 at 15:22

In Java I would use System.nanoTime(). Any test which takes less than 2 seconds should be run for longer. It is worth noting that Java is pretty good at optimising inefficient code or code which does nothing. A more interesting test would be if you optimised the code.

You are trying to get a solution which you can determine without using a loop, i.e. a problem which would be done better another way.

You want the product of the factors of 11 to 20, which are 2,2,2,2,3,3,5,7,11,13,17,19. Multiply these together and you have the answer.
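
A minimal Java sketch of that arithmetic (the class name is illustrative):

public class FactorProduct
{
    public static void main(String[] args)
    {
        // The prime factors listed above: 2^4 * 3^2 * 5 * 7 * 11 * 13 * 17 * 19
        int[] factors = {2, 2, 2, 2, 3, 3, 5, 7, 11, 13, 17, 19};
        long product = 1;
        for (int f : factors)
        {
            product *= f;
        }
        System.out.println(product);   // 232792560
    }
}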

Peter Lawrey
  • You'll note that the first sentence of my post says "I have a few different solutions to Project Euler problem 5, but the execution time difference between the two languages/platforms in this particular implementation intrigues me." – rianjs May 10 '11 at 15:56
  • Apart from repeating the claim that Java is good at optimising inefficient solutions, it is worth noting that Java will compile a single loop to native code, and perhaps the C# takes 6x as long before it decides to do this. i.e. you could be testing how long it takes for the JIT to kick in ;) – Peter Lawrey May 10 '11 at 16:26
  • Haha, so far not yet. ;) I'm running a loop of 10,000 in C#, and we're up over 20 minutes already... I'll update the OP when it's done with the results. So far, the top-rated solution is pointing out test methodology defects, which I've eliminated. The huge differential remains. – rianjs May 10 '11 at 16:38
  • @rianjs, I suggest you try simplifying the loop to see what happens. e.g. take out one expression at a time, until you have a single `%`. – Peter Lawrey May 10 '11 at 16:49