
I have been comparing raw CPU performance between three mainstream languages (code and results are below). I am very curious how they compare for raw computational power. I had a theory that Java and C# could possibly rival C++ when there is no memory-allocation overhead involved.

My questions:

1) Edited (C++ timings now more realistic)

2) Am I right in thinking the JVM took ages on the first iteration, but that by the second it had finished analysing and therefore optimised? How did Hotspot know to finish optimising after the first iteration of my outer loop, and not halfway through?

3) Why does C# not behave like Java and optimise heavily at the start? What is different about C# compared with Java? Why is C# slower? Is it simply due to less optimisation?

4) Is there any specific reason for the oscillation between 2246 and 2262 milliseconds in the C# timings? Could the two distinct times be related to the CPU having two cores?

EDIT: Updated code to show Stopwatch usage in the C# code.

EDIT: Corrected C++ timing code and results.
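For reference, a common way to reduce JIT warm-up distortion in micro-benchmarks like this is to run untimed warm-up iterations before measuring. A minimal sketch in Java (the `work` method and iteration counts are illustrative, not from the original code):

```java
public class WarmupSketch {
    // Hypothetical workload; it returns a value so the JIT cannot
    // prove the loop result is unused and delete the loop entirely.
    static long work(long n) {
        long c = 0;
        for (long i = 0; i < n; i++) {
            c++;
        }
        return c;
    }

    public static void main(String[] args) {
        long sink = 0;

        // Untimed warm-up runs give HotSpot a chance to JIT-compile work()
        // before any measurement is taken.
        for (int i = 0; i < 5; i++) {
            sink += work(10_000_000L);
        }

        // Only now do we time the already-compiled method.
        long start = System.nanoTime();
        sink += work(10_000_000L);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // Printing sink keeps the result observable.
        System.out.println(elapsedMs + " ms (sink=" + sink + ")");
    }
}
```

Consuming the result (`sink`) matters as much as the warm-up: an optimizer is entitled to remove a loop whose result is never used.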

The setup:

  • C++: VS2010 and Intel Compiler (built in release mode, Optimization: O2, Enable intrinsic functions: yes, favour size or speed: neither, omit frame pointers: no, enable fiber-safe optimizations: no, whole program optimization: yes)

  • Java: Eclipse, Hotspot 64 bit compiler version 17, Java 1.6

  • C#: VS2010 and .net 4.0 (built in release mode)

  • CPU: Intel E6600 (2.4GHz) running at 2.7GHz, bus speed 300MHz, 8GB memory, DRAM Freq: 375MHz

  • Win 7 (64 bit)

C++ code:

#include "stdafx.h"
#include <iostream>
#include <stdio.h>
#include <windows.h>
#include <mmsystem.h>
#include <fstream>

using namespace std;


double PCFreq = 0.0;
__int64 CounterStart = 0;

void StartCounter()
{
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        cout << "QueryPerformanceFrequency failed!\n";

    PCFreq = li.QuadPart;

    QueryPerformanceCounter(&li);
    CounterStart = li.QuadPart;
}
double GetCounter()
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return double(li.QuadPart-CounterStart)/PCFreq;
}

static long counter = 0;

int _tmain(int argc, _TCHAR* argv[])
{

    for (int m = 0; m < 10; m++)
    {
        StartCounter();
        counter = 0;

        for (int j = 0; j < 3; j++)
        {
            //Just to test timing is working correctly
            //int* p = new int;

            for (long i = 0; i < 200000000; i++)
            {
                counter++;
            }
        }

        cout << GetCounter()*1000000 << " microseconds" << endl;
    }


    int p = 0;
    cin >> p;
    return 0;
}

C++ results:

7.19 microseconds
1.89
2.27
1.51
4.92
10.22
10.22
9.84
9.84
10.6

Java code:

public class main {

    static long counter = 0;

    public static void main(String[] args) {

        for(int m=0; m<10; m++){
            long start = System.nanoTime();
            counter = 0;

            for(int j=0;j<3; j++){
                for(long i=0; i<200000000; i++){
                    counter++;
                }
            }

            System.out.println(((System.nanoTime()-start)/1000000) + " ms");
        }
    }
}

Java results:

5703 milliseconds
471 ms
468 ms
467 ms
469 ms
467 ms
467 ms
467 ms
469 ms
464 ms

C# code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace t1
{
    class Program
    {
        static long counter = 0;

        static void Main(string[] args)
        {
            for (int m = 0; m < 10; m++)
            {
                Stopwatch s = new Stopwatch();
                s.Start();
                counter = 0;

                for (int j = 0; j < 3; j++)
                {

                    for (long i = 0; i < 200000000; i++)
                    {
                        counter++;
                    }

                }
                s.Stop();
                Console.WriteLine(s.Elapsed.TotalMilliseconds + " ms");
            }

            Console.ReadLine();
        }
    }
}

C# results:

2277 milliseconds
2246 ms
2262 ms
2246 ms
2262 ms
2246 ms
2262 ms
2246 ms
2262 ms
2262 ms

user997112
    For the C++ version, check the assembly output by your compiler. The loop could be optimized out entirely. – Mat Apr 21 '12 at 14:56
  • The C++ one probably optimized away the looping (since you were looping a constant amount). To the code itself, all it cares about is the counter value, and since the compiler detects it will always loop the exact same (3 * 200M), it can completely remove it. – wkl Apr 21 '12 at 14:56
  • You don't need a great compiler to constant-fold that loop away, *I* could write a pass that does that (for primitive types). I suppose the only reason the JVM/CLR JIT doesn't do the same thing is some problem with your Java/C# benchmarks (does the JIT even kick in?). –  Apr 21 '12 at 14:57
  • Just have to say: IMHO this kind of benchmarking is archaic in today's environment. Multithreading is the word of the new millennium. So come to whatever conclusions you want here, but just understand that you are calculating numbers that may no longer be relevant – ControlAltDel Apr 21 '12 at 14:59
  • How about doing some actual useful work in your benchmark? The C++ probably realized that your test sucks, and didn't execute the loop at all. – CodesInChaos Apr 21 '12 at 14:59
  • For C# did you run in Visual Studio, or outside? Running with a debugger attached disables most optimizations. Just building in release mode is not enough. – CodesInChaos Apr 21 '12 at 15:02
  • @user1291492 I'm not sure what benchmarks you suggest instead. I agree that such microbenchmarks are useless for most applications – however, that applies to all micro-benchmarks, multi-threaded or not. And in those cases where micro-benchmarks are useful, the application probably still does heavy lifting on the individual cores, regardless of how many cores are used in parallel. Knowing how fast one can crunch numbers on one core still has some merit when you use N cores. Of course, in both cases you need to know how long you're actually calculating rather than waiting for I/O or a lock. –  Apr 21 '12 at 15:02
  • @user1291492 So I'm not allowed to have an interest in the performance of the languages/compilers? How do you multi-thread a deterministic state machine... you can't! – user997112 Apr 21 '12 at 15:07
  • @delnan Oh, you can definitely extract good information with benchmarking and multithreading. I just started using Node recently, and I ran a test where I was firing HTTP requests at Node from Java using different numbers of spawned Node clusters and numbers of threads in Java. It taught me a lot about the sweet spots both in Java and in Node – ControlAltDel Apr 21 '12 at 15:07
  • @delnan It's your run-of-the-mill "dial your score" benchmark. For Hotspot: the first ~10k iterations of the loop are run in the interpreter, then the JIT kicks in and removes the rest of the loop. No matter how large you make the loop, it'll always take approximately the same amount of time. But then I'm not exactly sure what the interest here is.. testing how fast an empty loop will be? Yep, I'm sure that problem will come up at least once a day while programming. – Voo Apr 21 '12 at 15:07
  • `DateTime.Now` is not useful for measuring time at that level. You're looking for [`Stopwatch`](http://msdn.microsoft.com/en-us/library/system.diagnostics.stopwatch.aspx). Also it doesn't look like you're using the QueryPerformanceCounter correctly. `(stop-start)/freq` = seconds. – user7116 Apr 21 '12 at 15:08
  • @CodesInChaos I originally ran in VS, I have just re-run outside and now the times are still 1918 milliseconds. – user997112 Apr 21 '12 at 15:09
  • @Voo I'm aware of how it's supposed to work. But since the C# benchmark does not improve at all, and the Java one doesn't come close to C++, I suppose the benchmarks are flawed in that they don't give the respective JITs the best conditions (the C++ benchmark is an AOT-compiler writer's dream). –  Apr 21 '12 at 15:09
  • @user997112 please I'm not trying to put you down about this. I'm just expressing my opinion about the value of different types of benchmarking... In the end, it's really just my sole opinion, so take it for what its worth – ControlAltDel Apr 21 '12 at 15:09
  • @sixlettervariables I think `DateTime.UtcNow` is perfectly fine for this kind of benchmarks, at least if you choose a reasonable iteration count, so the total time is >1 sec. In my experience random fluctuations for other reasons cause much bigger issues, and you need that high iteration count anyways, to minimize one-off effects like JIT overhead. – CodesInChaos Apr 21 '12 at 15:12
  • @user1291492, no offence taken at all. It's just I feel (to some degree) "multithreading" has been so overhyped and there are still plenty of scenarios where it cannot overly help. – user997112 Apr 21 '12 at 15:12
  • If you want good code generation with C#, you should look into Mono with LLVM backend. – CodesInChaos Apr 21 '12 at 15:14
  • I have just tried using the stopwatch and it made absolutely no difference. – user997112 Apr 21 '12 at 15:18
  • @delnan Ah sorry then. Looking at it more closely, I think the problem here is with OSR. We're JITing the inner loop, but that doesn't actually help us at all, because the loop actually has side effects (increase the variable). The JIT doesn't compile the part where it would notice that it can remove counter completely (actually since we're incrementing a static variable it probably couldn't then either). There's a reason why there are lots of rules for how to write correct micro benchmarks in JITed languages - and the above benchmark violates basically each of those. – Voo Apr 21 '12 at 15:19
  • @Voo, would you be able to point me to some resources where I could learn more about how the JIT optimizes? – user997112 Apr 21 '12 at 15:21
  • @user997112 Not really; I mean, I don't think there are many resources on how gcc optimizes code either. You'll have to read the source code if you're interested in that.. no idea if there are blogs about that specifically – [cliff](www.cliffc.org/blog/) mentions stuff somewhere, but that's a bit tangential. Otherwise: think about what exactly you want to test and look at the generated assembly – AFTER making sure you write a correct microbenchmark (search on SO, that comes up often enough) – Voo Apr 21 '12 at 15:26
  • JITs kicking in or not, the overall result is that over naive code the C++ optimizer does better. And that's a pretty important result, because programmers don't pay attention to the best-code-for-the-JIT/compiler 100% of the time. In other words, a naive benchmark could be a better model of a real-world program than a carefully crafted one. – dsign Apr 21 '12 at 15:36
  • @dsign, I am not offended but could you elaborate on "naive code". I didn't want anything elaborate, just a simple test I could duplicate for all 3 languages which was quick to write. What would be an un-naive test? – user997112 Apr 21 '12 at 15:38
  • @user997112 Actually, I was just thinking aloud: people usually oppose benchmarking C++ against Java and C#, saying that it is hard to get a smart and perfect benchmark. But I argue that *if* your benchmark were flawed, it still would be a perfectly valid real-world situation. I personally don't know enough to assert whether your benchmark is correct or not, and for the aforementioned reason, I take it as good enough. – dsign Apr 21 '12 at 15:43
  • The problem you have is that the code clearly doesn't do anything useful which means you don't have meaningful results. By this I mean it tells you nothing about real programs. If you believe the result from C++ you appear to be doing 6 bn iterations in 1.5 microseconds or 4 million per nano-second. Do you really believe your CPU is capable of doing that? Even the Java numbers suggest 12 operations per nano-second which is unlikely. – Peter Lawrey Apr 21 '12 at 19:07
  • @Peter Lawrey, no which is why I have posted on SO to ask these kind of things and learn what is actually happening to the code... – user997112 Apr 21 '12 at 20:26
  • The whole point of this is to see how the different languages compare with the SAME code. It does not matter if this code is "useless". The whole point is that the code is the same for the different languages and I am comparing how the compilers handle these changes. I have learnt that the Intel compiler is extremely good at optimising and just how the JIT affects performance. Why make things "realistic" when (especially in technology) "realistic" seems to change every X years. My simple for loops won't become updated because they are just an abstract/academic comparison. – user997112 Apr 21 '12 at 20:31
  • So long as you realise that you can only draw conclusions for similar programs. i.e. with a loop which doesn't do anything. – Peter Lawrey Apr 21 '12 at 20:39
  • @user997112 If you want to learn what each compiler does to the posted code that is one thing, but you threw in the idea of "benchmarking" the selected pieces of code which is why this Q has been such a hot item. If you are really interested in micro-benchmarks of the various languages I've posted a site that compares the top languages against each other with an eclectic suite of programs. If you are interested in the optimizations and actions take by the compilers and JIT then you shouldn't be worried about benchmarks with such a trivial code suite. – Andrew T Finnell Apr 21 '12 at 20:50
  • @Peter, I completely understand this doesn't show, for example, the full scope of C#'s ability. I wanted to start this question to try and see just how extensive my results were. They must show something because it's the same code. We have seen that the Hotspot JIT has done a better job than the .NET JIT, for example. – user997112 Apr 21 '12 at 20:56
  • @Andrew, whether you like what the code is doing or not, it IS a benchmark... just because it isn't fully testing the compilers/languages doesn't make it a lesser benchmark. If I want to buy a new processor I don't look at video game benchmarks, but I don't complain and say they aren't valid comparisons. They're just not what I would use. The very fact your link showed several different benchmark programs illustrates that comparing performance is a highly-opinionated game, or we would only have one benchmarking program. – user997112 Apr 21 '12 at 21:01

4 Answers


You've got a logical problem in your C++ code which uses QueryPerformanceFrequency:

PCFreq = double(li.QuadPart)/1000000000.0; // <- this is not correct
PCFreq = li.QuadPart;                      // <- this is correct

You should just assign li.QuadPart to PCFreq and do your conversion to milliseconds or nanoseconds in your printing code:

// convert from seconds to milliseconds
cout << GetCounter() * 1000.0 << endl;

With this change I get actual timings for your C++ code. Whether or not these timings are "valid" or useful in making comparisons, I will not comment.
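The same ticks-to-seconds arithmetic can be sketched in Java for clarity. The class name and the frequency value below are purely illustrative; the point is the `(stop - start) / frequency` formula from the answer:

```java
public class TickConversion {
    // Convert raw counter ticks to elapsed time, as QueryPerformanceCounter
    // requires: elapsed seconds = (stop - start) / ticksPerSecond.
    static double elapsedSeconds(long startTicks, long stopTicks, long ticksPerSecond) {
        return (double) (stopTicks - startTicks) / ticksPerSecond;
    }

    public static void main(String[] args) {
        // Illustrative numbers: a hypothetical 2.4 MHz counter that
        // advanced by 4800 ticks between start and stop.
        long freq = 2_400_000L; // ticks per second
        double s = elapsedSeconds(0L, 4_800L, freq);
        System.out.println(s * 1000.0 + " ms"); // prints "2.0 ms"
    }
}
```

Unit conversion happens only at print time, exactly as the corrected C++ code does with `GetCounter() * 1000.0`.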

user7116
  • Thanks sixlettervariables. I got that timing code from an example online and I guess it was just spurious results that made it look correct. I have modified the code and it now takes between 1 and 10 microseconds. – user997112 Apr 21 '12 at 15:26

1 - Sixlettervariables seems to have pointed out your mistake on this one

2 - Hotspot will optimize the code. Here is a similar question which also sees 10x speedups on loops, so what you see is expected output: First time a Java loop is run SLOW, why? [Sun HotSpot 1.5, sparc]

3 - I don't know enough about C# on this one to help. It's possible it doesn't optimize inner loops (you have 3 nested loops). Maybe try extracting the two loops you are testing into a completely separate method and see if that helps.

4 - DateTime represents a date and time, not high-precision timings, so it's not that accurate. DateTime.Now has roughly 10 ms resolution as far as I'm aware.
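The same resolution gap exists in Java, which makes it easy to demonstrate: a millisecond-granularity clock often reports 0 for sub-millisecond work, while the high-resolution clock does not. A rough sketch (exact resolutions are platform-dependent, so no specific numbers are claimed):

```java
public class ResolutionSketch {
    public static void main(String[] args) {
        // Coarse clock: millisecond granularity (analogous to DateTime.Now's
        // coarse timer ticks on Windows).
        long m0 = System.currentTimeMillis();
        // High-resolution clock (analogous to .NET's Stopwatch).
        long n0 = System.nanoTime();

        long sink = 0;
        for (int i = 0; i < 1000; i++) {
            sink += i; // a few microseconds of work
        }

        long coarse = System.currentTimeMillis() - m0; // usually reads 0
        long fine = System.nanoTime() - n0;            // a nonzero nanosecond count
        System.out.println("coarse=" + coarse + " ms, fine=" + fine
                + " ns, sink=" + sink);
    }
}
```

This is why timing a loop that finishes in microseconds with a coarse clock can produce either 0 or one full tick, essentially at random.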

(FYI, this post has some good explanations on the JIT, C# and C++ optimizations which may help you: C++ performance vs. Java/C#)

Bruce Lowe
  • No, Hotspot does indeed **not** optimize the code away – which you can easily test because the performance is constant regardless of the number of runs. I think that has two reasons: 1. we're using a non-private static variable (that could be remedied) and 2. OSR will only JIT the increment loop and not the outer blocks. This means even if we defined the counter as a local variable it would still not remove the loop. – Voo Apr 21 '12 at 19:26
  • @Voo, you've mentioned "OSR" a few times, what does this mean? – user997112 Apr 21 '12 at 20:32
  • @user997112 Oops, sorry, there I violated one of the basic principles again. OSR is On-Stack Replacement. [This article describes it simply for Java](http://java.sun.com/developer/technicalArticles/Networking/HotSpot/onstack.html). Basically a JITed method is only used after we've finished compiling it, the next time it is called. OSR instead compiles code that is currently running (e.g. a single loop) and swaps from interpreted to compiled code on the go. It has a couple of problems and is hardly ever useful in anything but micro benchmarks. – Voo Apr 21 '12 at 21:05
  • Very interesting! Does the Oracle website contain many of these articles describing the inner workings? – user997112 Apr 21 '12 at 21:40

I think it may be possible that the compiler calculates the value of counter at compile time and does not iterate through your loop at all.

I think a simple counter is a really bad benchmark.

By the way, try running the Java code within a method. It may be faster because of JIT optimization. (But I am not sure.)
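To illustrate the constant-folding point: the benchmark's entire loop nest only computes `3 * 200000000`, so an optimizer is free to replace it with a single constant. A sketch in Java (iteration count scaled down so the demo runs quickly):

```java
public class FoldSketch {
    // The benchmark's loop nest, scaled down: it only ever computes
    // the product of its two trip counts.
    static long viaLoop() {
        long counter = 0;
        for (int j = 0; j < 3; j++) {
            for (long i = 0; i < 2_000_000L; i++) {
                counter++;
            }
        }
        return counter;
    }

    // The closed form an optimizing compiler can derive at compile time,
    // skipping the loop entirely.
    static long folded() {
        return 3L * 2_000_000L;
    }

    public static void main(String[] args) {
        System.out.println(viaLoop() == folded()); // prints true
    }
}
```

Since both trip counts are compile-time constants and the body has no other effects, emitting the folded constant is a legal transformation, which is why the C++ timings can look impossibly fast.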

Christian Kuetbach
  • Very nice! First iteration now down to 5700 ms, second 8100ms and then the rest at ~670ms. Does that suggest there's a method overhead of about 200ms? I just re-ran and it's still averaging 670ms on the remaining iterations... – user997112 Apr 21 '12 at 15:16
  • I think the Hotspot compiler needs some time to find code which is worth compiling to machine code. – Christian Kuetbach Apr 21 '12 at 20:00
  • Are you suggesting that not all the code is compiled to machine code? If so, I do not understand... – user997112 Apr 21 '12 at 22:40

Benchmarks for the various compilers have already been done and put together very nicely:

The Computer Language Benchmark Games

Java 7 Server vs. GNU C++

C# Mono vs. GNU C++

C# Mono vs Java 7 Server

While Java 7 Server is faster than C# Mono 2.10.8, look at the amount of memory that Java 7 utilizes.

igouy
Andrew T Finnell
  • It looks nice, but be wary of the specific details. If you look at the code in the examples, many of the C# functions are sub-optimal, while the Java functions are heavily optimized. For example: in the tight loop in "MakeCumlative", Java gets dense arrays of floats (CPU-cache optimal) http://shootout.alioth.debian.org/u64q/benchmark.php?test=fasta&lang=java while C# gets a bloated array of objects http://shootout.alioth.debian.org/u64q/benchmark.php?test=fasta&lang=csharp. This is just one example; I'm sure it cuts both ways, but it's pretty clear it's not an apples-to-apples comparison – Glenn Aug 03 '12 at 21:41
  • @Glenn Also keep in mind the comparison is against Mono, not .NET. I've "heard" that .NET 4.0 is extremely fast. Of course it would be against their EULA to prove this. – Andrew T Finnell Aug 04 '12 at 01:20
  • Yeah I was reading some articles that talked about all the compiler tricks the MS.net team use to avoid things like bounds checking.. it sounded like a lot of work and like something I would expect mono to lag a bit on.. although mono did introduce true 64bit array support first so maybe in common primitive cases mono is more competitive. – Glenn Aug 04 '12 at 21:54