
I hear from colleagues that C++ is faster than Java and when looking for top performance, especially for finance applications, that's the route to go. But my observations differ a bit. Can anyone point out failures on my experiment or add some scientific variables to the discussion?

Note1: I am using -O3 (maximum optimization) and -O2 with the C++ compiler.

Note2: The short and simple complete source codes for each language are included. Feel free to run on your own machine, make changes, draw conclusions and share.

Note3: If you put both source codes side by side in an editor, you will see that their implementations are equivalent.

UPDATE: I've tried clang++ and g++ with a variety of optimization options (-O2, -O3, -Os, -march=native, etc.) and they all produced slower results than Java. I think at this point, to make C++ faster, I would have to dive into the generated assembly code and do some assembly programming. I'm wondering how practical this approach (assembly programming and assembly debugging) is when coding a large real-life application.

What does the benchmark do?

  • Create an int array on the heap (not on the stack)
  • Start the clock
  • Populate the array
  • Sort the array with bubble sort
  • Stop the clock

Do that 10 million times, discard the first 1 million for warming up and output the average, min and max time.

For C++ I get: (with -O3 and -O2)

$ g++ --version
g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1202 | Min Time: 1158 | Max Time: 212189

$ g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O2
$ ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1337 | Min Time: 1307 | Max Time: 36650

For Java I get:

$ java -version
java version "17.0.1" 2021-10-19 LTS
Java(TM) SE Runtime Environment (build 17.0.1+12-LTS-39)
Java HotSpot(TM) 64-Bit Server VM (build 17.0.1+12-LTS-39, mixed mode, sharing)

$ javac -cp . TimeBubbleSort.java
$ java -cp . TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 837.0 | Min Time: 812 | Max Time: 37196

Full C++ code:

#include <iostream>
#include <limits>
#include <sstream>
#include <ctime>    // clock_gettime, timespec (POSIX)
#include <cstdlib>  // malloc, free

using namespace std;

// TO COMPILE: g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
// TO EXECUTE: ./TimeBubbleSort 10000000 1000000 60

long get_nano_ts(timespec* ts) {
    clock_gettime(CLOCK_MONOTONIC, ts);
    return ts->tv_sec * 1000000000 + ts->tv_nsec;
}

struct mi {
   long value;
};

void swapping(int &a, int &b) {
   int temp;
   temp = a;
   a = b;
   b = temp;
}

void bubbleSort(int *array, int size) {
   for(int i = 0; i < size; i++) {
      bool swaps = false;
      for(int j = 0; j < size - i - 1; j++) {
         if(array[j] > array[j+1]) {
            swapping(array[j], array[j+1]);
            swaps = true;
         }
      }
      if (!swaps) break;
   }
}

void doSomething(int *array, int size) {

    for(int z = 0; z < size; z++) {
        array[z] = size - z;
    }

    bubbleSort(array, size);
}

int main(int argc, char* argv[]) {
    
    int iterations = stoi(argv[1]);
    int warmup = stoi(argv[2]);
    int arraySize = stoi(argv[3]);
    
    struct timespec ts;
    
    long long x = 0;
    long long totalTime = 0;
    int minTime = numeric_limits<int>::max();
    int maxTime = numeric_limits<int>::min();
    
    int * array = (int*) malloc(arraySize * sizeof(int));
    
    for(int i = 0; i < iterations; i++) {
    
        long start = get_nano_ts(&ts);

        doSomething(array, arraySize);  
        
        long end = get_nano_ts(&ts);
        
        for(int j = 0; j < arraySize; j++) {
            x += array[j];
        }

        int res = end - start;
        
        if (res <= 0) res = 1;
        
        if (i >= warmup) {
            totalTime += res;
            minTime = min(minTime, res);
            maxTime = max(maxTime, res);
        }
    }
    
    int count = iterations - warmup;
    
    double avg = (double) totalTime / count;
    
    cout << "Value computed: " << x << endl;
    
    stringstream ss;
    
    ss << "Iterations: " << count << " | Avg Time: " << avg;

    if (count > 0) {
        ss << " | Min Time: " << minTime << " | Max Time: " << maxTime;
    }
    
    cout << ss.str() << endl << endl;
    
    free(array);
        
    return 0;
}

Full Java code:

public class TimeBubbleSort {
    
    // javac -cp . TimeBubbleSort.java
    // java -cp . TimeBubbleSort 10000000 1000000 60
    
    private static void swapping(int[] array, int x, int y) {
        int temp = array[x];
        array[x] = array[y];
        array[y] = temp;
    }
    
    private static void bubbleSort(int[] array, int size) {
        for(int i = 0; i < size; i++) {
            int swaps = 0; // flag to detect any swap is there or not
            for(int j = 0; j < size - i - 1; j++) {
                if (array[j] > array[j + 1]) { // when the current item is bigger than next
                    swapping(array, j, j + 1);
                    swaps = 1;
                }
            }
            if (swaps == 0) break; // No swap in this pass, so array is sorted
        }
    }
    
    private final static void doSomething(int[] array, int size) {
        
        for(int z = 0; z < size; z++) {
            array[z] = size - z;
        }

        bubbleSort(array, size);
    }
    
    public static void main(String[] args) {
        
        int iterations = Integer.parseInt(args[0]);
        int warmup = Integer.parseInt(args[1]);
        int arraySize = Integer.parseInt(args[2]);
        
        long x = 0;
        long totalTime = 0;
        long minTime = Long.MAX_VALUE;
        long maxTime = Long.MIN_VALUE;
        
        int[] array = new int[arraySize];
        
        for(int i = 0; i < iterations; i++) {

            long start = System.nanoTime();
            
            doSomething(array, arraySize);
            
            long end = System.nanoTime();
            
            for(int j = 0; j < arraySize; j++) {
                x += array[j];
            }
            
            int res = (int) (end - start);
            
            if (res <= 0) res = 1;
            
            if (i >= warmup) {
                totalTime += res;
                minTime = Math.min(minTime, res);
                maxTime = Math.max(maxTime, res);
            }
        }
        
        int count = iterations - warmup;
        
        double avg = (double) totalTime / count;
        
        StringBuilder sb = new StringBuilder();
        
        sb.append("Value computed: ").append(x).append("\n");
        
        sb.append("Iterations: ").append(count).append(" | Avg Time: ").append(avg);

        if (count > 0) {
            sb.append(" | Min Time: ").append(minTime).append(" | Max Time: ").append(maxTime);
        }
        
        System.out.println(sb.toString() + "\n");
    }
}
  • Did you enable all possible speed optimization flags, when compiling the "c/c++" code?? – πάντα ῥεῖ Jun 24 '22 at 12:55
  • That's exactly what `-O3` does! So yes, I'm using maximum optimization for the C++ compiler. – SpeedChaser Jun 24 '22 at 12:57
  • I think that the measured execution time is dominated by the time measurement itself. Sorting a 60-element array takes ~1 µs, a similar time to the `clock_gettime()` syscall. Either run the sort multiple times between collecting timestamps or use a larger array. Something like 1000 should be enough. – tstanisl Jun 24 '22 at 13:03
  • @SpeedChaser `-O3` enables a bunch of optimizations, but `-march=native` might also give you some more SIMD extensions, not that the code is written in a way that could make use of SIMD – PeterT Jun 24 '22 at 13:04
  • The huge gap between "Min Time" and "Max Time" in both cases, while you are always sorting the same array, is a huge indicator that something else running on your machine is causing the delays sometimes, therefore the results are not significant. – Erich Kitzmueller Jun 24 '22 at 13:05
  • @ErichKitzmueller I ran the experiment several times and I always get the same results. I've also run them both with thread affinity, pinning the program to an isolated CPU core. Same results all the time. Whatever variance/jitter is being added to the C++ version is also being added to the Java version, leveling out the battle field. – SpeedChaser Jun 24 '22 at 13:09
  • @tstanisl The average time of each measurement is almost 1000 nanoseconds. The time it takes to get the timestamp is around 15-25 nanoseconds, for both languages. – SpeedChaser Jun 24 '22 at 13:11
  • @PeterT Same results with `-march=native`. But thanks for pointing that out! – SpeedChaser Jun 24 '22 at 13:15
  • Compare the compiled assembly of the C++ code and the JIT-compiled assembly of the Java code. We have to look at the machine code output for any "scientific" or "logical" discussion. – xiver77 Jun 24 '22 at 13:15
  • C++ may be slower than Java in some cases because Java uses a JIT, which is guided by profiling data. Therefore the Java VM may produce assembly better suited to the actual input data than a C++ compiler can. Does the performance gap stay the same if the array is populated with actually random data, not some decreasing sequence? – tstanisl Jun 24 '22 at 13:17
  • @SpeedChaser *Note1: I am using -O3 (maximum optimization) with the C++ compiler.* -- There is no mention of the version of the compiler you're using. In addition, C++ is a language specification -- change compiler brands and/or compiler version, and you may get different results. -- *The short and simple complete source codes for each language are included* -- There is no such standard C++ function as `clock_gettime`. Use ``. – PaulMcKenzie Jun 24 '22 at 13:19
  • @tstanisl Thanks! Problem is: if you populate with random data then you cannot compare because the time to sort each different/random array will be different. – SpeedChaser Jun 24 '22 at 13:21
  • @SpeedChaser that's not entirely true. Good algorithms should show a determinable average speed independent of the concrete input datasets. – πάντα ῥεῖ Jun 24 '22 at 13:23
  • @user17732522 You don't have to use `-O2`, just disable `-ftree-vectorize`, but in this case, `-O3` is likely to produce faster (but bloated) code. – xiver77 Jun 24 '22 at 13:25
  • @user17732522 @xiver77 Same results with `-O2` – SpeedChaser Jun 24 '22 at 13:28
  • A C++ programmer would use `std::swap`, and not write their own `swapping` function. That alone could cause a difference in timing. – PaulMcKenzie Jun 24 '22 at 13:28
  • BTW gcc 7 is a quite old version these days (the latest is gcc 12). – xiver77 Jun 24 '22 at 13:28
  • "C++ is always faster than Java" - the real world is not that simple and this is not always true. – Jesper Jun 24 '22 at 13:30
  • `timespec` **has** a field that holds nanoseconds; that does not mean that the **clock** has a **resolution** in nanoseconds. `if (res <= 0) res = 1;` in the code is a pretty good indicator that there are jitter problems. The results of these time measurements simply are not reliable. – Pete Becker Jun 24 '22 at 13:30
  • @Jesper But there is always a way to make C(++) faster than Java. – xiver77 Jun 24 '22 at 13:31
  • Please, if you have any time to write the comments, show us the assembly output from gcc and JVM's JIT compiler. That's the only way to give you a clear answer. – xiver77 Jun 24 '22 at 13:39
  • Any time somebody states an absolute like "always faster" you already know it is wrong. – Goswin von Brederlow Jun 24 '22 at 13:40
  • @Pete Becker I removed the line you did not like => same results. I don't believe your point is valid. The time measurements are reliable. If you disagree it would be nice to fix it and present new source code with results. Not saying you have to do that, but it would be nice :) – SpeedChaser Jun 24 '22 at 13:43
  • Also when you have this much variance in the measurement, you might want to try and disable dynamic frequency scaling on your processor. To get some more consistent times. – PeterT Jun 24 '22 at 13:43
  • @GoswinvonBrederlow Agree – in given case *'never slower'* might have some stand, though, as one might include in the C++ code the assembly generated by the JVM – and if java again beats C++ one might repeat ;) – Aconcagua Jun 24 '22 at 13:46
  • @Aconcagua You can write both C++ and java code arbitrarily slow. That says more about the quality of your code and algorithms than the language. – Goswin von Brederlow Jun 24 '22 at 13:53
  • In the inner loop array elements are swapped by copying one value to a temp, the second to the first, the temp to the second. I wonder if the compiler (both C++ and java) are smart enough to notice that if you swap twice in a row the temp variable doesn't need to be written and reloaded between the swap. You can do: `temp = a[i]; a[i] = a[i+1]; a[i+1] = a[i+2] ... a[i+k] = temp;` for a whole run of swaps. Or looping `[a[i], temp] = minmax(temp, a[i]);` – Goswin von Brederlow Jun 24 '22 at 13:58
  • @PaulMcKenzie Java is also a language specification. If you change JVM brands and/or Java version you might also get different results. – David Conrad Jun 24 '22 at 15:18
  • @DavidConrad -- Then that's more of a reason why (I believe) this entire exercise is a moot point. – PaulMcKenzie Jun 24 '22 at 15:19
  • `-O3` doesn't always produce the fastest code. On x86_64 I've seen `-Os` (optimize code size) produce faster code at times. – David Conrad Jun 24 '22 at 15:19
  • I think gcc does some really weird stuff with this code, I'm seeing like an order of magnitude difference on my machine between gcc11 and clang 14. I have gcc `Avg Time: 9077.12` clang `Avg Time: 656.601` and java 17 `Avg Time: 791.0` (with java having a slightly lower max time) – PeterT Jun 24 '22 at 15:54
  • gcc 9 is giving me a more reasonable `Avg Time: 1149.49`, might be something not quite right in the gcc processor model again – PeterT Jun 24 '22 at 15:57
  • @PeterT Thanks for trying with different versions. It has to be c++ so, **g++** compiler, correct? Can you point out the C++ compiler which gives you the best results? – SpeedChaser Jun 24 '22 at 16:07
  • Yeah, I mean `g++-11` when I say gcc 11. The best result is what I got from clang 14 (`clang++-14`) – PeterT Jun 24 '22 at 16:13
  • @Aconcagua -- whoops, my example was utter nonsense. Randomly starting timing tests doesn't give the result I claimed. Deleted. Nevertheless, measuring time intervals that are close to the resolution of the timer is inherently unreliable. – Pete Becker Jun 24 '22 at 17:19
  • @PeteBecker Fully agreeing on the *'nevertheless'* ;) – Aconcagua Jun 24 '22 at 17:32
  • @PeteBecker Agree but I'm not doing this => *Nevertheless, measuring time intervals that are close to the resolution of the timer is inherently unreliable.* – SpeedChaser Jun 24 '22 at 23:14
  • You probably don't need to write by hand in asm, just write C++ source that encourages the compiler to make better asm. e.g. use a temp variable from last iteration instead of accessing `data[j]`. Of course, *looking* at the compiler's asm output is a good way to see if your changes are having any effect, or having the desired effect. See [Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly?](https://stackoverflow.com/q/40354978) for some comments about hand-holding the compiler into making better asm vs. actually hand-writing asm. – Peter Cordes Jun 25 '22 at 11:35

2 Answers


In both cases the array is filled with numbers in descending order and then bubble sorted, so the algorithm should behave the same in both languages. But the way you allocate the array is different.

In C++ you malloc, which just causes the kernel to record that you have requested some memory, but doesn't map any physical memory to the addresses. So once the clock starts and you start initializing the array, every page causes a page fault, and the kernel then maps a physical page at the right address.

In Java, though, allocating the array also initializes it to 0. This causes all the physical pages to be mapped, and the memory is now in the CPU cache. So when you start the clock, the initialization of the array is much faster.

But I guess that is what the warmup should take care of.


That said, your test method is flawed. The C++ compiler could optimize the whole loop away, with the exception of the get_nano_ts() calls. So your C++ code would basically become:

for(int i = warmup; i < iterations; i++) {
    long start = get_nano_ts(&ts);
    long end = get_nano_ts(&ts);
    x += (long long) arraySize * (arraySize + 1) / 2;
    int res = end - start;
    if (res <= 0) res = 1;
    totalTime += res;
    minTime = min(minTime, res);
    maxTime = max(maxTime, res);
}

This would be very close to minTime = 1; maxTime = 1; totalTime = iterations - warmup;

Why do you count a time of 0 as 1? If the sorting doesn't even take a nanosecond, you should abort, because your test case is by far too small for the accuracy of your clock.


Let's discuss the results you measured a bit:

C++: Iterations: 9000000 | Avg Time: 1202 | Min Time: 1158 | Max Time: 212189
Java: Iterations: 9000000 | Avg Time: 837.0 | Min Time: 812 | Max Time: 37196

You sort the exact same array with exactly the same numbers 9000000 times. So the algorithm should behave the same every time, and on its own every single run should take the exact same time. And yet the time you measure differs by more than 2 orders of magnitude. That is, the sort took 200 times longer in some cases than in others (40 times for Java).

Let's see what happens when I repeat the test multiple times:

# g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 11155 | Min Time: 10950 | Max Time: 315173
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 11163 | Min Time: 10950 | Max Time: 234000
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 11320 | Min Time: 10950 | Max Time: 334208

Just doing multiple runs shows the max time changing by 50%. At least the min time and avg time are relatively stable. So it seems that occasionally the OS interrupts the process and shuffles it to a different CPU core, causing all the caches to be lost, and then execution time suffers.

Let's play with the compiler flags a bit:

# g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O2
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 2320 | Min Time: 2194 | Max Time: 75442
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 2375 | Min Time: 2194 | Max Time: 199976

Hey, optimizing less makes it 5 times faster. Note that originally Java was just barely faster than C++; surely the C++ code must now beat Java handily.

Let's go even further:

# g++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -Os
./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 2447 | Min Time: 2254 | Max Time: 234702

Optimizing for size barely makes a difference in speed, if at all; I would say it's below the noise level. It might just be an artefact of different cache alignment of the code or something.

Or let's try clang++:

# clang++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -Os
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 2177 | Min Time: 2104 | Max Time: 189857

# clang++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O2
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1479 | Min Time: 1293 | Max Time: 236334

# clang++ TimeBubbleSort.cpp -o TimeBubbleSort -std=c++11 -O3
# ./TimeBubbleSort 10000000 1000000 60
Value computed: 18300000000
Iterations: 9000000 | Avg Time: 1080 | Min Time: 1011 | Max Time: 135706

Reading back over my answer, I totally missed pointing out that with gcc, -O3 code is frequently slower than -O2. For the most part, the reason a lot of optimizer options are in -O3 is that they are not generally faster; otherwise they would be in -O2 (excluding any experimental optimizations that aren't considered stable enough yet).

Don't use -O3 unless you have tested that it is actually beneficial, and then be very selective about which parts of the code you compile with -O3.

Looking at the clang++ output makes me rethink this. Different compiler, different optimizer, different behavior for -Os / -O2 / -O3.

Now the real work begins: What code do the compilers generate that makes such a difference? 5 times slower for "gcc -O3" and twice as fast for "clang++ -O3".

For my GCC11, the answer is "Bubble sort slower with -O3 than -O2 with GCC". The -O3 slowdown here is a pretty specific anti-optimization that would often help, or at least not hurt much, but here it hurts a lot in a bubble sort that doesn't keep array[j+1] around in a temporary to be next iteration's array[j], instead reloading it from memory as part of a pair of loads that it does with one wide load, creating a store-forwarding stall.

Your GCC version doesn't have that problem, though; only GCC11 and newer do. So don't expect a big speedup; your GCC7 -O3 should already be making asm with no major problems, except for possible things like "How can I mitigate the impact of the Intel jcc erratum on gcc?" if you're using a Skylake CPU.

(Store and reload of both elements will still create a loop-carried dependency when bubbling an element from one end of the array to the other, though, worse than just updating a register for the next iteration.)

Whatever clang is doing is better than GCC's best version, though, so you can probably get a big speedup with that.

– Goswin von Brederlow (edited by Peter Cordes)
  • Do you mind fixing my flawed test method? What would you write instead? I would be happy to re-run my tests with your modification. But *NO*, the compiler will not optimize away the loop. It is an operation (sorting) not a math calculation. – SpeedChaser Jun 24 '22 at 13:58
  • @SpeedChaser You have to trick the optimizer with something like `volatile asm("" ::: memory);` or similar construct to forcefully block optimization at the right points. That your c++ compiler does not optimize the whole bubblesort away is a shortfall of the optimizer implementation. It could if it would invest enough time and brains into it. You also should measure noise from different memory layouts and stack randomization. Currently you measure one thing a million times once. The next time you run it stack randomization could make the code 40% faster or slower. – Goswin von Brederlow Jun 24 '22 at 14:06
  • If the C++ compiler optimising the loop away was indeed the reason then C++ should be much faster, shouldn't it? Would the *Java JIT* compiler possibly do so while C++ compiler *not*? – Aconcagua Jun 24 '22 at 14:13
  • yes and yes. You now are comparing the intelligence of the optimizers. Although JIT tends to do less optimizing because spending 1 hour just-in-time compiling something would be insane. So I would expect C++ to optimize the loop away before Java as optimizers become more clever. Remember I said **could**. Not that it will. You actually have to look at the compiler output for both C++ and Java to see what the compiler actually does to your code. – Goswin von Brederlow Jun 24 '22 at 15:02
  • Randomizing is not possible because then effectively you will be measuring two different sorting operations. The array elements for Java and the array elements for C++ must be the same for equal sorting time. We can talk here for days, weeks, months, years (not centuries). **Isn't it better to just change the code and/or compiler options to just PROVE someone's point?** I've taken the time to state my point with code. – SpeedChaser Jun 24 '22 at 15:22
  • @SpeedChaser randomizing the stack and memory layout. Not the data. – Goswin von Brederlow Jun 24 '22 at 15:28
  • I'm discarding the first 1 million runs, so this point is not valid => *So once the clock starts and you start initializing the array every page causes a page fault and then the kernel maps a physical page ad the right address.* – SpeedChaser Jun 24 '22 at 15:53
  • Nice work man! Can you point out the C++ compiler you have used which gives you the best results? With the corresponding compiler options? On my side, when I compile with `-O2` I get very similar results to `-O3`. – SpeedChaser Jun 24 '22 at 16:10
  • I don't have enough compiler versions installed (or time) to say which is best. I just hope it comes across that the way you use the compilers and how you measured has >10 times the effect than the choice of language. I'm sure you can create similar variance by using different options for Java. And I haven't even done work on excluding environmental factors. You really have to fuzz with the stack, heap and code alignment to benchmark code and then calculate the probability of the null hypothesis. – Goswin von Brederlow Jun 24 '22 at 16:15
  • You said it is faster with `-O2`. For me it was slower (see my edit). It makes sense to assume **latest version** for both Java and C++. I'm trying to install the latest version of `g++` on my machine. – SpeedChaser Jun 24 '22 at 16:16
  • @SpeedChaser: BubbleSort being slower with `-O3` than `-O2` is a specific case of GCC shooting itself in the foot. [Bubble sort slower with -O3 than -O2 with GCC](https://stackoverflow.com/q/69503317) applies to recent GCC versions like Goswin is using (specifically GCC11 and 12), but not to GCC7.5. `-O3` *is* generally supposed to be faster, but this is a GCC missed-optimization bug. (Or anti-optimization). – Peter Cordes Jun 24 '22 at 18:35
  • At least for my compiler `-O2` was not slower than `-O3`. And both are slower than Java. I'm trying to get the latest version of g++ and clang compilers to test. – SpeedChaser Jun 24 '22 at 21:00
  • @SpeedChaser: That's expected; my last comment explains why Goswin's GCC11 sees a big slowdown for `-O3`, but your GCC7 makes about the same asm for this with `-O2` and `-O3`, not introducing store-forwarding stalls. For clang to run faster, maybe it's keeping a value in a register, noticing that `A[i+1]` this iteration is next iteration's `A[i]`. That might make a big difference to BubbleSort performance, especially when a lot of elements have to move far in one direction (swapping every iteration of the inner loop) by shortening the dependency chain. – Peter Cordes Jun 24 '22 at 23:26
  • @SpeedChaser: Remember that CPUs run machine code; it all comes down to what asm / machine code Java or any GCC version makes for the loops. (Assuming warm-up issues have been taken care of.) It doesn't matter how a given sequence of machine code is generated, only that it exists in memory. So I wouldn't spend a huge amount of time benchmarking different compiler versions on this one very specific problem; quirks of different code-gen choices that happen to be good or bad for some microarchitectural detail of your CPU in this exact code won't tell you much in general. – Peter Cordes Jun 24 '22 at 23:32
  • @SpeedChaser: Of course if you want to actually look at the generated asm and https://agner.org/optimize/, and HW performance counters like `idq_uops_not_delivered.core` front-end bottlenecks or `resource_stalls.any` with `perf stat`, then you could maybe see what asm is more or less efficient on your specific CPU. Also the JCC erratum can be highly relevant for Intel Skylake CPUs; [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) – Peter Cordes Jun 24 '22 at 23:34
  • @GoswinvonBrederlow: I edited your answer to include the fact that the OP's GCC7 doesn't have anything to gain from -O2 vs. -O3, the -O3 slowdown is a known thing and limited to GCC11 and later. – Peter Cordes Jun 25 '22 at 07:30
  • I didn't edit the parts that tell people not to use `-O3`, even though I disagree with them. It used to be that `-O3` was likely to cause slowdowns, but these days that's rarer. A long time ago, GCC -O3 used to include `-funroll-loops`, but now that's only enabled by default with PGO. And with SIMD being more beneficial more often (with wider CPUs), it's often good to enable that. (Although GCC12 and later enable `-ftree-vectorize` even at `-O2`, because of SIMD being more worthwhile.) Not a bad idea to check if `-O2` is faster, but your answer is overly negative about `-O3`. – Peter Cordes Jun 25 '22 at 07:31
  • @PeterCordes I don't believe so. As you just demonstrated with `-ftree-vectorize` in gcc, stuff that's generally beneficial and has proven out for some time in `-O3` gets moved to `-O2`. The way you should operate is to default to `-O2` and check if `-O3` is faster where it matters. And the most important word in that sentence is **check**. – Goswin von Brederlow Jun 25 '22 at 08:32

Given how good the JIT is at performing dead-code elimination, I strongly suggest rewriting both benchmarks using proper benchmark harness tools, e.g. https://github.com/openjdk/jmh for the Java side.

  • I know people are busy and stuff, but would you mind doing that and posting the code here for the community. However **I really don't think any code is being eliminated by the JIT**. – SpeedChaser Jun 26 '22 at 11:00
  • I agree 100% with Francesco. Writing microbenchmarks is very hard and the JIT is very smart at removing code that doesn't matter. That is why JMH has support for blackholes that prevent the JIT from removing code. Especially if you get weird results, the first thing I will distrust is the benchmark. – pveentjer Jun 26 '22 at 11:33
  • But I'm not getting weird results. The results are consistent. The array is always sorted correctly. Moreover I don't see the point: **JIT is optimizing and It is free to do whatever it wants as much as the C++ compiler is.** We can talk here for hours, days, weeks, years, decades (but not centuries). It would be nice to prove your point with code and results. I've taken the time to write and provide full source codes. – SpeedChaser Jun 26 '22 at 12:00
  • What is the purpose of 'x'? Since the side effects of changes to that variable are not used, the JIT can remove that logic (including the loop). – pveentjer Jun 26 '22 at 12:13
  • For the same reason, the whole array sorting can be optimized out. – pveentjer Jun 26 '22 at 12:18
  • `x` is being printed out, so we can be sure both programs are generating the same output. Therefore no, you are incorrect. The whole array sorting will not be optimized out. If that was the case then the program would execute in less than 2 nanoseconds. You missed the line that prints `x` to stdout. Rather than trying to discuss I would suggest you modify the code to prove me wrong. It is easier, simpler and quicker for both parties. My code has been tested and reviewed by a lot of people, so no, there is nothing wrong with either code. Nothing is being removed by the compiler unfairly. – SpeedChaser Jun 26 '22 at 14:13
  • The whole point of using JMH is not just about dead code elimination, but to have a reliable and correct env where warmup is correctly accounted, no constant folding is happening etc etc and most importantly, where you can easily inspect assembly to find what's going on for real. – Francesco Nigro Jun 26 '22 at 17:32
  • *Did you read the short source code provided?* Again there is no dead code elimination taking place and **I AM DOING WARMUP**, the first 1 million measures are discarded. I think JMH is unnecessary and totally outside the scope of this question, which is why Java performs better than C++ for bubble sort. If you disagree and want to prove this point wrong, you must provide an equivalent C++ implementation, with your chosen compiler and optimization compiler options. I would be happy to run on my side and compare with the Java alternative. – SpeedChaser Jun 26 '22 at 18:51
  • If you really want to know why under your particular example the Java code is faster than the C++ code I would suggest using some kind of profiler like perf/vtune (both can be used for Java and of course C++) and see what is going on. Also, have a look at the generated assembly of both Java and C++. – pveentjer Jun 27 '22 at 00:53