
I recently wrote a computation-intensive algorithm in Java, and then translated it to C++. To my surprise the C++ executed considerably slower. I have now written a much shorter Java test program, and a corresponding C++ program - see below. My original code featured a lot of array access, as does the test code. The C++ takes 5.5 times longer to execute (see comment at end of each program).

Conclusions after the first 21 comments below ...

Test code:

  1. `g++ -o ...`: Java 5.5 times faster
  2. `g++ -O3 -o ...`: Java 2.9 times faster
  3. `g++ -fprofile-generate -march=native -O3 -o ...` (run, then `g++ -fprofile-use` etc.; full sequence shown below): Java 1.07 times faster
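
For reference, the full sequence for compilation 3 (per the comments below - note the capital O in `-O3`, and that `-fprofile-use` replaces `-fprofile-generate` on the second compile) was approximately:

    g++ -fprofile-generate -march=native -O3 -o cpp_test cpp_test.cpp
    ./cpp_test        # training run; writes the .gcda profile data
    g++ -fprofile-use -march=native -O3 -o cpp_test cpp_test.cpp
    ./cpp_test        # timed run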

My original project (much more complex than the test code; same three compilations as above):

  1. Java 1.8 times faster
  2. C++ 1.9 times faster
  3. C++ 2 times faster

Software environment:
    Ubuntu 16.04 (64 bit).
    Netbeans 8.2 / jdk 8u121 (java code executed inside netbeans)
    g++ (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
    Compilation: g++ -o cpp_test cpp_test.cpp

Java code:

public class JavaTest {
    public static void main(String[] args) {
        final int ARRAY_LENGTH = 100;
        final int FINISH_TRIGGER = 100000000;
        int[] intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        boolean finished = false;
        long loopCount = 0;
        System.out.println("Start");
        long startTime = System.nanoTime();
        while (!finished) {
            loopCount++;
            intArray[i]++;
            if (intArray[i] >= FINISH_TRIGGER) finished = true;
            else if (i < (ARRAY_LENGTH - 1)) i++;
            else i = 0;
        }
        System.out.println("Finish: " + loopCount + " loops; " +
            ((System.nanoTime() - startTime)/1e9) + " secs");
        // 5 executions in range 5.98 - 6.17 secs (each 9999999801 loops)
    }
}

C++ code:

//cpp_test.cpp:
#include <iostream>
#include <ctime>  // clock_gettime, timespec
int main() {
    const int ARRAY_LENGTH = 100;
    const int FINISH_TRIGGER = 100000000;
    int *intArray = new int[ARRAY_LENGTH];
    for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
    int i = 0;
    bool finished = false;
    long long loopCount = 0;
    std::cout << "Start\n";
    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    long long startTime = (1000000000*ts.tv_sec) + ts.tv_nsec;
    while (!finished) {
        loopCount++;
        intArray[i]++;
        if (intArray[i] >= FINISH_TRIGGER) finished = true;
        else if (i < (ARRAY_LENGTH - 1)) i++;
        else i = 0;
    }
    clock_gettime(CLOCK_REALTIME, &ts);
    double elapsedTime =
        ((1000000000*ts.tv_sec) + ts.tv_nsec - startTime)/1e9;
    std::cout << "Finish: " << loopCount << " loops; ";
    std::cout << elapsedTime << " secs\n";
    // 5 executions in range 33.07 - 33.45 secs (each 9999999801 loops)
}
Bill B
  • Did you turn optimizations on? – Galik Jun 03 '17 at 10:33
  • And 64 bit compilation? – lalala Jun 03 '17 at 11:12
  • I compiled with "g++ -o cpp_test cpp_test.cpp". Could I modify that somehow? – Bill B Jun 03 '17 at 11:41
  • @BillB Yes, for release code you should at least turn optimization on: `g++ -O3 -o cpp_test cpp_test.cpp`. However, on my system Java is still a little faster than C++ for this test, and I'd also like to know why :) It may be that the Java optimizer noticed an opportunity that the C++ optimizer didn't; it is overly simple code, after all. – Galik Jun 03 '17 at 11:53
  • For some reason `clang++` does really badly in this test. I get `Java = 9 secs`, `g++ = 14 secs` and `clang++ = 21 secs`. – Galik Jun 03 '17 at 12:08
  • @Galik What version of GCC and Clang did you use? Maybe the JRE noticed an opportunity to use vector instructions. Most C++ compilers won't vectorize or use instruction set extensions unless you explicitly tell them to. The JRE can probably check available instruction sets at runtime. – tambre Jun 03 '17 at 13:26
  • @tambre I'm using `GCC 6.3.1` and `clang 3.9.1` but I changed the array to a stack array rather than a dynamically allocated one. – Galik Jun 03 '17 at 13:45
  • Try outputting `intArray` to stdout at the very end in both Java and C++, as the optimizer may remove array-related code if that doesn't change observable behaviour. – zett42 Jun 03 '17 at 14:40
  • That might be the cause. Not quite sure of the exact effects in Java, but maybe adding `volatile` will change how the optimizer behaves? Also, it would be nice if the C++ version used `std::chrono` so it can be compiled also on non-Unix systems. (A sketch combining this and zett42's suggestion follows the comment thread below.) – tambre Jun 03 '17 at 14:47
  • I tried this with Java vs clang and for me clang is faster on your sample code. Considering the difference you observed, you probably forgot to activate optimizations for c++ – Guillaume Gris Jun 03 '17 at 18:29
  • @Guillaume Gris: which clang version and O level? Java is faster for me compared with clang 3.5 – lalala Jun 03 '17 at 19:08
  • C++ code on MSVC with max optimizations, on a 64-bit Intel Core Quad Q6600 @ 2.6GHz: 15-16s. I was surprised, though, to see that changing your boolean variable to a `break` makes it 50% slower. – Sopel Jun 03 '17 at 19:13
  • @lalala java 8 and clang 3.8 -o2. To be clear, the difference I observe is actually quite small, as one would expect on such an example – Guillaume Gris Jun 03 '17 at 21:47
  • @Galik Thanks. Yes, -O3 makes a significant difference: JavaTest is now only 2.9 times faster than cpp_test. But unfortunately it makes no perceptible difference in my original project. There, Java remains 1.8 times faster. As you say, my test code is overly simple. – Bill B Jun 03 '17 at 22:08
  • @tambre By all means change the C++ version to use std::chrono if you wish. – Bill B Jun 03 '17 at 22:21
  • Using profiling information (which java does by default) lets g++ match the speed of java, for the version using break instead of the variable `finished`. – Marc Glisse Jun 03 '17 at 22:32
  • @Marc Glisse: How does one do that? – Bill B Jun 03 '17 at 22:34
  • @BillB Compile once adding the flag -fprofile-generate, then run the program, then recompile with -fprofile-use, then run again. By the way, you may also want to add `-march=native`. – Marc Glisse Jun 03 '17 at 22:36
  • @Marc Glisse: I did "g++ -fprofile-generate -march=native -o3 -o cpp_test cpp_test.cpp"; run; above with -fprofile-use; run. Final execution time was the same as my original compilation (see near top of page), and much slower than using only -o3 – Bill B Jun 03 '17 at 22:55
  • It is `-O3` with a capital O. And (not sure if you did that right) -fprofile-use replaces -fprofile-generate. – Marc Glisse Jun 03 '17 at 22:58
  • @Marc Glisse: Yes - that makes cpp_test execution essentially the same as JavaTest. And in my original project, it makes C++ twice as fast as Java! Thank you very much. (I had -fprofile-use correct, but was confused re -O3.) – Bill B Jun 03 '17 at 23:16
  • @Galik: In my previous reply to you I was mistaken about the effect of -O3 on my original code (I mistakenly thought that -o3 and -O3 were interchangeable). See "conclusions after ..." above, since added to my original post. – Bill B Jun 04 '17 at 06:20
  • @BillB put code and command line options in `backticks` like this to make it readable. And you should also try `-O2` because [`-O3` is not always faster](https://stackoverflow.com/q/28875325/995714). It does [more inlining](https://stackoverflow.com/q/5637828/995714) and some other aggressive optimizations which sometimes make cache miss occurs more often. Also try [enabling autovectorization in GCC](https://stackoverflow.com/a/409302/995714) – phuclv Jun 04 '17 at 07:52
  • @LưuVĩnhPhúc `-O3 -march=native` (and `-ffast-math` if handling floats) is a more efficient way to enable autovectorization than your link. In your other link, using profile information removes the (random) advantage of -O2 over -O3. – Marc Glisse Jun 04 '17 at 08:19
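
Following the comment suggestions above from zett42 (print the array so the optimizer cannot drop the work) and tambre (use `std::chrono` so the program also builds on non-Unix systems), here is a minimal sketch of a revised cpp_test.cpp; the names are kept from the question, and it needs -std=c++11 or later:

    // cpp_test_chrono.cpp - sketch only: same loop as the question, but timed
    // with std::chrono and with intArray printed at the end so the array work
    // is observable behaviour and cannot be optimized away.
    // Compile with e.g.: g++ -std=c++11 -O3 -o cpp_test_chrono cpp_test_chrono.cpp
    #include <chrono>
    #include <iostream>

    int main() {
        const int ARRAY_LENGTH = 100;
        const int FINISH_TRIGGER = 100000000;
        int *intArray = new int[ARRAY_LENGTH];
        for (int i = 0; i < ARRAY_LENGTH; i++) intArray[i] = 1;
        int i = 0;
        bool finished = false;
        long long loopCount = 0;
        std::cout << "Start\n";
        auto startTime = std::chrono::steady_clock::now();
        while (!finished) {
            loopCount++;
            intArray[i]++;
            if (intArray[i] >= FINISH_TRIGGER) finished = true;
            else if (i < (ARRAY_LENGTH - 1)) i++;
            else i = 0;
        }
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - startTime;
        std::cout << "Finish: " << loopCount << " loops; "
                  << elapsed.count() << " secs\n";
        // Print the array so its final contents are observable behaviour.
        for (int j = 0; j < ARRAY_LENGTH; j++) std::cout << intArray[j] << ' ';
        std::cout << '\n';
        delete[] intArray;
    }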

2 Answers

5

The only time I could get the C++ program to outperform Java was when using profiling information. This shows that there's something in the runtime information (that Java gets by default) that allows for faster execution.

There's not much going on in your program apart from a non-trivial if statement - one where, without analysing the entire program, it's hard to predict which branch is most likely to be taken. This leads me to believe that this is a branch-misprediction issue. Modern CPUs use instruction pipelining, which allows for higher throughput, but this requires a prediction of which instructions will execute next. If the guess is wrong, the instruction pipeline must be flushed and the correct instructions loaded in, which takes time.

At compile time, the compiler doesn't have enough information to predict which branch is most likely. CPUs do a bit of branch prediction as well, but that is generally along the lines of assuming that loops loop and that ifs take the if branch (rather than the else).

Java, however, has the advantage of being able to use information at runtime as well as compile time. This allows Java to identify the middle branch as the one that occurs most frequently and so have this branch predicted for the pipeline.
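
For illustration, this is the kind of compiler-specific hint asked about in the comments below: a sketch of the question's loop using GCC/Clang's `__builtin_expect` to mark the exit branch as unlikely. Note that Marc Glisse reports in the comments that such hints did not help significantly here.

    // Sketch only: the loop from cpp_test.cpp with explicit branch-probability
    // hints. __builtin_expect(expr, c) tells GCC/Clang that expr usually
    // evaluates to c; it does not change the value of the expression.
    while (!finished) {
        loopCount++;
        intArray[i]++;
        if (__builtin_expect(intArray[i] >= FINISH_TRIGGER, 0))   // rare: exit
            finished = true;
        else if (__builtin_expect(i < (ARRAY_LENGTH - 1), 1))     // common case
            i++;
        else                                                      // 1 time in 100
            i = 0;
    }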

Dunes
  • Does adding a compiler-specific hint about the unlikeliness of the first if-statement help? – tambre Jun 04 '17 at 11:04
  • @Dunes Modern CPUs in fact do a lot of branch prediction. They keep a history of conditional jumps: https://en.wikipedia.org/wiki/Branch_predictor#Global_branch_prediction – Andriy Berestovskyy Jun 05 '17 at 07:24
  • @tambre when I tried, adding __builtin_expect on the various `if` did not help significantly (depending on the value guessed, and I tried all combinations, it sometimes hurt a little). – Marc Glisse Jun 05 '17 at 10:59
4

Somehow both GCC and clang fail to unroll this loop and pull the invariants out of it, even at `-O3` and `-Os`, but the Java JIT does.

Java's final JITted assembly code is similar to this (in reality repeated twice):

    while (true) {
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) break;
        loopCount++;
        if (++intArray[i++] >= FINISH_TRIGGER) { if (i >= ARRAY_LENGTH) i = 0; break; }
        if (i >= ARRAY_LENGTH) i = 0;
    }

With this loop I'm getting the exact same timings (6.4s) for C++ and Java.

Why is it legal to do this? Because ARRAY_LENGTH is 100, which is a multiple of 4, so the wrap-around check only needs to run once per group of 4 increments: starting from 0, i reaches exactly 100 at one of those checkpoints and is reset to 0 there, so it can never index past the end of the array in between.

This looks like an opportunity for improvement for GCC and clang; they fail to unroll loops for which the total number of iterations is unknown, but even if unrolling is forced, they fail to recognize parts of the loop that apply to only certain iterations.

Regarding your findings in more complex code (a.k.a. real life): Java's optimizer is exceptionally good for small loops (a lot of thought has been put into that), but Java loses a lot of time on virtual calls and GC.

In the end it comes down to machine instructions running on a concrete architecture; whoever comes up with the best set wins. Don't assume the compiler will "do the right thing": look at the generated code, profile, repeat.

For example, if you restructure your loop just a bit:

    while (!finished) {
        for (i=0; i<ARRAY_LENGTH; ++i) {
            loopCount++;
            if (++intArray[i] >= FINISH_TRIGGER) {
                finished=true;
                break;
            }
        }
    }

Then C++ will outperform Java (5.9s vs 6.4s). (revised C++ assembly)

And if you can allow a slight overrun (increment more intArray elements after reaching the exit condition):

    while (!finished) {
        for (int i=0; i<ARRAY_LENGTH; ++i) {
            ++intArray[i];
        }
        loopCount+=ARRAY_LENGTH;
        for (int i=0; i<ARRAY_LENGTH; ++i) {
            if (intArray[i] >= FINISH_TRIGGER) {
                loopCount-=ARRAY_LENGTH-i-1;
                finished=true;
                break;
            }
        }
    }

Now clang is able to vectorize the loop and reaches 3.5s vs. Java's 4.8s (GCC, unfortunately, is still not able to vectorize it).

rustyx
  • Knowing when unrolling is worth it is a hard problem. Profiling information helps gcc make that decision. However, unrolling is currently done very late in the optimization pipeline, too late to reap some of the benefits (that's a known limitation in gcc). Here, replacing `finished=true` with `break` accidentally helps a bit, though the code is still not as nice as the one produced by the JVM. – Marc Glisse Jun 05 '17 at 12:01
  • *"not as nice as the one produced by the JVM"*?? Do you consider the JVM version nice? And no, `finished` had no effect on performance whatsover. – rustyx Jun 05 '17 at 12:14
  • The JVM version you link to doesn't look so bad - why, do you think it is ugly? `finished` does have an effect for me (with profiling enabled in all cases): it decreases the running time from >6s to <4s, but it could easily be different for a different version of the compiler. – Marc Glisse Jun 05 '17 at 12:38
  • I don't understand how Java could even be close. I mean, the virtual machine has to read each instruction and then perform the action. Apart from cache misses due to that, it should be significantly slower. Or does Java compile byte code to machine code on the fly? – lalala Jun 05 '17 at 14:25