
I do not want to start yet another pointless flamewar about whether Java or C++ is the better language in general. I want to know whether a comparison I did for a specific task is fair and the measured data correct.

We need to decide whether to use Java or C++ for our next project. I'm in the C++ camp but I want to have solid arguments for my case. Our application is special and has the following needs:

  • The program must run reasonably fast and be reasonably memory efficient. We do not care about the last 20% performance. However, a 10x performance difference is a show stopper.
  • We have lots of arrays. We do not know their size upfront. It is therefore important that arrays can grow at the back in amortized O(1) running time.
  • The elements in the arrays consist of a small number of basic data types. The typical example is a tuple of integers or floats.
  • The arrays can get large. 10^6 elements are standard. We have applications with 10^7 elements and supporting 10^8 would be great.

I implemented a toy program in C++ and in Java. First, I present the C++ version:

#include <iostream>
#include <vector>
#include <cstdlib>
using namespace std;

struct Point{
        float x, y;
};

int main(int argc, char*argv[]){
        int n = atoi(argv[1]);

        vector<Point> arr;

        for(int i=0; i<n; ++i){
                Point p;
                p.x = i;
                p.y = i+0.5f;
                arr.push_back(p);
        }

        float dotp = 0;
        for(int i=0; i<n; ++i)
                dotp += arr[i].x * arr[i].y;

        cout << dotp << endl;
}

Next is the Java version that does the same thing:

import java.util.*;

class Point{
        public float x, y;
}

class Main{
        public static void main(String[] args){
                int n = Integer.parseInt(args[0]);

                ArrayList<Point> arr = new ArrayList<Point>();

                for(int i=0; i<n; ++i){
                        Point p = new Point();
                        p.x = i;
                        p.y = i+0.5f;
                        arr.add(p);
                }

                float dotp = 0;
                for(int i=0; i<n; ++i)
                        dotp += arr.get(i).x * arr.get(i).y;

                System.out.println(dotp);
        }
}

I pass the number of elements to the program on the command line to prevent the optimizer from evaluating the computation at compile time. The computed value is not useful. The only interesting question is how fast the programs run and how much memory they use. I start with C++:

$ g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
$ g++ -O3 test.cpp -o test
$ /usr/bin/time ./test 1000000
3.33381e+17
0.01user 0.00system 0:00.02elapsed 100%CPU (0avgtext+0avgdata 10084maxresident)k
0inputs+0outputs (0major+2348minor)pagefaults 0swaps
$ /usr/bin/time ./test 10000000
3.36984e+20
0.08user 0.01system 0:00.09elapsed 100%CPU (0avgtext+0avgdata 134380maxresident)k
0inputs+0outputs (0major+4074minor)pagefaults 0swaps
$ /usr/bin/time ./test 100000000
2.42876e+23
0.77user 0.09system 0:00.87elapsed 99%CPU (0avgtext+0avgdata 1050400maxresident)k
0inputs+0outputs (0major+6540minor)pagefaults 0swaps

The "user" time is how long the program ran. For 10^6 elements, it ran for 0.01 sec, for 10^7 elements 0.08 sec, and for 10^8 elements 0.77 sec. "maxresident" is the amount of physical memory in kilobytes that the kernel gave the program. For 10^6 it's 10 MB, for 10^7 it's 132 MB, and for 10^8 it's 1 GB.

The memory consumption sounds right. An array with x elements needs sizeof(float)*2*x = 8*x bytes of memory. For 10^6 elements that is about 7.6 MB, for 10^7 about 76 MB, and for 10^8 about 763 MB.
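For concreteness, this back-of-the-envelope arithmetic can be checked with a few lines (a hypothetical helper, assuming 8 bytes of payload per element and ignoring the vector's growth slack):

```java
// Hypothetical helper: expected payload of n two-float elements, in MiB.
// Assumes 8 bytes per element (two 4-byte floats), as in the C++ layout,
// with no per-element object headers or references.
class MemEstimate {
    static double mib(long n) {
        return 8.0 * n / (1024 * 1024);
    }

    public static void main(String[] args) {
        for (long n : new long[]{1_000_000L, 10_000_000L, 100_000_000L})
            System.out.printf("%,d elements -> %.1f MiB%n", n, mib(n));
    }
}
```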

Next, I run the Java program:

$ javac -version
javac 1.6.0_41
$ javac Main.java
$ java -version
java version "1.7.0_131"
OpenJDK Runtime Environment (IcedTea 2.6.9) (7u131-2.6.9-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
$ /usr/bin/time java Main 1000000
3.33381168E17
0.16user 0.00system 0:00.09elapsed 173%CPU (0avgtext+0avgdata 79828maxresident)k
0inputs+64outputs (0major+4314minor)pagefaults 0swaps
$ /usr/bin/time java Main 10000000
3.3698438E20
5.23user 0.18system 0:02.07elapsed 261%CPU (0avgtext+0avgdata 424180maxresident)k
0inputs+64outputs (0major+13508minor)pagefaults 0swaps
$ /usr/bin/time java Main 100000000
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at Main.main(Main.java:14)
Command exited with non-zero status 1
3840.72user 13.06system 17:11.79elapsed 373%CPU (0avgtext+0avgdata 2281416maxresident)k
0inputs+1408outputs (0major+139893minor)pagefaults 0swaps

For 10^6 elements it needs 0.16 sec and 78 MB. For 10^7 elements it needs 5.23 sec and 414 MB. I tried to run the program for 10^8 elements but Java crashed. It used all cores of my machine (in a sequential program!) and ran for 17 min while occupying 2.2GB. My machine has 8 GB of memory.

For 10^6 elements C++ is 0.16 / 0.01 = 16 times faster and needs 78/10 = 7.8 times less memory. For 10^7 elements C++ is 5.23/0.08 = 65 times faster and needs 414/132 = 3.1 times less memory. Java did not finish on the test instance with 10^8 elements, while the C++ program finished in well below a second.

For 10^6 Java seems manageable but less than ideal. For 10^7 and 10^8 it is an absolute no-go. I expected a slight performance advantage of C++ over Java but not something this drastic.

The most likely explanation is that my testing methodology is wrong or that I have a non-obvious performance bottleneck in my Java code. Another explanation would be that the OpenJDK JVM significantly lags behind JVMs from other vendors.

Please explain to me why Java performs so badly in this benchmark. How did I unintentionally make Java look worse than it is?

Thanks

B.S.
  • Depending on `n`, the c++ version could be faster if you use `arr.reserve(n)`, or declare the vector as `vectorarr(n);` and use the `[]` operator instead of `push_back`. – juanchopanza Jun 01 '17 at 20:58
  • See: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – Jorn Vernee Jun 01 '17 at 21:03
  • @juanchopanza In our application, we have many arrays of which we do not know the size upfront. I therefore intentionally wrote the benchmark to not assume this. – B.S. Jun 01 '17 at 21:04
  • One thing that makes the C++ version so much faster is that the `Point` objects are stored *directly in the vector's heap allocation* whereas in the Java version, each `Point` object requires a totally separate heap allocation. I would suggest doing another benchmark using C# with a `struct Point` (not `class Point`) and a `List<Point>` as that more closely mirrors what's happening in your C++ code -- the `Point` objects would be allocated in a single array instead of as separate heap objects. – cdhowie Jun 01 '17 at 21:05
  • A pretty obvious and big difference here is that `vector` is just a bunch of points but `ArrayList` is a bunch of *references* to points. You could use `int[]`/`float[]`/etc (use only arrays of primitives) to emulate direct-data but of course you'd have to grow it manually.. – harold Jun 01 '17 at 21:05
  • If using C# is reasonable in your environment (consider using Mono if you need to be portable to non-Windows) then C#'s more reasonable generic array behavior with value types will be a tremendous advantage. – cdhowie Jun 01 '17 at 21:05
  • @B.S. Then you can use `n` as a parameter to optimize. A default constructed vector will not be optimal for most real life scenarios. – juanchopanza Jun 01 '17 at 21:06
  • I mean... I hate to be so basic about it, but Java is always going to perform worse; in the absolute best-case scenario (assuming the code was written competently) you're going to match C++ in terms of speed. C++ compiles straight to assembly, so it doesn't have to be JITted from byte code; Java does, and that is extremely inefficient, but that's how you keep it cross-platform. It's just the price you pay. – Trevor Hart Jun 01 '17 at 21:07
  • @JornVernee The total running time for 10^6 elements is an upper bound on the warm up time. The warmup time for 10^7 should be the same. This leaves about 5 sec of computation time for Java. – B.S. Jun 01 '17 at 21:07
  • @TrevorHart no, that's not true at all. A good JIT can beat statically compiled code in many workflows. C++ gives *control* over performance, not performance. Even profile-guided optimization is still static compilation and is only optimized for certain parameters, whereas a JIT can adapt on the fly to differing workflows. – xaxxon Jun 01 '17 at 21:07
  • You should try it with Java 8. You may find that it has a better JIT and more efficient garbage collection. Even if it hasn't, Java 8 is the only one that Oracle currently publicly supports. Java 1.6 went out of support years ago. Why are you still using it? – Klitos Kyriacou Jun 01 '17 at 21:11
  • @xaxxon that's only true in cases where the JIT has been highly optimized for specific circumstances. In general, though, if you took straight C++ and straight Java, C++ pretty much always wins or ties. And in the case where you highly optimize a JIT compiler, you can do the same thing for C++: write a highly optimized C++ compiler and highly optimized C++ code. If you compile down to the exact same instruction set, C++ will still win because it skips the entire JITting phase; with a JIT you're literally taking the same code but adding runtime overhead on top of it. – Trevor Hart Jun 01 '17 at 21:12
  • @cdhowie No, C# is not on the list of options. – B.S. Jun 01 '17 at 21:14
  • @TrevorHart regardless, none of that explains the magnitude of the differences involved. However, having what is essentially a vector of pointers to objects in java, compared to actually having the memory contiguous in C++ does, though. – xaxxon Jun 01 '17 at 21:15
  • @B.S. Java JITs things on a per-method basis, so your code will never get JITted (aside from `Point::new` and `ArrayList::add` and `ArrayList::get`). You only run `main` once, so it will never be a hotspot. – Jorn Vernee Jun 01 '17 at 21:15
  • But a realistic scenario _will_ involve collections of aggregates and Java will suffer due to cache locality. – Aluan Haddad Jun 01 '17 at 21:16
  • @TrevorHart I do not think that byte code vs native code is the problem here. I think that the problem is that the Java language is not expressive enough to support efficient arrays that can grow at the end. The problem would persist even if Java was compiled to native code. – B.S. Jun 01 '17 at 21:17
  • re-run your test with a `vector<unique_ptr<Point>>` instead and see what happens. That should more closely approximate what Java is doing and give you a better sense of why there is a difference in performance. But this highlights one of the great things about C++ -- the ability to closely control the performance characteristics of your program. – xaxxon Jun 01 '17 at 21:17
  • Regarding your requirement "It is therefore important that arrays can grow at the back in amortized O(1) running time." -- you do realize that, for both vector and ArrayList, every time it resizes it has to copy N elements to the new, bigger array? So not quite O(1). On the other hand, std::deque adds to the back without copying the existing elements, so it may be worth checking that out. – Klitos Kyriacou Jun 01 '17 at 21:19
  • @KlitosKyriacou that's what "amortized" means. Because of the vector's growth factor, it's still constant time. – xaxxon Jun 01 '17 at 21:19
  • Commenting out the calculation of `dotp` doesn't speed up the Java at all, and changing from an `Arraylist` to a `Point[]` barely speeds it up either. It seems most of the time in your program is being spent creating all of the `Point` objects, rather than the `ArrayList` itself being at fault. – azurefrog Jun 01 '17 at 21:23
  • Does B.S. stand for Bjarne Stoustrup? ;-) – Tristan Brindle Jun 01 '17 at 21:24
  • @TrevorHart Thanks for pointing out that the code only gets JITed when entering a function. I ran a second test, where I put the loop into a function. I first ran the function for 10^5 elements 10 times to warm up the JVM and then once using 10^7 elements. The results are very similar but the running times slightly decreased to 4.7 sec. The JIT not running thus explains a small part of the effect but not the primary slowdown. – B.S. Jun 01 '17 at 21:27
  • did you re-run the C++ with `vector<unique_ptr<Point>>` like I suggested? That will cause a memory allocation per object, which Java has to do, as well. Also, creating a type that is not trivially constructible will hurt the C++ timings as well -- and is important to test if your actual production data isn't trivially constructible. – xaxxon Jun 01 '17 at 21:28
  • Please note, I'm not suggesting you would actually write production C++ code to intentionally limit data locality -- just saying that it may help make more sense of the java timings you're seeing. – xaxxon Jun 01 '17 at 21:36
  • @xaxxon With `vector<unique_ptr<Point>>` C++ is still faster but the running times are significantly closer. So the explanation is indeed the additional pointer indirection. The JIT optimizer is not capable of getting rid of it even in this toy example. The question whether this is a fair comparison remains. Is there some way to get Java to use a sane memory layout through a sane interface? I do not consider resorting to a float[] and managing member offsets and the array doubling by hand at each invocation site to be sane. That is ASM-level micro management of the data structures. – B.S. Jun 01 '17 at 21:40
  • With parallel arrays you at least don't have "magic offsets" but just good old field-names (except the field is an array instead of the other way around if you know what I mean). And you can encapsulate them to hide the growing. – harold Jun 01 '17 at 21:46
  • No, the explanation is that having to do N allocations is awful. If you're not doing a lot of allocations once the data is initially built up (if your application is mostly queries over the entire running time of your application), then this isn't very meaningful. However, if you are accessing your data in sequence, having the data sequential in memory can give a massive speedup during queries. – xaxxon Jun 01 '17 at 22:13
  • Why don't you run the Java version in a profiler and see what takes the most time before making a final decision? Some quick performance tune-up suggestions: 1) Try to call arr.get(i) only once. 2) Try to adjust the ArrayList's initial capacity. Maybe even set it to 10^6, 10^7 or 10^8 and see how it impacts the performance. 3) Try to find another data structure that can perform faster, or even roll your own. – tsolakp Jun 01 '17 at 22:15
  • Yes the need for high performance with large arrays of smallish struct-like objects is imo the very weakest use case for Java (and any other language that doesn't support intrinsic struct arrays). Yes `vector<unique_ptr<Point>>` produces an internal C++ rep similar to Java's, but the point is you _can't_ describe the equivalent of `vector<Point>` in Java at all. In other areas, Java has advantages, but if this use case is critical for your application, you should have a pretty easy time winning the argument. – Gene Jun 02 '17 at 02:20
  • @Gene Thanks for confirming my suspicion. – B.S. Jun 02 '17 at 06:42
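(For illustration, harold's parallel-arrays suggestion from the comments could be sketched as follows. The `Points` class below is a hypothetical example, not code from the thread: one primitive array per field, grown together, so the data stays unboxed and contiguous.)

```java
import java.util.Arrays;

// Sketch of the parallel-arrays idea: "the field is an array instead of the
// other way around". One allocation per field per resize, no per-point objects.
class Points {
    float[] x = new float[16];
    float[] y = new float[16];
    int size = 0;

    void add(float px, float py) {
        if (size == x.length) {            // amortized O(1) growth, factor 2
            x = Arrays.copyOf(x, size * 2);
            y = Arrays.copyOf(y, size * 2);
        }
        x[size] = px;
        y[size] = py;
        ++size;
    }

    float dot() {                          // the benchmark's inner loop
        float d = 0;
        for (int i = 0; i < size; ++i) d += x[i] * y[i];
        return d;
    }
}
```

The growing is encapsulated, and accesses stay plain field names (`p.x[i]`) rather than "magic offsets".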

1 Answer


The JIT not running thus explains a small part of the effect but not the primary slowdown.

Right, Java is slow on startup because of the JIT, and it takes some time until it runs at full speed.

But the performance you describe is catastrophic and has another reason: You wrote

It used all cores of my machine (in a sequential program!)

and this must have been the garbage collector. The GC running hard means you're running out of memory. On my machine the times were

  28.689 millis for 1 M pairs
 143.104 millis for 10 M pairs
3100.856 millis for 100 M pairs
  10.404 millis for 1 M pairs
 113.054 millis for 10 M pairs
2528.371 millis for 100 M pairs

which is still a pain, but a possibly usable starting point. Observe that the second run is faster as it gets optimized better. Note that this is not how Java benchmarks should be written!
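For reference, a minimal warm-up harness along these lines might look like the sketch below (an assumed structure: the measurements above were not necessarily produced with exactly this code, and a serious benchmark would use JMH):

```java
import java.util.ArrayList;

// Minimal warm-up sketch: the hot loop lives in a method so the JIT can
// compile it, and it runs several times before we take a timing.
class Point { float x, y; }

class Bench {
    static float run(int n) {
        ArrayList<Point> arr = new ArrayList<>();
        for (int i = 0; i < n; ++i) {
            Point p = new Point();
            p.x = i;
            p.y = i + 0.5f;
            arr.add(p);
        }
        float dotp = 0;
        for (int i = 0; i < n; ++i)
            dotp += arr.get(i).x * arr.get(i).y;
        return dotp;
    }

    public static void main(String[] args) {
        int n = Integer.parseInt(args[0]);
        for (int i = 0; i < 10; ++i) run(100_000);   // warm up the JIT
        long t0 = System.nanoTime();
        float result = run(n);
        System.out.printf("%8.3f millis for %d pairs (result %g)%n",
                          (System.nanoTime() - t0) / 1e6, n, result);
    }
}
```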


The reason for the memory consumption is that you have a List of references to objects containing two floats instead of a vector of pairs of floats. Every reference adds 4 or 8 bytes of overhead, and every object adds even more (an object header plus alignment). Additionally, there's an indirection on every access.

If memory matters, then this isn't the right way to code it in Java. There's surely a better way (I may give it a try), but the code may get ugly. Java without value types sucks at such computations. IMHO it excels nearly everywhere else (personal opinion).

Equivalent C++ code would use vector<Point*>. If you do this, your code gets slower and uses more memory, but it's still better off than Java (which additionally pays for object headers).


I rewrote the code so it uses a PointArray storing the two floats next to each other in a single array. Without measuring anything, I claim that the memory consumption is about the same now. The times now are

  31.505 millis for 1 M pairs
 232.658 millis for 10 M pairs
1870.664 millis for 100 M pairs
  17.536 millis for 1 M pairs
 219.222 millis for 10 M pairs
1757.475 millis for 100 M pairs

which is still too slow. I guess it's the bounds checking, which can't be switched off in Java (unless you resort to Unsafe). Usually, the JIT (just-in-time compiler) can hoist the checks out of loops, which makes their cost negligible.

It may also be my slower array resizing (I grow by a factor of 1.5; IIRC vector uses a factor of 2). Or just bad luck if the array gets resized when you're nearly done. As you're doing almost nothing else with the array, the resizing may weigh a lot.

In any case, you need at least one good Java programmer if you want fast processing of arrays of primitives. It may take them a day or two to get good performance. A good library could help as well. Using List<Point> is just too inefficient before Java 10 (or 11, or ...).
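(The `PointArray` used above is not shown in the answer; a minimal sketch of the idea could look like this — interleaved floats in one flat array with factor-1.5 growth. The name matches the answer, but all details here are assumptions, not the answerer's actual code.)

```java
import java.util.Arrays;

// Hypothetical sketch of the PointArray idea: both floats of point i sit at
// data[2*i] and data[2*i+1] in one flat array -- one allocation in total.
class PointArray {
    private float[] data = new float[16];
    private int size = 0;                  // number of points stored

    void add(float x, float y) {
        if (2 * size == data.length) {     // grow by ~1.5x, amortized O(1)
            int newLen = data.length + data.length / 2;
            if ((newLen & 1) == 1) ++newLen;   // keep the length even
            data = Arrays.copyOf(data, newLen);
        }
        data[2 * size] = x;
        data[2 * size + 1] = y;
        ++size;
    }

    float x(int i) { return data[2 * i]; }
    float y(int i) { return data[2 * i + 1]; }
    int size()     { return size; }
}
```

The benchmark loop then reads `dotp += a.x(i) * a.y(i);`, with memory consumption close to the C++ vector.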

maaartinus
  • I ran a JMH benchmark where I used a specialized `PointList` that stores just the components as `float`s and got 102ms on the n=10^7. C++ with g++ 6.3.0 still wins though, with 0ms. Value types are just so useful in this scenario. – Jorn Vernee Jun 01 '17 at 23:29
  • If you want to penalise C++ further -- or make the test more like-for-like, if you prefer -- then give the `Point` class a vtable as well (adding an empty virtual destructor would do). – Tristan Brindle Jun 01 '17 at 23:50
  • @JornVernee `0ms` sounds like dead code optimization. Or using a funny formula for the result. – maaartinus Jun 02 '17 at 00:00
  • If he's running out of memory running the same test with java vs C++, that's an important thing to note in his decision making process, as well. – xaxxon Jun 02 '17 at 01:18
  • @xaxxon Sure, but 1. IMHO he knows it, 2. Java has about the same memory consumption when using something like my `PointArray`. – maaartinus Jun 02 '17 at 01:46
  • @maartinus Your first argument makes no sense to me. You cite the issue about the JIT not running the first time a function is called and then "explain" it with the GC running-out-of-memory issue. These two are both valid issues, but they are unrelated. I do not see a causal relation between them. – B.S. Jun 02 '17 at 06:49
  • PointArray might be a possible way out if we needed only one vector. However, reality is more complex. For example, vector>>> and variations of it occur (with of course the pair being a struct for readability). The example could represent a directed graph with node and edge weights. Writing all this doubling code by hand is error-prone and defeats the purpose of using a compiler. – B.S. Jun 02 '17 at 06:56
  • @maartinus What is different in Java 10? Is it correct that Java 10 is not yet available? – B.S. Jun 02 '17 at 06:58
  • @xaxxon My test machine had 8 GB of RAM. Java only used 2.2 GB. This means that it technically was not even an out-of-memory situation. Java just refused to allocate the additional memory required. There probably is some switch that makes the JVM take all of the 8 GB. However, the fact that I need to fine-tune JVM parameters based on the program input is very bad. I do not know the exact input when writing the program and deciding what parameters to use. – B.S. Jun 02 '17 at 07:11
  • Tuning the JVM is a critical part of running any Java app... one that they forget to tell you about when they say "write once, run anywhere". You're looking for -Xms and -Xmx: starting and max memory, which can be specified as 123M or 123G. Increasing the starting memory helps get rid of hiccups when you know it's going to immediately grow to a certain baseline size. Also make sure you're using a 64-bit Java runtime. – xaxxon Jun 02 '17 at 07:37
  • @B.S. This was an unfortunate formulation, fixed. +++ Anywhere your object contains another `vector`, the overhead may be well acceptable. Most problematic are tiny objects. +++ Java 9 is soon to come; Java 10 may get value objects, which work like structs in C. +++ Java by default uses some portion of the available memory; try starting it with `-Xmx7G` or alike. AFAIK C behaves like Java with `-Xmx99999999999G` (i.e., all you can eat). +++ I used `-Xmx12g -Xms12g`. Some tuning is necessary, but it costs much less time than chasing bugs (IMHO "undefined behavior" in C++ is damn evil). – maaartinus Jun 02 '17 at 09:01