3

I am trying to benchmark how fast Java can do a simple task: read a huge file into memory and then perform some meaningless calculations on the data. All types of optimizations count, whether it's rewriting the code differently, using a different JVM, tricking the JIT, etc.

The input file is a list of 500 million pairs of 32-bit integers, each pair separated by a comma, like this:

44439,5023
33140,22257
...

This file takes 5.5GB on my machine. The program can't use more than 8GB of RAM and can use only a single thread.

package speedracer;

import java.io.FileInputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class Main
{
    public static void main(String[] args)
    {
        int[] list = new int[1000000000];

        long start1 = System.nanoTime();
        parse(list);
        long end1 = System.nanoTime();

        System.out.println("Parsing took: " + (end1 - start1) / 1000000000.0);

        int rs = 0;
        long start2 = System.nanoTime();

        for (int k = 0; k < list.length; k++) {
            rs = calc(list[k++], list[k++], list[k++], list[k]);
        }

        long end2 = System.nanoTime();

        System.out.println(rs);
        System.out.println("Calculations took: " + (end2 - start2) / 1000000000.0);
    }

    public static int calc(final int a1, final int a2, final int b1, final int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        return c1;
    }

    public static void parse(int[] list)
    {
        FileChannel fc = null;
        int i = 0;

        MappedByteBuffer byteBuffer;

        try {
            fc = new FileInputStream("in.txt").getChannel();

            long size = fc.size();
            long allocated = 0;
            long allocate = 0;

            while (size > allocated) {

               if ((size - allocated) > Integer.MAX_VALUE) {
                   allocate = Integer.MAX_VALUE;
               } else {
                   allocate = size - allocated;
               }

               byteBuffer = fc.map(FileChannel.MapMode.READ_ONLY, allocated, allocate);
               byteBuffer.clear();

               allocated += allocate;

               int number = 0;

               while (byteBuffer.hasRemaining()) {
                   char val = (char) byteBuffer.get();
                   if (val == '\n' || val == ',') {
                        list[i] = number;

                        number = 0;
                        i++;
                   } else {
                       number = number * 10 + (val - '0');
                   }
                }
            }

            fc.close();

        } catch (Exception e) {
            System.err.println("Parsing error: " + e);
        }
    }
}

I've tried everything I could think of: different readers, OpenJDK 6, Sun JDK 6, Sun JDK 7. I had to do some ugly parsing, since MappedByteBuffer cannot map more than 2GB of memory at once. I'm running:

   Linux AS292 2.6.38-11-generic #48-Ubuntu SMP 
   Fri Jul 29 19:02:55 UTC 2011 
   x86_64 GNU/Linux. Ubuntu 11.04. 
   CPU: Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz.

Currently my results are 26.50s for parsing and 11.27s for the calculations. I'm competing against a similar C++ benchmark that does the I/O in roughly the same time, but whose calculations take only 4.5s. My main objective is to reduce the calculation time by any means possible. Any ideas?

Update: It seems the main speed improvement could come from what is called auto-vectorization. I was able to find some hints that Sun's current JIT only does "some vectorization", but I can't really confirm it. It would be great to find a JVM or JIT with better auto-vectorization support.

Zilvinas
  • 262
  • 1
  • 9
  • Did the C++ application run on the same machine as your Java application? Cause if it was on a different machine, that could easily mean different performance characteristics. – Drizzt321 Sep 16 '11 at 23:31
  • You're going to get an ArrayIndexOutOfBoundsException with the calc parameters. You're over-allocating the list. Additionally, just remove the whole calc method: it doesn't do anything to the result or original data, nor does it store the result in any fashion. – monksy Sep 16 '11 at 23:34
  • did you already try with the -server switch? The server VM should be quite a bit faster than the default -client one. See http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client – fvu Sep 16 '11 at 23:40
  • @Drizzt321: Yes, the C++ application ran on the same machine. – Zilvinas Sep 16 '11 at 23:41
  • @monksy: forgot to mention I run the program with -Xmx6048m. The calc method is part of the task to see how fast Java can do these operations. – Zilvinas Sep 16 '11 at 23:41
  • Make sure you're running 64-bit server rather than 32-bit client, and that you have Java 7 (it's quite a bit faster than Java 6) – Luigi Plinge Sep 16 '11 at 23:44
  • @fvu: Tried it before. I think it's automatically server? Doesn't make a difference if I pass it explicitly. – Zilvinas Sep 16 '11 at 23:46
  • One thing that might work is to declare the methods to be final. This avoids some run-time type identification overhead, though I'm not sure how it affects static methods. – Jems Sep 16 '11 at 23:47
  • Also you should watch your processor usage. If it's only around 50% you can probably increase performance by doing half the calculations in another thread – Luigi Plinge Sep 16 '11 at 23:48
  • @Luigi: I'm running zilvinas@AS292:~$ java -version java version "1.7.0" Java(TM) SE Runtime Environment (build 1.7.0-b147) Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode). Nah It has to be on a single core ;) Have to mention it in the constraints. – Zilvinas Sep 16 '11 at 23:48
  • @fvu: On 64bit systems -server is default. – Zilvinas Sep 16 '11 at 23:55
  • @Jems: Tried it. Didn't make a difference. NetBeans suggests removing the final modifier from methods that are declared static. – Zilvinas Sep 16 '11 at 23:59
  • 1
    Do you really need a text file? Can't you save the ints as raw binary? Your file would be much smaller and it might run faster. There might be some issues with big/little endian if you work on different platforms. – toto2 Sep 17 '11 at 00:21
  • @toto2: It's a part of a benchmark constraints. The input file has to stay the way it is. – Zilvinas Sep 17 '11 at 00:29
  • Any idea how the C++ version was compiled? Which compiler? Version? Optimization level? Enhanced instruction sets? I'm wondering if it's even possible to beat the C++ version in the first place. – Mysticial Sep 17 '11 at 00:33
  • 1
    I'm asking this because that "meaningless computation" can be super-optimized to something extremely efficient... Perhaps the C++ compiler is able to do it, but not the Java compiler or JIT. – Mysticial Sep 17 '11 at 00:37
  • @Mysticial: It was compiled using `g++ -O3 -Wall -c -fmessage-length=0 -MMD -MP -MF"main.d" -MT"main.d" -o "main.o" "../main.cpp" g++ -o "perf" ./main.o -lboost_program_options` the compiler version is `g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2`. To my knowledge SSE2 was not used. – Zilvinas Sep 17 '11 at 00:56
  • I'm gonna add an answer in a moment. I'll include my "optimized" version of that loop. – Mysticial Sep 17 '11 at 01:22

7 Answers

4

First of all, -O3 enables:

-finline-functions
-ftree-vectorize

among others...

So it looks like it actually might be vectorizing.

EDIT: This has been confirmed (see comments). The C++ version is indeed being vectorized by the compiler. With vectorization disabled, the C++ version actually runs a bit slower than the Java version.

Assuming the JIT does not vectorize the loop, it may be difficult/impossible for the Java version to match the speed of the C++ version.


Now, if I were a smart C/C++ compiler, here's how I would arrange that loop (on x64):

int c1 = (a1 + a2) ^ a2;
int c2 = (b1 - b2) << 4;

int tmp0 = c1;
int tmp1 = 0;
int tmp2 = 0;
int tmp3 = 0;

int z0 = 0;
int z1 = 1;
int z2 = 2;
int z3 = 3;

do{
    tmp0 ^= z0 + c2;
    tmp1 ^= z1 + c2;
    tmp2 ^= z2 + c2;
    tmp3 ^= z3 + c2;
    z0 += 4;
    z1 += 4;
    z2 += 4;
    z3 += 4;
}while (z0 < 100);

tmp0 ^= tmp1;
tmp2 ^= tmp3;

tmp0 ^= tmp2;

return tmp0;

Note that this loop is completely vectorizable.

Even better, I would completely unroll this loop. These are things that a C/C++ compiler will do. But now the question is: will the JIT do it?

Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • Amazing attempt, however the result is still 11s. I'll try -O2 on the C++ version, see what happens, and let you know in a minute. – Zilvinas Sep 17 '11 at 01:38
  • The -O2 version takes 13s, so it's -O3 that makes the C++ version incredibly fast. – Zilvinas Sep 17 '11 at 01:48
  • I'm guessing the C++ version has the same identical loop? (the syntax is the same for basic loops like this) Just curious what would happen if you copied this loop into the C++ version. I wonder if it'd make the C++ version even faster... lol – Mysticial Sep 17 '11 at 01:49
  • Ah... just saw your comment... So `-O3` made the difference. Maybe it really is vectorizing it. Only way to find out is to see the assembly dump. :( – Mysticial Sep 17 '11 at 01:50
  • Yes, you were completely right. It's -ftree-vectorize that makes all the difference. I compiled it with `-O2 -ftree-vectorize` and it went from 13s to 4.8s. – Zilvinas Sep 17 '11 at 02:05
  • Oh, wow, didn't realize I was right on! So I guess the conclusion is: The Java version isn't gonna get much faster than 11s unless we can make it vectorize. – Mysticial Sep 17 '11 at 02:11
  • I will accept your answer within a few days if no one comes up with a magical solution or a JVM implementation that supports auto-vectorization like gcc ;) – Zilvinas Sep 17 '11 at 03:29
1

Use the HotSpot JVM in server mode, and make sure to warm it up. Also give enough time for the garbage collection algorithms to settle down to a stable pace if collection is a major part of your test. I don't see anything at a glance that makes me think it would be...
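
For reference, here is a minimal sketch of such a warm-up, assuming the calc() method from the question is kept as-is; the class name, call arguments and iteration count below are made up purely for illustration:

package speedracer;

// Hypothetical warm-up harness: call calc() enough times before the timed
// run that HotSpot compiles it, so the measurement excludes interpretation
// and JIT compilation of the hot method.
public class WarmUp
{
    public static void main(String[] args)
    {
        int sink = 0;
        for (int i = 0; i < 20000; i++) {      // ~10k calls usually trigger compilation
            sink ^= Main.calc(i, i + 1, i + 2, i + 3);
        }
        System.out.println("warm-up done: " + sink);  // keep the result live
    }
}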

Ryan Stewart
  • 126,015
  • 21
  • 180
  • 199
  • Good point, but I've already tried it. The 64-bit JVM runs in -server mode by default. Also, the calc method gets executed 250 million times, so HotSpot identifies it rather quickly. However, I still tried running the calc method beforehand and then once again while measuring the execution time, and it didn't make any difference. – Zilvinas Sep 17 '11 at 00:06
1

Interesting question. :-) This is probably more of a comment since I won't really answer your question, but it's too long for the comment box.

Micro-benchmarking in Java is tricky because the JIT can go nuts with optimizations. But this particular code tricks the JIT in such a way that it somehow cannot perform its normal optimizations.

Normally, this code would run in O(1) time because your main loop has no effect on anything:

    for (int k = 0; k < list.length; k++) {
        rs = calc(list[k++], list[k++], list[k++], list[k]);
    }

Note that the final result of rs doesn't really depend on running all iterations of the loop; just the last one. You can calculate the final value of "k" for the loop without having to actually run the loop. Normally the JIT would notice that and turn your loop into a single assignment, if it's able to detect that the function being called (calc) has no side effects (which it doesn't).

But, somehow, this statement in the calc() function messes up the JIT:

        c1 ^= z + c2;

Somehow that adds too much complexity for the JIT to decide that all this code in the end doesn't change anything and that the original loop can be optimized out.

If you change that particular statement to something even more pointless, like:

        c1 = z + c2;

Then the JIT picks things up and optimizes your loops away. Try it out. :-)

I tried locally with a much smaller data set: with the "^=" version, calculations took ~1.6s, while with the "=" version they took 0.007 seconds (or, in other words, it optimized the loop away).

As I said, not really a response, but I thought this might be interesting.
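
For anyone who wants to reproduce the effect on its own, here is a rough standalone sketch; the two methods simply copy the question's calc() body with "^=" and "=" respectively, and the iteration count is arbitrary and far smaller than the original benchmark:

// Times the same reduction loop with "^=" versus "=" in the inner loop,
// so the difference shows whether the JIT eliminates the inner loop.
public class DeadCodeDemo
{
    // Same body as the question's calc(), with "^=" in the inner loop.
    static int calcXor(int a1, int a2, int b1, int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }
        return c1;
    }

    // Same body with "=" instead; the JIT can reduce this to the last iteration.
    static int calcAssign(int a1, int a2, int b1, int b2)
    {
        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;
        for (int z = 0; z < 100; z++) {
            c1 = z + c2;
        }
        return c1;
    }

    public static void main(String[] args)
    {
        int n = 10000000;
        int rs = 0;

        long t = System.nanoTime();
        for (int k = 0; k < n; k++) {
            rs = calcXor(k, k + 1, k + 2, k + 3);
        }
        System.out.println("^= took: " + (System.nanoTime() - t) / 1000000000.0 + "s, rs = " + rs);

        t = System.nanoTime();
        for (int k = 0; k < n; k++) {
            rs = calcAssign(k, k + 1, k + 2, k + 3);
        }
        System.out.println("=  took: " + (System.nanoTime() - t) / 1000000000.0 + "s, rs = " + rs);
    }
}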

vanza
  • 9,715
  • 2
  • 31
  • 34
  • As @Mysticial already pointed out I've added a print statement to force JIT to run the loop. If you remove the XOR it still takes 0.007 seconds which I guess means it's just a lot faster but the whole loop is still being run. – Zilvinas Sep 17 '11 at 01:06
  • @Zilvinas: What print statement? If you mean the one that prints "rs", that doesn't affect the loop, because as I pointed out, you can calculate the value of rs without running the loop. Taking the XOR out makes the JIT realize that this particular loop can be optimized, which then causes it to realize that the loop in main() can also be optimized out. I just don't know why it doesn't do that with the XOR in place. I get the "0.007s" run time too if I remove the "rs =" assignment from the calc() call site, which means it's just not running the loop in those cases.. – vanza Sep 17 '11 at 01:13
  • 1
    If I comment out the `System.out.println(rs);` the loop runs in 0.007s. – Zilvinas Sep 17 '11 at 01:20
  • Exactly my point. Removing the XOR has the same effect, it makes the JIT realize the loop is pointless and just doesn't run it. Removing the print has the same effect, since the loop is modifying `rs` and `k` and now neither variable would be used after the loop. As I said, it doesn't explain how to make it faster, but it hints at why the JIT isn't being more helpful. – vanza Sep 17 '11 at 01:30
0

Did you try "inlining" parse() and calc(), i.e. put all the code in main()?

mbatchkarov
  • 15,487
  • 9
  • 60
  • 79
  • Yes, I did. That was the first version of my code. I believe it gets HotSpotted, JITed and inlined automatically very quickly. – Zilvinas Sep 17 '11 at 00:33
0

What is the score if you move the few lines of your calc function inside your list iteration?
I know it's not very clean, but you'll save the overhead of the method call.

[...]
    for (int k = 0; k < list.length; k++) {
        int a1 = list[k++];
        int a2 = list[k++];
        int b1 = list[k++];
        int b2 = list[k];

        int c1 = (a1 + a2) ^ a2;
        int c2 = (b1 - b2) << 4;

        for (int z = 0; z < 100; z++) {
            c1 ^= z + c2;
        }

        rs = c1;
    }
Destroyica
  • 4,147
  • 3
  • 33
  • 50
  • Makes no difference. I think HotSpot identifies a 'hot spot' very quickly, JITs the 'algorithm' and inlines it. – Zilvinas Sep 17 '11 at 00:32
0

The MappedByteBuffer is only contributing about 20% to I/O performance, and it comes at an enormous memory cost - if it causes swapping, the cure is worse than the disease.

I would use a BufferedReader around a FileReader, and maybe a Scanner around that to get the integers, or at least Integer.parseInt(), which is a lot more likely to have been warmed up by HotSpot than your own radix conversion code.
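
For comparison, here is a rough sketch of that approach, filling the same flat int[] layout as the question's parse(); the class name and buffer size are just assumptions, and it does no more error handling than the question's version:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch of the BufferedReader + Integer.parseInt() approach described above.
public class ReaderParse
{
    public static void parse(int[] list) throws IOException
    {
        BufferedReader in = new BufferedReader(new FileReader("in.txt"), 1 << 16);
        try {
            String line;
            int i = 0;
            while ((line = in.readLine()) != null) {
                int comma = line.indexOf(',');
                list[i++] = Integer.parseInt(line.substring(0, comma));
                list[i++] = Integer.parseInt(line.substring(comma + 1));
            }
        } finally {
            in.close();
        }
    }
}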

user207421
  • 305,947
  • 44
  • 307
  • 483
  • 1
    OK, fair point about the memory cost, however just for this benchmark I can spare the memory. BufferedReader, util.split and Integer.parseInt were my first choice; it actually took 8 minutes to read the file into memory. Just as an experiment I ran this code: `BufferedReader in = new BufferedReader(new FileReader("in.txt")); String line; int i = 0; while ((line = in.readLine()) != null) { list[i++] = 1; list[i++] = 2; }` and this took 47s to run. – Zilvinas Sep 17 '11 at 00:49
0

I am trying to benchmark how fast can Java do a simple task: read a huge file into memory and then perform some meaningless calculations on the data.

If the task is to do a meaningless calculation, then the best optimization is to not do the calculation.

If what you are really trying to do here is to figure out whether there is a general technique to make a computation go faster, then I think you are barking up the wrong tree. There is no such technique. What you learn from optimizing a meaningless calculation is not likely to apply to other (hopefully meaningful) calculations.

If the calculation is not meaningless, and the aim is to make the whole program go faster, you've probably already reached the point where optimization is a waste of time:

  • Current (Java) - 26.50s + 11.27s = ~38 seconds
  • Goal (C++) - ~26.5s + 4.5s = ~31 seconds
  • Percentage speedup - less than 20%.

A speedup of less than 20% for a ~40 second computation is probably not worth the effort. It is cheaper to get the user to twiddle his thumbs for those extra 7 seconds.


This is also telling you something interesting: in this scenario, it doesn't make much difference in relative terms whether you use C++ or Java. The overall performance of the program is dominated by a phase in which C++ and Java are comparable.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • While I agree it's a meaningless exercise, I'm looking for ideas from enthusiasts who enjoy tinkering out of plain curiosity, to see the ultimate limits (in a very narrow way) of a JVM. – Zilvinas Sep 17 '11 at 01:55