96

I'm running Windows 8.1 x64 with Java 7 update 45 x64 (no 32 bit Java installed) on a Surface Pro 2 tablet.

The code below takes 1688ms when the type of i is a long and 109ms when i is an int. Why is long (a 64 bit type) an order of magnitude slower than int on a 64 bit platform with a 64 bit JVM?

My only speculation is that the CPU takes longer to add a 64 bit integer than a 32 bit one, but that seems unlikely. I suspect Haswell doesn't use ripple-carry adders.

I'm running this in Eclipse Kepler SR1, btw.

public class Main {

    private static long i = Integer.MAX_VALUE;

    public static void main(String[] args) {    
        System.out.println("Starting the loop");
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheck()){
        }
        long endTime = System.currentTimeMillis();
        System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheck() {
        return --i < 0;
    }

}

Edit: Here are the results from equivalent C++ code (below), compiled with VS 2013 on the same system. In 32-bit debug mode: long: 72265ms, int: 74656ms.

In 64-bit release mode: long: 875ms, long long: 906ms, int: 1047ms.

This suggests that the result I observed is JVM optimization weirdness rather than CPU limitations.

#include "stdafx.h"
#include <iostream>
#include <windows.h>
#include <limits.h>

using namespace std;

long long i = INT_MAX;

// C++ has no 'boolean' type; use 'bool'
bool decrementAndCheck() {
    return --i < 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "Starting the loop" << endl;

    // GetTickCount64() returns a ULONGLONG, so use a 64-bit variable
    unsigned long long startTime = GetTickCount64();
    while (!decrementAndCheck()) {
    }
    unsigned long long endTime = GetTickCount64();

    cout << "Finished the loop in " << (endTime - startTime) << "ms" << endl;

    return 0;
}

Edit: Just tried this again in Java 8 RTM, no significant change.

Techrocket9
  • 8
    The most likely suspect is your setup, not the CPU or the various parts of the JVM. Can you reliably reproduce this measurement? Not repeating the loop, not warming up the JIT, using `currentTimeMillis()`, running code that can trivially be optimized away completely, etc. reeks of unreliable results. –  Nov 07 '13 at 18:44
  • 1
    When I was benchmarking a while ago, I had to use a `long` as the loop counter, because the JIT compiler optimized the loop out when I used an `int`. One would need to look at the disassembly of the generated machine code. – Sam Nov 07 '13 at 18:46
  • 1
    @ZongZhengLi Yes, the result is quite repeatable. – Techrocket9 Nov 07 '13 at 18:48
  • 1
    can someone test this with a 32bit system? – Philipp Sander Nov 07 '13 at 18:55
  • 1
    I can reproduce under 1.7.0_25 64-bit -- 2355 vs 112 ms – tucuxi Nov 07 '13 at 18:59
  • 7
    This is not a correct microbenchmark, and I would not expect that its results reflect reality in any way. – Louis Wasserman Nov 07 '13 at 19:18
  • 1
    Wait, really? The C++ version is fifty times slower than the Java version? – chrylis -cautiouslyoptimistic- Nov 07 '13 at 19:21
  • Hmmm. Performance will improve about 30% if the loop is given time to "warm up" (must move the loop into separate method and invoke repeatedly). But the ratio between the two remains about the same. Having worked on the iSeries JVM where there would have been essentially no difference, I have to posit that the Sun/Oracle implementation is simply effed up in this area. – Hot Licks Nov 07 '13 at 19:44
  • 1
    The C++ code isn't equivalent because MSVC++ uses the LLP64 data model (`long` is 32 bits) and Java uses LP64. Try using `long long` in your C++ code. – dan04 Nov 07 '13 at 19:50
  • It's not the inherent speed of addition, which on Haswell (and most other 64bit platforms) is the same for 32bit and 64bit integers. See [here](http://users.atw.hu/instlatx64/GenuineIntel00306C3_Haswell_InstLatX64.txt) for the exact timings. – harold Nov 07 '13 at 19:52
  • @dan04 You are right about int and long being the same in MS C++. I forgot about that. The time on this system with a long long is 906ms. – Techrocket9 Nov 07 '13 at 19:56
  • You may want to look into using Java Microbenchmark Harness for performing tests of this kind: http://openjdk.java.net/projects/code-tools/jmh/ – Andrew Bissell Nov 07 '13 at 19:58
  • 8
    All of the comments berating the OP for failing to write a proper Java microbenchmark are unspeakably lazy. This is the sort of thing that's very easy to figure out if you just look and see what the JVM does to the code. – tmyklebu Nov 07 '13 at 20:03
  • 1
    @tmyklebu: What's wrong with the comments? The OP wrote a **totally broken benchmark** and the JVM makes some funny things out of it. The results are meaningless. Installing the disassembler and deciphering the output takes some time which could be better spend on a good benchmark. – maaartinus Nov 07 '13 at 20:41
  • 2
    @maaartinus: Accepted practice is accepted practice because it works around a list of known pitfalls. In the case of Proper Java Benchmarks, you want to make sure you're measuring properly optimised code, not an on-stack replacement, and you want to make sure your measurements are clean at the end. OP found a completely different issue and the benchmark he provided adequately demonstrated it. And, as noted, turning this code into a Proper Java Benchmark doesn't actually make the weirdness go away. And reading assembly code isn't hard. – tmyklebu Nov 07 '13 at 20:48
  • @tmyklebu Not to mention that the HotSpot JVM won't necessarily generate the relevant assembly until it decides that the code is hot enough to warrant it. – chrylis -cautiouslyoptimistic- Nov 07 '13 at 21:47
  • @tmyklebu: Trying to optimize any code I run into such weirdnesses again and again. The point is that *they don't matter at all*, the JVM is free to execute *cold or useless code in any stupid way* it wants. **What matters is how it optimizes useful and hot code**, and the one shown here is really far from it. – maaartinus Nov 07 '13 at 22:45
  • 1
    @maaartinus: You need to write the useful code somehow, though, and you want to be able to understand its performance. That's what microbenchmarking and looking at assembly code is for; you write a piece of useless code that does something that you hope is similar to your real workload, then you look at the generated assembly, then you draw whatever conclusions the results suggest. The key step here is looking at the generated code and deciding whether you're measuring what you wanted to, and no level of adherence to benchmark-writing dogma obviates the need for that step. – tmyklebu Nov 08 '13 at 05:18
  • I have the same difference: int 47ms, long 750ms. It seems that the issue appears only when decrementing by one; if you change `--i` to `i = i - 10`, the benchmark results for int and long are the same. – Robert Sep 26 '19 at 23:50

8 Answers

84

My JVM does this pretty straightforward thing to the inner loop when you use longs:

0x00007fdd859dbb80: test   %eax,0x5f7847a(%rip)  /* fun JVM hack */
0x00007fdd859dbb86: dec    %r11                  /* i-- */
0x00007fdd859dbb89: mov    %r11,0x258(%r10)      /* store i to memory */
0x00007fdd859dbb90: test   %r11,%r11             /* unnecessary test */
0x00007fdd859dbb93: jge    0x00007fdd859dbb80    /* go back to the loop top */

It cheats, hard, when you use ints; first there's some screwiness that I don't claim to understand but looks like setup for an unrolled loop:

0x00007f3dc290b5a1: mov    %r11d,%r9d
0x00007f3dc290b5a4: dec    %r9d
0x00007f3dc290b5a7: mov    %r9d,0x258(%r10)
0x00007f3dc290b5ae: test   %r9d,%r9d
0x00007f3dc290b5b1: jl     0x00007f3dc290b662
0x00007f3dc290b5b7: add    $0xfffffffffffffffe,%r11d
0x00007f3dc290b5bb: mov    %r9d,%ecx
0x00007f3dc290b5be: dec    %ecx              
0x00007f3dc290b5c0: mov    %ecx,0x258(%r10)   
0x00007f3dc290b5c7: cmp    %r11d,%ecx
0x00007f3dc290b5ca: jle    0x00007f3dc290b5d1
0x00007f3dc290b5cc: mov    %ecx,%r9d
0x00007f3dc290b5cf: jmp    0x00007f3dc290b5bb
0x00007f3dc290b5d1: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b5d5: mov    %r9d,%r8d
0x00007f3dc290b5d8: neg    %r8d
0x00007f3dc290b5db: sar    $0x1f,%r8d
0x00007f3dc290b5df: shr    $0x1f,%r8d
0x00007f3dc290b5e3: sub    %r9d,%r8d
0x00007f3dc290b5e6: sar    %r8d
0x00007f3dc290b5e9: neg    %r8d
0x00007f3dc290b5ec: and    $0xfffffffffffffffe,%r8d
0x00007f3dc290b5f0: shl    %r8d
0x00007f3dc290b5f3: mov    %r8d,%r11d
0x00007f3dc290b5f6: neg    %r11d
0x00007f3dc290b5f9: sar    $0x1f,%r11d
0x00007f3dc290b5fd: shr    $0x1e,%r11d
0x00007f3dc290b601: sub    %r8d,%r11d
0x00007f3dc290b604: sar    $0x2,%r11d
0x00007f3dc290b608: neg    %r11d
0x00007f3dc290b60b: and    $0xfffffffffffffffe,%r11d
0x00007f3dc290b60f: shl    $0x2,%r11d
0x00007f3dc290b613: mov    %r11d,%r9d
0x00007f3dc290b616: neg    %r9d
0x00007f3dc290b619: sar    $0x1f,%r9d
0x00007f3dc290b61d: shr    $0x1d,%r9d
0x00007f3dc290b621: sub    %r11d,%r9d
0x00007f3dc290b624: sar    $0x3,%r9d
0x00007f3dc290b628: neg    %r9d
0x00007f3dc290b62b: and    $0xfffffffffffffffe,%r9d
0x00007f3dc290b62f: shl    $0x3,%r9d
0x00007f3dc290b633: mov    %ecx,%r11d
0x00007f3dc290b636: sub    %r9d,%r11d
0x00007f3dc290b639: cmp    %r11d,%ecx
0x00007f3dc290b63c: jle    0x00007f3dc290b64f
0x00007f3dc290b63e: xchg   %ax,%ax /* OK, fine; I know what a nop looks like */

then the unrolled loop itself:

0x00007f3dc290b640: add    $0xfffffffffffffff0,%ecx
0x00007f3dc290b643: mov    %ecx,0x258(%r10)
0x00007f3dc290b64a: cmp    %r11d,%ecx
0x00007f3dc290b64d: jg     0x00007f3dc290b640

then the teardown code for the unrolled loop, itself a test and a straight loop:

0x00007f3dc290b64f: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b652: jle    0x00007f3dc290b662
0x00007f3dc290b654: dec    %ecx
0x00007f3dc290b656: mov    %ecx,0x258(%r10)
0x00007f3dc290b65d: cmp    $0xffffffffffffffff,%ecx
0x00007f3dc290b660: jg     0x00007f3dc290b654

So it goes 16 times faster for ints because the JIT unrolled the int loop 16 times, but didn't unroll the long loop at all.
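In Java terms, the effect of that transformation is roughly the following sketch (my own illustration of the unrolled shape, not actual JIT output; the class name is made up):

```java
public class UnrolledSketch {
    public static void main(String[] args) {
        int i = Integer.MAX_VALUE;
        // Main unrolled loop: 16 decrements collapsed into a single
        // subtraction, mirroring the add $0xfffffffffffffff0,%ecx above.
        while (i - 16 >= -1) {
            i -= 16;
        }
        // Teardown: finish any remaining iterations one at a time,
        // like the dec/cmp/jg tail in the disassembly.
        while (i >= 0) {
            i--;
        }
        System.out.println(i); // same final state as the --i < 0 loop: -1
    }
}
```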

For completeness, here is the code I actually tried:

public class foo136 {
  private static int i = Integer.MAX_VALUE;
  public static void main(String[] args) {
    System.out.println("Starting the loop");
    for (int foo = 0; foo < 100; foo++)
      doit();
  }

  static void doit() {
    i = Integer.MAX_VALUE;
    long startTime = System.currentTimeMillis();
    while(!decrementAndCheck()){
    }
    long endTime = System.currentTimeMillis();
    System.out.println("Finished the loop in " + (endTime - startTime) + "ms");
  }

  private static boolean decrementAndCheck() {
    return --i < 0;
  }
}

The assembly dumps were generated using the options -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly. Note that you need to mess around with your JVM installation for this to work; PrintAssembly depends on the hsdis disassembler plugin, and you need to put that shared library in exactly the right place or it will fail.

user253751
tmyklebu
  • I'm actually kind of surprised that the JIT can't eliminate all the calls to `decrementAndCheck()` as dead code since they have no side effects. – Andrew Bissell Nov 07 '13 at 20:02
  • @AndrewBissell: It could just turn `doit` into timing code around `i = 0`. It doesn't. It turns it into different things in ways that apparently depend on the data type. – tmyklebu Nov 07 '13 at 20:06
  • @tmyklebu What does it do if you change the comparison variable for the `int` from zero to something non-zero? I'm wondering if it doesn't realize it can do this optimization because of the convoluted comparison step required for the `long` version and if it would still unroll the `int` loop if it had to (logically) execute the subtract operation. – chrylis -cautiouslyoptimistic- Nov 07 '13 at 20:38
  • @chrylis: Same crazy setup/vectorised loop/teardown. At least when I changed 0 to 42. – tmyklebu Nov 07 '13 at 20:41
  • 10
    OK, so the net-net isn't that the `long` version is slower, but rather than the `int` version is faster. That makes sense. Likely not as much effort was invested in making the JIT optimize `long` expressions. – Hot Licks Nov 07 '13 at 21:45
  • 1
    ...pardon my ignorance, but what is "funrolled"? I can't even seem to google the term properly, and that makes this the first time I've had to ask someone what a word means on the internet. – BrianH Nov 08 '13 at 06:06
  • 1
    @BrianDHall `gcc` uses `-f` as the command-line switch for "flag", and the `unroll-loops` optimization is turned on by saying `-funroll-loops`. I just use "unroll" to describe the optimization. – chrylis -cautiouslyoptimistic- Nov 08 '13 at 08:14
  • @AndrewBissell it can't remove a function call, since you could swap it for a function that has side-effects at runtime. – BRPocock Nov 12 '13 at 18:19
  • 4
    @BRPocock: The Java compiler cannot, but the JIT sure can. – tmyklebu Nov 12 '13 at 19:48
  • 1
    Just to be clear, it didn't "funroll" it. It unrolled it AND converted the unrolled loop to `i-=16`, which of course is 16x faster. – Aleksandr Dubinsky Nov 23 '13 at 08:38
  • 1
    Do you think this is still true in 2022? – stewSquared Sep 16 '22 at 21:18
22

The JVM stack is defined in terms of words, whose size is an implementation detail but must be at least 32 bits wide. The JVM implementer may use 64-bit words, but the bytecode can't rely on this, and so operations with long or double values have to be handled with extra care. In particular, the JVM integer branch instructions are defined on exactly the type int.

In the case of your code, disassembly is instructive. Here's the bytecode for the int version as compiled by the Oracle JDK 7:

private static boolean decrementAndCheck();
  Code:
     0: getstatic     #14  // Field i:I
     3: iconst_1      
     4: isub          
     5: dup           
     6: putstatic     #14  // Field i:I
     9: ifge          16
    12: iconst_1      
    13: goto          17
    16: iconst_0      
    17: ireturn       

Note that the JVM will load the value of the static field i (offset 0), subtract one (3-4), duplicate the new value on the stack (5), and store it back into the field (6). It then does a compare-with-zero branch and returns.

The version with the long is a bit more complicated:

private static boolean decrementAndCheck();
  Code:
     0: getstatic     #14  // Field i:J
     3: lconst_1      
     4: lsub          
     5: dup2          
     6: putstatic     #14  // Field i:J
     9: lconst_0      
    10: lcmp          
    11: ifge          18
    14: iconst_1      
    15: goto          19
    18: iconst_0      
    19: ireturn       

First, when the JVM duplicates the new value on the stack (5), it has to duplicate two stack words. In your case, it's quite possible that this is no more expensive than duplicating one, since the JVM is free to use a 64-bit word if convenient. However, you'll notice that the branch logic is longer here. The JVM doesn't have an instruction to compare a long with zero, so it has to push a constant 0L onto the stack (9), do a general long comparison (10), and then branch on the value of that calculation.

Here are two plausible scenarios:

  • The JVM is following the bytecode path exactly. In this case, it's doing more work in the long version, pushing and popping several extra values, and these are on the virtual managed stack, not the real hardware-assisted CPU stack. If this is the case, you'll still see a significant performance difference after warmup.
  • The JVM realizes that it can optimize this code. In this case, it's taking extra time to optimize away some of the practically unnecessary push/compare logic. If this is the case, you'll see very little performance difference after warmup.

I recommend writing a correct microbenchmark to eliminate the effect of the JIT kicking in, and also trying this with a final condition that isn't zero, to force the JVM to do the same comparison on the int that it does with the long.
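A hand-rolled sketch of such a benchmark (the class name, warmup count, nonzero BOUND, and the reduced iteration count are all my own illustrative choices, not a prescription; a real harness such as JMH or caliper is still preferable):

```java
public class WarmupSketch {
    static final int ITERATIONS = 100_000_000; // smaller than Integer.MAX_VALUE to keep runs short
    static final int BOUND = 42;               // nonzero, so this is not the special compare-with-zero case
    static int i;

    static boolean decrementAndCheck() {
        return --i < BOUND;
    }

    static long timedRunNanos() {
        i = ITERATIONS;
        long start = System.nanoTime();
        while (!decrementAndCheck()) {
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Discarded warmup runs give the JIT a chance to compile the loop
        // before the measured run.
        for (int warm = 0; warm < 3; warm++) {
            timedRunNanos();
        }
        System.out.println("measured: " + (timedRunNanos() / 1_000_000) + "ms");
    }
}
```

Swapping the type of i to long gives the other half of the comparison.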

chrylis -cautiouslyoptimistic-
  • Should not this affect 32 bit JVMs as well? Other comments and answer indicates that 32 bit JVM doesn't show this symptom. – Katona Nov 07 '13 at 19:27
  • "[Forcing] the JVM to do the same comparison on the int that it does with the long" seems like adjusting your benchmark to stop measuring the difference it's trying to measure. – tmyklebu Nov 07 '13 at 19:31
  • 1
    @Katona Not necessarily. Very especially, the Client and Server HotSpot JVMs are completely different implementations, and Ilya didn't indicate selecting Server (Client is usually the 32-bit default). – chrylis -cautiouslyoptimistic- Nov 07 '13 at 19:34
  • 1
    @tmyklebu The issue is that the benchmark is measuring several different things at once. Using a nonzero terminal condition reduces the number of variables. – chrylis -cautiouslyoptimistic- Nov 07 '13 at 19:35
  • @chrylis: Yeah, the benchmark is measuring an actual phenomenon that happens with actual JVMs in a certain situation. There are issues (warmup and stuff) that sometimes matter, but they don't seem to come into play here. Playing around with it might make it stop measuring that phenomenon. That means you made a useless benchmark, not that the phenomenon doesn't exist. – tmyklebu Nov 07 '13 at 19:58
  • 1
    @tmyklebu The point is that the OP had intended to compare the speed of increments, decrements and comparisions on ints vs longs. Instead (assuming this answer is correct) they were measuring only comparisons, and only against 0, which is a special case. If nothing else, it makes the original benchmark misleading -- it looks like it measures three general cases, when in fact it measures one, specific case. – yshavit Nov 07 '13 at 20:17
  • @yshavit: Yeah, maybe he intended to compare those. He stumbled on a weird effect that's worth understanding---incidentally, a weird effect that making a Proper Java Benchmark wouldn't eliminate---and asked SO why it was happening. Making the code into a Proper Java Benchmark would only waste his time. – tmyklebu Nov 07 '13 at 20:19
  • 1
    @tmyklebu Don't get me wrong, I upvoted the question, this answer and your answer. But I disagree with your statement that @chrylis is adjusting the benchmark to stop measuring the difference it's trying to measure. OP can correct me if I'm wrong, but it doesn't look like they're trying to only/primarily measure `== 0`, which appears to be a disproportionately big part of the benchmark results. Seems to me more likely that OP is trying to measure a more general range of operations, and this answer points out that the benchmark is highly skewed towards just one of those operations. – yshavit Nov 07 '13 at 20:21
  • @yshavit: You try to measure the difference between `int` stuff and `long` stuff. Then you stumble upon this. Your immediate goal needs to shift; you need to figure out WTF is going on in your benchmark and why it's producing such implausible results before continuing with your original goal. The benchmark winds up measuring how much your JVM feels like unrolling a trivial loop, which means you need to write something less trivial---potentially far less trivial---to continue. – tmyklebu Nov 07 '13 at 20:26
  • @yshavit: Both. You seem to prefer trying all possible fixes rather than digging into the root cause of these sorts of things, though, and that's what I'm taking issue with. – tmyklebu Nov 07 '13 at 20:40
  • 2
    @tmyklebu Not at all. I'm all for understanding the root causes. But, having identified that one major root cause is that the benchmark was skewed, it's not invalid to change the benchmark to remove the skew, _as well as_ to dig in and understand more about that skew (for instance, that it can enable more efficient bytecode, that it can make it easier to unroll loops, etc). That's why I upvoted both this answer (which identified the skew) and yours (which digs into the skew in more detail). – yshavit Nov 07 '13 at 20:48
8

The basic unit of data in a Java Virtual Machine is the word. The word size is left to the JVM implementation, but it must be at least 32 bits; an implementation may choose a larger word size for efficiency, and there is no requirement that a 64-bit JVM use 64-bit words.

The underlying architecture doesn't dictate the word size either. The JVM reads and writes data word by word, which is why processing a long can take longer than an int.

Here you can find more on the same topic.

Vaibhav Raj
4

I have just written a benchmark using caliper.

The results are quite consistent with the original code: a ~12x speedup for using int over long. It certainly seems that the loop unrolling reported by tmyklebu or something very similar is going on.

timeIntDecrements         195,266,845.000
timeLongDecrements      2,321,447,978.000

This is my code; note that it uses a freshly-built snapshot of caliper, since I could not figure out how to code against their existing beta release.

package test;

import com.google.caliper.Benchmark;
import com.google.caliper.Param;

public final class App {

    @Param({""+1}) int number;

    private static class IntTest {
        public static int v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    private static class LongTest {
        public static long v;
        public static void reset() {
            v = Integer.MAX_VALUE;
        }
        public static boolean decrementAndCheck() {
            return --v < 0;
        }
    }

    @Benchmark
    int timeLongDecrements(int reps) {
        int k=0;
        for (int i=0; i<reps; i++) {
            LongTest.reset();
            while (!LongTest.decrementAndCheck()) { k++; }
        }
        return (int)LongTest.v | k;
    }    

    @Benchmark
    int timeIntDecrements(int reps) {
        int k=0;
        for (int i=0; i<reps; i++) {
            IntTest.reset();
            while (!IntTest.decrementAndCheck()) { k++; }
        }
        return IntTest.v | k;
    }
}
tucuxi
1

For the record, this version does a crude "warmup":

public class LongSpeed {

    private static long i = Integer.MAX_VALUE;
    private static int j = Integer.MAX_VALUE;

    public static void main(String[] args) {

        for (int x = 0; x < 10; x++) {
            runLong();
            runWord();
        }
    }

    private static void runLong() {
        System.out.println("Starting the long loop");
        i = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheckI()){

        }
        long endTime = System.currentTimeMillis();

        System.out.println("Finished the long loop in " + (endTime - startTime) + "ms");
    }

    private static void runWord() {
        System.out.println("Starting the word loop");
        j = Integer.MAX_VALUE;
        long startTime = System.currentTimeMillis();
        while(!decrementAndCheckJ()){

        }
        long endTime = System.currentTimeMillis();

        System.out.println("Finished the word loop in " + (endTime - startTime) + "ms");
    }

    private static boolean decrementAndCheckI() {
        return --i < 0;
    }

    private static boolean decrementAndCheckJ() {
        return --j < 0;
    }

}

The overall times improve about 30%, but the ratio between the two remains roughly the same.

Hot Licks
1

For the record:

If I use

boolean decrementAndCheckLong() {
    lo = lo - 1L;
    return lo < -1L;
}

(changing `--lo` to `lo = lo - 1L`), long performance improves by ~50%.

R.Moeller
1

It's likely due to the JVM checking for safepoints when long is used (uncounted loop), and not doing it for int (counted loop).
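If the safepoint poll is indeed the cause, the usual workaround is to split the long iteration space into int-sized chunks so that the hot inner loop is a counted int loop. A sketch of that technique (the class name, chunk size, and loop body are my own illustrative choices, not a claim about what HotSpot does to the question's exact code):

```java
public class ChunkedLoop {
    public static void main(String[] args) {
        long total = 2_200_000_000L; // deliberately exceeds Integer.MAX_VALUE
        long done = 0;
        long sum = 0;
        while (done < total) {
            int chunk = (int) Math.min(total - done, 1_000_000_000);
            // The inner bounds fit in an int, so HotSpot can treat this as a
            // counted loop and omit the per-iteration safepoint poll.
            for (int k = 0; k < chunk; k++) {
                sum += 1; // stand-in for the real loop body
            }
            done += chunk;
        }
        System.out.println(done == total && sum == total); // prints true
    }
}
```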

Some references: https://stackoverflow.com/a/62557768/14624235

https://stackoverflow.com/a/58726530/14624235

http://psy-lob-saw.blogspot.com/2016/02/wait-for-it-counteduncounted-loops.html

Vlad L
0

I don't have a 64 bit machine to test with, but the rather large difference suggests that there is more than the slightly longer bytecode at work.

I see very close times for long/int (4400 vs 4800ms) on my 32-bit 1.7.0_45.

This is only a guess, but I strongly suspect that it is the effect of a memory misalignment penalty. To confirm or deny the suspicion, try adding a `public static int dummy = 0;` before the declaration of i. That will push i down by 4 bytes in the memory layout and may make it properly aligned for better performance. (Confirmed not to be causing the issue.)

EDIT: The reasoning behind this was that the VM may not be able to reorder fields at its leisure or add padding for optimal alignment, since that could interfere with JNI. (Not the case, as the comments point out.)

Durandal
  • The VM certainly *is* allowed to reorder fields and add padding. – Hot Licks Nov 07 '13 at 19:49
  • JNI has to access objects through these annoying, slow accessor methods that take a few opaque handles anyway since GC can happen while native code is running. It's plenty free to reorder fields and add padding. – tmyklebu Nov 07 '13 at 20:38