
In reference to this question, the accepted answer explains that the unsorted array takes more time because the branch predictor fails on it. But if we make a minor change to the program:

import java.util.Arrays;
import java.util.Random;


public class Main{

    public static void main(String[] args) {
        // Generate data
        int arraySize = 32768;
        int[] data = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c) {
            data[c] = rnd.nextInt() % 256;
        }

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i) {
            // Primary loop
            for (int c = 0; c < arraySize; ++c) {
                if (data[c] >= 128) {
                    sum = data[c];
                }
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

Here I have replaced (from the original question)

if (data[c] >= 128) 
    sum += data[c];

with

if (data[c] >= 128) 
    sum = data[c];

the unsorted array gives approximately the same timing as the sorted one. Why does branch prediction not make a difference in this case?

MYK
  • On my PC sorted array runs about 3 times faster. Which results do you get? I don't see anything strange. – Mikhail Jan 29 '14 at 13:30
  • 1
    which results are you comparing? – MYK Jan 29 '14 at 13:31
  • Sorted and unsorted arrays. And you? ;) – Mikhail Jan 29 '14 at 13:34
  • 1
    I've been using sorted and unsorted arrays too... taking about ~5 seconds on Core i5-3230M, usually unsorted is even faster, but that can be ignored because of randomness... – MYK Jan 29 '14 at 13:37
  • 4
    Please... that is not how you measure *anything* on the JVM. Java is not C, it has a very complex *runtime* which JIT-compiles the code at some point, which is not the initial point---and a myriad of other concerns which impact the measurement. – Marko Topolnik Jan 29 '14 at 13:39
  • 1
    I'm not measuring anything. In the context of the question referenced, the answer provides a very definite explanation that the delay is due to branch prediction failure; if that's the cause, then why is branch prediction not working in this scenario? – MYK Jan 29 '14 at 13:44
  • Your answer works for C, it does not work for Java. Java does not run on a processor that does branch prediction. Java runs on a virtual computing machine defined in behavior by the Java specifications. – zapl Jan 29 '14 at 13:45
  • 3
    @zapl "*Java does not run on a processor that does branch prediction*" => of course it does (or to be pedantic: the JVM that executes the java code does)! – assylias Jan 29 '14 at 13:50
  • @MarkoTopolnik although I agree with your general comment I don't understand the behaviour... – assylias Jan 29 '14 at 13:54
  • @assylias Yes that's the pedantic view but it is IMO the way you should think of Java. Especially if you try to apply low level hardware optimizations to a language that deliberately abstracts low level differences. – zapl Jan 29 '14 at 14:02
  • @assylias It is not clear whether OP has successfully reproduced the result from the original question. He only says what happened after the change. – Marko Topolnik Jan 29 '14 at 14:04
  • 1
    @zapl branch prediction can have [an impact on a Java application](http://stackoverflow.com/a/12480386/829571). I'm not saying it should be the primary concern but you can't say that it is not applicable to Java either. – assylias Jan 29 '14 at 14:06
  • 2
    @MarkoTopolnik I can reproduce what he describes: using `sum += data[c]` with or without sorting shows a perf difference (sorting 3x faster) but using `sum = data[c]` with or without sorting shows no perf difference. – assylias Jan 29 '14 at 14:07
  • 1
  • @assylias I reproduce as well. In addition, I find a significant systematic difference *between trials* (JVM runs), apparently depending on the exact data in the array. The timing *within a trial* is stable. For example: 17 ms, then 20 ms, then 24 ms. – Marko Topolnik Jan 29 '14 at 14:09
  • @assylias Forget sorting: use `data[c] = (c/8) % 256` and compare with `(c*31) % 256;` – Marko Topolnik Jan 29 '14 at 14:14
  • @assylias I didn't try to say it can't. But how / where / when is mostly up to the JVM because it can re-arrange your code in ways you did not write (e.g. add implicit array index checking branches which might interrupt the branch prediction happening in pure C or even skip executing code because it has no semantic (side)effect.) – zapl Jan 29 '14 at 14:15
  • @Mikhail HotSpot 7 64-bit, testing with `jmh`, not with OP's actual code. – Marko Topolnik Jan 29 '14 at 14:19
  • I'm getting even stranger results. Repeatably. Unsorted with +=: 9.8seconds; Unsorted with =: 3 seconds; Sorted with += 2 seconds; Sorted with =: 3 seconds. And warming up by running the loop a few times first doesn't impact this result significantly. (jdk1.7.0_17 on MacOS) – Erwin Bolwidt Jan 29 '14 at 14:21
  • On x32 1.6 and 1.7 it's not reproducible. – Mikhail Jan 29 '14 at 14:21
  • @ElliottFrisch true but I don't think that this is what happens here. – assylias Jan 29 '14 at 15:41
  • @ElliottFrisch Since the effect is fully reproducible after entirely eliminating the outer loop, the conclusion is that this loop rearrangement either doesn't happen or doesn't account for any speed advantage. – Marko Topolnik Jan 29 '14 at 21:15

1 Answer


I have used jmh to analyze this. Here is my code:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@OutputTimeUnit(TimeUnit.MICROSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 2, time = 1)
@Measurement(iterations = 3, time = 1)
@State(Scope.Thread)
@Fork(2)
public class Comparison
{
  static final int SIZE = 1<<15;
  final int[] data = new int[SIZE];

  @Setup
  public void setup() {
    int i = 1;
    for (int c = 0; c < SIZE; ++c) data[c] = (i*=611953);
    for (int c = 0; c < SIZE; ++c) data[c] = data[c] >= 128? 128 : 127;
  }

  @GenerateMicroBenchmark
  public long sum() {
    long sum = 0;
    for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
    return sum;
  }
}

Notice I don't use either sorting or random number generation; they are an unnecessary complication. With the formula used in the above code:

data[c] = (i*=611953);

I get 132 µs runtime. If I comment out the line involving

data[c] = data[c] >= 128? 128 : 127;

the time doesn't change at all. This eliminates all arithmetic considerations and focuses on branch prediction. If I use

data[c] = 127;

I get 13 µs, and if I use

data[c] = 128;

I get 16 µs. That's the "base case", stressing the difference between constant branching decisions.
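A minimal standalone sketch of this base case, using a naïve System.nanoTime harness instead of jmh (so the absolute numbers are only indicative; the class and method names here are my own invention):

```java
public class ConstantBranchDemo {
    static final int SIZE = 1 << 15;

    // Sum all elements >= 128. With a constant-filled array the branch
    // decision is identical on every iteration, so the predictor should be
    // nearly perfect whether the branch is always taken or never taken.
    static long sum(int[] data) {
        long sum = 0;
        for (int c = 0; c < SIZE; ++c) if (data[c] >= 128) sum += data[c];
        return sum;
    }

    static long timeIt(String label, int[] data) {
        long start = System.nanoTime();
        long s = 0;
        for (int i = 0; i < 1000; ++i) s += sum(data); // crude repeat/warm-up
        System.out.println(label + ": " + (System.nanoTime() - start) / 1000 + " µs total");
        return s;
    }

    public static void main(String[] args) {
        int[] never = new int[SIZE];   // branch never taken (all 127)
        int[] always = new int[SIZE];  // branch always taken (all 128)
        for (int c = 0; c < SIZE; ++c) { never[c] = 127; always[c] = 128; }
        timeIt("never taken ", never);
        timeIt("always taken", always);
    }
}
```

Both runs should land in the same ballpark, since neither branch is ever mispredicted after warm-up.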

My conclusion: this is definitely the effect of low-level branch prediction.

Could the JIT reverse the loop?

Let's analyze your intervention now. If I use the formula as presented in my code above, but change

if (data[c] >= 128) sum += data[c];

to

if (data[c] >= 128) sum = data[c];

then the timing indeed drops from 132 µs to 27 µs.

This is my guess at explaining the drop: an optimizing trick the JIT compiler can do is to reverse the direction of the loop. Now your code becomes

for (int c = SIZE-1; c >= 0; --c) if (data[c] >= 128) { sum = data[c]; break; }

The loop has been short-circuited to the minimum number of iterations needed to reach the same outcome as the original loop.

I added this

data[SIZE-1] = 128;

to the end of the setup() method, but it didn't change the timing. That would seem to invalidate the naïve version of the "loop reversal" conjecture.
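For what it's worth, the forward loop and the conjectured reversed, short-circuited loop do compute the same result, which is what made the conjecture tempting. A quick sketch checking the equivalence (class and method names are mine):

```java
public class ReversalCheck {
    // The original forward loop: the result is the last element >= 128
    // (or the initial 0 if there is none).
    static long forward(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) if (data[c] >= 128) sum = data[c];
        return sum;
    }

    // The conjectured JIT transformation: scan backwards and stop at the
    // first element >= 128, which is the last such element in forward order.
    static long reversed(int[] data) {
        long sum = 0;
        for (int c = data.length - 1; c >= 0; --c) {
            if (data[c] >= 128) { sum = data[c]; break; }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 15];
        for (int c = 0; c < data.length; ++c) data[c] = c % 256;
        System.out.println(forward(data) + " == " + reversed(data));
    }
}
```

The transformation is semantically valid; the timing experiment above just shows it is not what accounts for the speedup.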

No, it's probably cmovl

In analyzing assembly I find this:

cmp edx, 0x80
cmovl eax, ebx

cmovl is a conditional move instruction: it performs the effect of the assignment in the then branch without any jump, thereby eliminating any penalty associated with branch misprediction. This is a good explanation of the actual effect.
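The structural difference between the two variants can be sketched in plain Java (class and method names are mine): `sum += data[c]` carries a data dependency from one iteration to the next, while `sum = data[c]` is a pure selection between two values, which is exactly the shape a compiler can express as a conditional move.

```java
public class CmovShape {
    // Loop-carried dependency: each addition needs the previous sum, so a
    // mispredicted branch here stalls work the next iteration depends on.
    static long accumulate(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) if (data[c] >= 128) sum += data[c];
        return sum;
    }

    // Plain overwrite, rewritten as the selection it is equivalent to:
    // sum = (data[c] >= 128) ? data[c] : sum. A pure selection like this
    // maps directly onto cmovl, with no jump at all.
    static long select(int[] data) {
        long sum = 0;
        for (int c = 0; c < data.length; ++c) sum = data[c] >= 128 ? data[c] : sum;
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 15];
        for (int c = 0; c < data.length; ++c) data[c] = c % 256;
        System.out.println("accumulate = " + accumulate(data));
        System.out.println("select     = " + select(data));
    }
}
```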

Marko Topolnik
  • @assylias I have substantially changed my answer. – Marko Topolnik Jan 29 '14 at 15:26
  • Well spotted detective. CMOVL definitely removes the branch. Smart JIT! – assylias Jan 29 '14 at 16:46
  • @assylias: Smart JIT, but [only sometimes](http://stackoverflow.com/questions/19689214/strange-branching-performance). CMOVL could be used be for `+=` as well, and I'd say that JIT does it sometimes, and the decision isn't optimal. – maaartinus Jan 29 '14 at 19:17
  • @maaartinus Indeed - I remember that post. – assylias Jan 29 '14 at 19:22
  • @maaartinus There is a hint of that in the post linked to by OP in this question. In that question, a respondent in the comment said he is not observing the difference at all, with an appropriate -O setting on gcc. – Marko Topolnik Jan 29 '14 at 19:31
  • That corresponds with my question assuming the JIT chooses between a branch and a conditional move based on profiling (which is very good) but grossly underestimates the cost of branching (which is pretty bad). In the comment you kindly pointed me to, `-O3` leads to a conditional move, while `-O2` leads to branching. Not sure if such behavior makes sense. – maaartinus Jan 29 '14 at 19:52
  • Good catch. I believe a decent c/c++ compiler would have done something similar. – Leeor Feb 02 '14 at 20:55