
EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks to both! :)

I've read a bit about how HotSpot has some "intrinsics" that it injects into the code, especially for the standard Java Math libs (from here).

So I decided to give it a try, to see how much of a difference HotSpot could make compared to doing the comparison directly (especially since I've heard min/max can compile to branchless asm).

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

That's my implementation. On another SO question I read that using the ternary operator uses an extra register; I haven't found significant differences between an if block and a ternary operator (i.e., return (a > b) ? a : b).

Allocating an 8 MB int array (i.e., 2 million values) and randomizing it, I run the following test:

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
    {
        int max = Integer.MIN_VALUE;

        for ( int i = 0; i < array.length; ++i )
        {
            max = OpsMath.max( max, array[i] );
            // max = Math.max( max, array[i] );
        }
    }

I'm using a Benchmark object in a try-with-resources block. When the block finishes, close() is called on the object and it prints the time the block took to complete. The tests are done separately by commenting in/out the max calls in the code above.

'max' is added to a list outside the benchmark block and printed later, so as to prevent the JVM from optimizing the whole block away.

The array is randomized each time the test runs.
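For reference, here is a minimal sketch of what such a Benchmark helper could look like; the actual class isn't shown in the question, so this reconstruction is an assumption:

public final class Benchmark implements AutoCloseable {
    // Hypothetical reconstruction: times the enclosing try-with-resources
    // block and prints the elapsed milliseconds when close() runs.
    private final String name;
    private final long start;

    public Benchmark(final String name) {
        this.name = name;
        this.start = System.nanoTime();
    }

    @Override
    public void close() {
        final double millis = (System.nanoTime() - start) / 1000000.0d;
        System.out.println(name + " " + millis);
    }
}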

Running the test 6 times gives these results:

Java standard Math:

millis to max 9.242167 
millis to max 2.1566199999999998
millis to max 2.046396 
millis to max 2.048616  
millis to max 2.035761
millis to max 2.001044 

So fairly stable after the first run, and running the tests again gives similar results.

OpsMath:

millis to max 8.65418 
millis to max 1.161559  
millis to max 0.955851 
millis to max 0.946642 
millis to max 0.994543 
millis to max 0.9469069999999999 

Again, very stable results after the first run.

The question is: why? That's quite a big difference, and I have no idea why. Even if I implement my max() method exactly like Math.max() (i.e., return (a >= b) ? a : b) I still get better results! It makes no sense.

Specs:

CPU: Intel i5 2500, 3.3 GHz. Java version: JDK 8 (public March 18 release), x64. Debian Jessie (testing release), x64.

I have yet to try with a 32-bit JVM.

EDIT: Self-contained test as requested. I added a line to force the JVM to preload the Math and OpsMath classes. That eliminates the 18 ms cost of the first iteration of the OpsMath test.

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " + 
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
    
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
        // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf(  max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max result:

IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047

OpsMath.max result:

IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984

Still the same overall results. I've tried randomizing the array only once and repeating the tests over the same array; I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.

TheStack
  • I get essentially identical code and timings---admittedly on an older JVM and hardware. Post a complete, self-contained benchmark and as many details as you can about your environment if you want folks to look at it. Also, learn to use `-XX:+PrintAssembly`. It will save your ass when you get confused by things like this. – tmyklebu Mar 31 '14 at 01:47
  • All right, I'll post a self-contained test; I don't have a +PrintAssembly-enabled JVM though. – TheStack Mar 31 '14 at 02:43
  • Is that `Benchmark` from `Caliper` (https://code.google.com/p/caliper/)? If not, read this: https://code.google.com/p/caliper/wiki/JavaMicrobenchmarks – David J. Liszewski Mar 31 '14 at 02:48
  • Posted self-contained benchmark. It isn't Caliper's Benchmark. I know it isn't a representative case of real-world usage (I certainly won't be calling max on 2 million integers any time soon). I'm just asking why the difference is so big (a 2x difference is quite important, and it scales if you use bigger array sizes). It's a pretty apples-to-apples comparison here, and the results are pretty much the same run after run. – TheStack Mar 31 '14 at 03:07
  • Please post the code you're running. `final double end = (System.nanoTime() - start;` won't compile, so this isn't what you ran. – Mike Samuel Mar 31 '14 at 03:15
  • added the parenthesis and changed "AltMath" to "OpsMath" if it makes such a difference :/ OpsMath is the real name of my math class. – TheStack Mar 31 '14 at 03:20
  • Yeah, this benchmark isn't measuring what you want it to measure. I'm about to add an answer regardless that picks apart the effects I see (factor two difference in the opposite direction). You should read about the things that go wrong when you try to benchmark Java code without knowing what you're doing; among other things, it's not clear to me that you're timing fully-optimised code rather than interpreted or OSR code. – tmyklebu Mar 31 '14 at 03:46
  • AFAIK the compilation threshold is 1 thousand invocations. Whatever Math.max optimization HotSpot has under its belt, it should kick in pretty fast here, since it's doing a couple million calls to it. – TheStack Mar 31 '14 at 03:58
  • `Math.max` is far from the only thing that needs to be inlined and compiled properly. – tmyklebu Mar 31 '14 at 04:01
  • Out of sheer interest I created a JMH-based benchmark - the same timings for Math.max and custom if-based and conditional-based implementations. As expected. – Oleg Estekhin Mar 31 '14 at 06:27
  • @OlegEstekhin: Would you mind sharing it? I've just [proposed to kick out the intrinsics](http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2014-April/013885.html). – maaartinus Apr 01 '14 at 12:25
  • By the way, none of those `final`s has any performance impact – Steve Kuo Apr 15 '14 at 16:14
  • Yep, I know; it's just a habit, especially on method parameters. – TheStack Apr 15 '14 at 22:01

4 Answers


It's hard to tell why Math.max is slower than OpsMath.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: on the n-th iteration, the probability of

Math.max( array[i], max );

not being equal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower as n grows, and given

final int[] array = new int[(8*1024*1024)/4];

it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability: it always takes the same amount of time to execute. The conditional move is faster than a branch if the branch is very hard to predict. On the other hand, a branch is faster if it can be predicted well with high probability. Currently, I'm unsure about the speed of conditional move compared to the best and worst cases of branching.¹

In your case, all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is all but guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).

This seems to happen for algorithms computing minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
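To put a number on "fairly predictable": the branch that updates the running max is taken only when array[i] exceeds every element before it, which for random data happens about ln(n) times in expectation over n elements. Here's a quick sketch (my illustration, not part of the original answer) that counts the taken branches:

import java.util.Random;

public class BranchTakenCount {
    public static void main(String[] args) {
        final int n = 2 * 1024 * 1024;
        final Random r = new Random();
        int max = Integer.MIN_VALUE;
        int taken = 0;
        for (int i = 0; i < n; ++i) {
            final int v = r.nextInt();
            if (v > max) { // the branch in question
                max = v;
                taken++;
            }
        }
        // Typically prints a number around 15 (ln(2^21) is about 14.5):
        // the branch is almost never taken after the first few elements.
        System.out.println("branch taken " + taken + " times out of " + n);
    }
}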


¹ Both a conditional move and a correctly predicted branch take one cycle. The problem with the former is that it needs both of its operands, and this takes an additional instruction. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.

As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in the comments. My own benchmark shows that the conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.

maaartinus
  • I did think about the probability of taking the same branch almost always after a few iterations, but I didn't know that conditional moves were a bit slower than predicted branches (and that mispredicted branches were slower than conditional moves). That explains the time difference nicely, thanks! – TheStack Mar 31 '14 at 08:15
  • @TheStack: Some time ago I ran into a similar [funny thing](http://stackoverflow.com/questions/19689214/strange-branching-performance). The exact timing is hard to predict and JIT doesn't get it always right. I'm going to fix/extend my answer now. – maaartinus Mar 31 '14 at 08:34
  • *"while it's a bit slower than a (correctly) predicted branch, it's much faster than a mispredicted one"* My measurements say it is the other way around: `cmov` is *much slower* if the branch can be predicted well whereas *only a bit faster* if the branch prediction fails. See [Why does tree vectorization make this sorting algorithm 2x slower?](http://stackoverflow.com/q/21055946/341970), the plot at the bottom of the question. To be clear here, I am only arguing "a bit" and "much" in the statement; as it stands now, I find the statement biased. Please remove "a bit" and "much". – Ali Mar 31 '14 at 13:15
  • @Ali: The branch misprediction penalty is something like 14 cycles, while both `cmov` and a predicted branch can execute in a single cycle (they both may be free if there are free units at a proper time; this is where the branch has a better chance). The results you linked are interesting, but my above link demonstrates that the misprediction penalty is pretty high. I'll edit my post when I get time to think hard about the case you linked. – maaartinus Mar 31 '14 at 16:53
  • @Ali: Answer updated. The overall effect can be like described in your question, but the timing of individual instructions is as I wrote (maybe re-read Yakk's comments). Agreed? – maaartinus Apr 01 '14 at 05:43
  • @maaartinus No, sorry. Please re-read that question, starting from "Based on Yakk's comment, I created a test." As for your update: *"the branching probability itself converges quickly to either 0 or 1, just like in this question."* I think this is the part that you misunderstood. If you re-read the question, you will see that **the branching probability is `1-p` all the time.** Branch prediction obviously breaks down miserably around `0.5` but even then it is only 1.58x slower. **It is a maliciously hand-crafted worst-case situation to break down branch prediction as much as possible.** – Ali Apr 01 '14 at 11:04
  • @maaartinus We are having this discussion because I find your answer in general OK but overly biased, suggesting that `cmov` is the winner. It isn't. It's not at all a clear cut even in the generic situation whether branch prediction or `cmov` is "better". Please *either* remove the overly biased statements, suggesting that `cmov` so much better than branch prediction and make the answer neutral w.r.t. branch prediction *or* create a not too twisted benchmark that breaks down branch prediction more than my example. We wouldn't be having this discussion if I didn't find your answer valuable. – Ali Apr 01 '14 at 11:13
  • @Ali: I indeed have to re-read your question as I indeed missed the "branching probability is 1-p" part. I'm **not** claiming that `cmov` is always the winner. I know that there are cases when it loses and they're not rare (such is life, otherwise making a good compiler would be way easier). My [benchmark results](http://stackoverflow.com/questions/19689214/strange-branching-performance) contradict yours, so I'm unsure what's right. – maaartinus Apr 01 '14 at 11:51
  • @maaartinus Maybe JIT / Java adds noise to your benchmark? My example was written in C++ where I find it is easier to reason about execution times. Unfortunately, your benchmark is far from trivial; I would have to re-write that in C++ and analyze it in great detail to understand what's going on in your code. – Ali Apr 01 '14 at 11:58
  • @maaartinus We could end this discussion if you allowed me to make small changes to your answer and tone it down. The only thing that I find problematic is that it is still easy to misinterpret your answer and after reading it, one would think that `cmov` is so much better than branch prediction. If I could make it neutral, we would be OK. Otherwise, I need a not too twisted benchmark. In my example, it clearly boils down to 3 instructions, given in the question, side-by-side. That's all. In your example, honestly, I don't see what's going on and it would take me hours to track it down. :( – Ali Apr 01 '14 at 12:14
  • @Ali: My benchmark could be simplified to `for (boolean x : arr) sum += x ? 1 : 0`, but in C++ the fact that bool is actually a small int could come in the way. But something like `for (int x : ar) sum += x>0 ? x : 0` should do. I've observed similar behavior in many benchmarks, so I don't think it won't happen in a very simple one. Yes, feel free to edit my answer... though I thought I already made it neutral. Would you mind posting a question like "How big is the misprediction penalty" to let people help us to resolve our discussion? – maaartinus Apr 01 '14 at 12:21
  • @maaartinus *"Would you mind posting a question like "How big is the misprediction penalty" to let people help us to resolve our discussion?"* Yes, it is on my agenda but it won't happen in the near future because I am way too busy to deal with this question. Sorry. :( It would be interesting though. – Ali Apr 01 '14 at 16:02
  • @maaartinus OK, I have changed your answer. I hope you don't mind and I also hope I have made your already good answer even better. As for your benchmark, it doesn't seem too complicated. Unfortunately, I won't have time to look at it before the week-end / early next week. If it could wait a little, I would appreciate it. Oh, and one more thing popped up: Branch prediction happens in the hardware. Perhaps it is the difference in hardware that causes the difference in timings in case of the two benchmarks (yours and mine). – Ali Apr 01 '14 at 16:30
  • @maaartinus If you can wait a week or two, I will do measurements on my laptop with your example and get back to you with the results. Maybe it all boils down to difference in hardware. – Ali Apr 01 '14 at 16:31
  • @Ali: Don't hurry. I slightly disagree with some things in your edit, but really only slightly. Let's look what we find out next week. I don't think HW can make such a big difference unless you're running some 286. The quality of the branch predictor doesn't matter as no smart algorithm can help (it's just random). And the pipeline is pretty deep. – maaartinus Apr 01 '14 at 16:49
  • @maaartinus OK, thanks. Let's freeze this discussion until further measurements. – Ali Apr 01 '14 at 17:08

When I run your (suitably-modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

Apart from the weirdly-placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times that does mostly what you'd expect---loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, here it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls or 8 separate cmovl chains to be merged after the loop is done.
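At the source level, the "8 separate cmovl chains merged after the loop" idea corresponds to a reduction with independent accumulators, roughly like the following hand-unrolled sketch (my illustration of the idea with four accumulators, not code HotSpot actually generates):

static int maxUnrolled(final int[] a) {
    // Independent accumulators keep the cmov chains from serializing
    // on a single register.
    int m0 = Integer.MIN_VALUE, m1 = Integer.MIN_VALUE;
    int m2 = Integer.MIN_VALUE, m3 = Integer.MIN_VALUE;
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
        m0 = Math.max(m0, a[i]);
        m1 = Math.max(m1, a[i + 1]);
        m2 = Math.max(m2, a[i + 2]);
        m3 = Math.max(m3, a[i + 3]);
    }
    for (; i < a.length; ++i) m0 = Math.max(m0, a[i]); // leftover tail
    // Merge the independent chains after the loop is done.
    return Math.max(Math.max(m0, m1), Math.max(m2, m3));
}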

With the explicit OpsMath.max, it turns into a rat's nest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, here it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, said rat's nest of branches is actually faster on some machines because the branch predictor works its magic on them and these are well-predicted branches.

I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues; you have to run it a lot more times than you are, and you have to factor your code differently if you want to time HotSpot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well HotSpot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.
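As a sketch of what "factoring the code differently" could mean in practice, here is a minimal JMH harness of the kind mentioned in the comments (the class and method names are my own; it assumes the JMH annotations are on the classpath and OpsMath is the class from the question):

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class MaxBench {
    int[] array;

    @Setup
    public void setup() {
        array = new int[2 * 1024 * 1024];
        final Random r = new Random(42);
        for (int i = 0; i < array.length; ++i) array[i] = r.nextInt();
    }

    @Benchmark
    public int mathMax() {
        int max = Integer.MIN_VALUE;
        for (final int v : array) max = Math.max(max, v);
        return max; // JMH consumes the return value, preventing dead-code elimination
    }

    @Benchmark
    public int opsMax() {
        int max = Integer.MIN_VALUE;
        for (final int v : array) max = OpsMath.max(max, v);
        return max;
    }
}

JMH forks a fresh JVM and runs warm-up iterations before measuring, which sidesteps the interpreted/OSR-code concerns above.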

tmyklebu
  • So what you're saying is that the timing results are a consequence of how the OpsMath.max changes the JIT output rather than any inherent property of it? (ie, it isn't faster it just changes the environment enough that times are sampled in a way that it looks faster). – TheStack Mar 31 '14 at 04:50
  • I don't think so. I believe that your benchmark precisely measures.... the speed of your benchmark. For something more reality-related you really need either JMH or Caliper. Getting a Java microbenchmark right is way [harder than you think](https://code.google.com/p/caliper/wiki/JavaMicrobenchmarks). But at first glance I can't see any mistake; you seem to have done it well. – maaartinus Mar 31 '14 at 06:29
  • @TheStack: `max` runs way too fast to be able to benchmark it in isolation on modern pipelined computers. The performance of a given `max` implementation will depend much more strongly on the instructions *around* it than the instructions *in* it. maaartinus seems to have identified the reason for the change; the ugly rat's nest of branches winds up being *faster* here because only log(n) of them ever get taken. – tmyklebu Mar 31 '14 at 14:28
  • @TheStack: While this is illustrative in the case of "find the max of my huge array in random order", it doesn't tell you what happens for "find the max of my huge array in some other order" or "which `max` should I use in my nontrivial piece of code." – tmyklebu Mar 31 '14 at 14:30

Using JDK 8:

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

On Ubuntu 13.10

I ran the following:

import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array;

  public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
    this.max = max;
    this.array = array;
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);

    m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);

    // total time over number of calls to max
    return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
  }

  public double averageTime(int repeats) {
    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
      cumulativeTime += time();
    return (double) cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array = new int[size];
    for (int i = 0; i < size; i++) array[i] = random.nextInt();

    double tMath = new MaxPerformance(Math::max, array).averageTime(100);
    double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
    double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

And I got the following results (average nanoseconds taken for each call to max):

Java Math: 15.443555810000003
Alt 1:     14.968298919999997
Alt 2:     16.442204045

So in the long run it looks like the second implementation is the fastest, although by a relatively small margin.

In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent of the previous one. This can be done by using two randomized arrays instead of one, as in this benchmark:

import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array1, array2;

  public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
    this.max = max;
    this.array1 = array1;
    this.array2 = array2;
    if (array1.length != array2.length) throw new IllegalArgumentException();
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
    m += m; // to avoid optimizations!

    return ((double) (System.nanoTime() - start)) / (double) array1.length;
  }

  public double averageTime(int repeats) {
    // warm up rounds:
    double tmp = 0;
    for (int i = 0; i < 10; i++) tmp += time();
    tmp *= 2.0;

    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
        cumulativeTime += time();
    return cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array1 = new int[size];
    int[] array2 = new int[size];
    for (int i = 0; i < size; i++) {
        array1[i] = random.nextInt();
        array2[i] = random.nextInt();
    }

    double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
    double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
    double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

Which gave me:

Java Math: 15.346468170000005
Alt 1:     16.378737519999998
Alt 2:     20.506475350000006

The way your test is set up makes a huge difference in the results. The JDK version seems to be the fastest in this scenario, this time by a relatively large margin compared to the previous case.

Somebody mentioned Caliper. Well, if you read the wiki, one of the first things they say about micro-benchmarking is not to do it: it's hard to get accurate results in general. I think this is a clear example of that.

Giovanni Botta
  • (1) Have you inspected the assembly code to make sure it's inlining the target of the `BiFunction` call in both cases? (2) You sure you aren't timing interpreted or OSR code? I'm almost sure you are. – tmyklebu Mar 31 '14 at 14:35
  • (1) That's a good point. I haven't. (2) I tried a different version with warm up rounds and I get the same results. The bottom line is that the test conditions are the same for all functions. I thought that might give a more "sterile" environment. However I might be terribly wrong! This is the pain of microbenchmarking. – Giovanni Botta Mar 31 '14 at 14:38
  • OSR = On Stack Replacement: http://www.azulsystems.com/blog/cliff/2011-11-22-what-the-heck-is-osr-and-why-is-it-bad-or-good – Fortega Mar 31 '14 at 14:42
  • Yeah I looked it up and tried testing with a bunch of warmup rounds (calls to the `time()` function that don't add to the cumulative sum). I get the same. – Giovanni Botta Mar 31 '14 at 14:43
  • The intention was correct: the timing should be *averaged* to smooth out potential "random" spikes. However, there are (at least) 2 issues: 1. You ignored the result (`m`) in the first test - so the functions *could* have been optimized away (you addressed this with your EDIT). And 2: Running the tests only once in a predefined order could introduce distortions, because in the first run the BiFunction is monomorphic (there is only one implementation), in the second it is bimorphic, and in the third polymorphic (see http://www.javaspecialists.eu/archive/Issue158.html ). – Marco13 Mar 31 '14 at 14:44
  • @Marco13 Great comment! How would you recommend addressing these issues? The tests could be run 3! times (one for each permutation of the three methods) to address #2. Would that do it? – Giovanni Botta Mar 31 '14 at 14:49
  • @tmyklebu How would I make sure the assembly code gets inlined? – Giovanni Botta Mar 31 '14 at 14:50
  • First of all: In this example, I did not find a noticeable difference between the mono/polymorphic cases. This might be due to some neat JIT tricks in Java 8, because usually this can make a *significant* difference. However, the *order* of the calls made a difference. The simplest solution for issue #2 would thus be to run only isolated tests for each case. (I think that in Caliper, each test is run by launching a new instance of the JVM!). ... ... ... – Marco13 Mar 31 '14 at 15:19
  • ... Alternatively, one could pull the code of `main` out into its own method and call this method several times, possibly with an increasing `size` parameter, ignoring the results of the first few runs. The order itself should then no longer be so important. But as you pointed out in your reference to Caliper: even then it's hard to make a really *meaningful* benchmark here. For example, the results also differ significantly with the numbers that are contained in the array (e.g. when you use `random.nextInt(10)`) - may be related to http://stackoverflow.com/questions/11227809 ...? – Marco13 Mar 31 '14 at 15:20

Here's a branchless min operation; max can be implemented by replacing diff = a - b with diff = b - a.

public static final long min(final long a, final long b) {
    final long diff = a - b;
    // All zeroes if a>=b, all ones if a<b because the sign bit is propagated
    final long mask = diff >> 63;
    return (a & mask) | (b & (~mask));
}

It should be as fast as streaming the memory because the CPU operations should be hidden by the sequential memory read latency.
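For completeness, here is the max counterpart with the subtraction flipped, as described above (my spelled-out version of that substitution):

public static final long max(final long a, final long b) {
    final long diff = b - a;
    // All zeroes if b>=a, all ones if b<a because the sign bit is propagated
    final long mask = diff >> 63;
    return (a & mask) | (b & (~mask));
}

Note that, as with the min above, the trick relies on the subtraction not overflowing: for operands of opposite sign that are far apart, the sign bit can flip and the wrong value is picked, so it's only safe when both inputs fit in an int.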

Serge Rogatch