Code becomes 6 times slower after doing extract method refactoring

Question

I know micro-benchmarking is hard. I'm not trying to build a poor micro-benchmark. Rather, I have run into this problem when making (what I thought to be) harmless refactoring. There is stripped down demo of the problem below.

The program builds an ArrayList of ten thousand random integers and then finds sum of the elements. In the example, summing is repeated a million times to improve signal v. noise ratio in the measurement of elapsed time. In real program, there's a million of slightly different lists, but the problematic effect holds regardless.

App#arraySumInlined is the method version before refactoring with summing kept inline in the loop body.
App#arraySumSubFunctionCall is the method version having loop body extracted into separate method.

Now, the surprising thing (to me) is that arraySumInlined takes ~7 sec, but arraySumSubFunctionCall takes ~42 sec. This seems to me like an impressive enough difference.

If I uncomment both arraySumInlined and arraySumSubFunctionCall then they complete in ~7 sec each. I.e. arraySumSubFunctionCall ceases to be that slow.

What's going on here? Are there any broader implications? E.g. never before have I thought of extract method refactoring as of something that could turn 7 sec method call into 42 sec one.

While researching this, I have found several questions involving JIT (e.g. Java method call performance and Why does this code using streams run so much faster in Java 9 than Java 8?), but they seem to deal with opposite cases: inlined code performing worse than code in a separate method.

Environment details: Windows 10 x64, Intel Core i3-6100.

λ java -version
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)

λ javac -version
javac 11.0.4

import java.util.ArrayList;
import java.util.Random;
import java.util.concurrent.TimeUnit;

public class App {

  public static void main(String[] args) {
    final int size = 10_000;
    final int iterations = 1_000_000;
    final var data = integerListWithRandomValues(size);

    //arraySumInlined(iterations, data);
    arraySumSubFunctionCall(iterations, data);
  }

  private static void arraySumSubFunctionCall(int iterations,
      final ArrayList<Integer> data) {
    final long start = System.nanoTime();
    long result = 0;
    for (int i = 0; i < iterations; ++i) {
      result = getSum(data);
    }
    final long end = System.nanoTime();
    System.out.println(String.format("%f sec (%d)",
        TimeUnit.NANOSECONDS.toMillis(end - start) / 1000.0, result));
  }

  private static void arraySumInlined(int iterations,
      final ArrayList<Integer> data) {
    final long start = System.nanoTime();
    long result = 0;
    for (int i = 0; i < iterations; ++i) {
      result = data.stream().mapToInt(e -> e).sum();
    }
    final long end = System.nanoTime();
    System.out.println(String.format("%f sec (%d)",
        TimeUnit.NANOSECONDS.toMillis(end - start) / 1000.0, result));
  }

  private static int getSum(final ArrayList<Integer> data) {
    return data.stream().mapToInt(e -> e).sum();
  }

  private static ArrayList<Integer> integerListWithRandomValues(final int size) {
    final var result = new ArrayList<Integer>();
    final var r = new Random();

    for (int i = 0; i < size; ++i) {
      result.add(r.nextInt());
    }

    return result;
  }
}

I haven't dug deep, but I'm pretty sure the reasons are somewhat similar to [this](https://stackoverflow.com/questions/54957772/lambda-performance-improvement-java-8-vs-11) or [this](https://stackoverflow.com/questions/25847397/erratic-performance-of-arrays-stream-map-sum) case. Try `-XX:MaxInlineLevel=15`. — apangin, Nov 15 '19 at 01:59
Might be specifically applicable until Java-11? With *OpenJDK 64-Bit Server VM (build 13+33)*, both the method seems to be taking the same time for execution (even if I run them individually.) on my machine, which is around ~8.7 sec though. — Naman, Nov 15 '19 at 02:39
It all boils down to the questions like, how good can the optimizer recognize that you are not using the result at all. — Holger, Nov 15 '19 at 15:21
@Holger Plausible, but not the case here. The effect still holds even if I mutate the list inside the loop and turn `result` into accumulator instead of effectively using only the value from the last iteration. — Ilya Kurnosov, Nov 15 '19 at 17:58
@apangin You're apparently right, setting `-XX:MaxInlineLevel=15` does make performance of both versions about the same. — Ilya Kurnosov, Nov 15 '19 at 18:10
In this case setting `-XX:InlineSmallCode` to a smaller value also seems to reduce run time. — jrook, Nov 15 '19 at 19:49
@IlyaKurnosov not using the end result is just the extreme case of redundant operations. Even when you use the end result, the Stream API still consists of a lot redundant work that can be eliminated if the optimizer’s horizon spans the entire operation, which is why it is not surprising that the inline level has such a big impact. — Holger, Nov 18 '19 at 09:06

score 0 · Answer 1 · answered Nov 15 '19 at 01:31

i did some experiments with your code, here are my conclusions :

1-if you put first arraySumSubFunctionCall() and then arraySumInlined() in your main() then execution time is back to different :

public static void main(String[] args) {
    ...
    arraySumSubFunctionCall(iterations, data);
    arraySumInlined(iterations, data); 
}

This means that JIT compiler optimization happens in arraySumInlined() and can be then be applied to arraySumSubFunctionCall().

2-if you replace your constant data.stream().mapToInt(e -> e).sum() by a really dynamic variable like new Random().nextInt() in both getSum() and arraySumInlined() then execution time is back to the same for arraySumSubFunctionCall() and arraySumInlined().

private static void arraySumInlined(int iterations,
      final ArrayList<Integer> data) {
    ...
    for (int i = 0; i < iterations; ++i) {
      result = new Random().nextInt();
    }
    ...
}


private static int getSum(final ArrayList<Integer> data) {
    return new Random().nextInt();
}

This means that the constant data.stream().mapToInt(e -> e).sum() is what is optimized in arraySumInlined() and then applied to arraySumSubFunctionCall().

In real life, i think that recalculating N time the same value in a local for loop doesn't happen so often, so you should'nt be scared by extract method refactoring, if needed for code readiness.

score 0 · Answer 2 · answered Nov 15 '19 at 02:28

For what it's worth, I did some experiments as well and found that it specifically is with the sum() method on the IntStream when executed inside the static method. I adjusted your code as follows so that I could get the average duration of each iteration:

import java.util.ArrayList;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

public class App2 {

    public static void main(String[] args) {
        final int size = 10_000;
        final int iterations = 1_000_000;
        final var data = integerListWithRandomValues(size);
        boolean inline = args.length > 0 && "inline".equalsIgnoreCase(args[0]);

        if (inline) {
            System.out.println("Running inline");
        } else {
            System.out.println("Running sub-function call");
        }

        arraySum(inline, iterations, data);
    }

    private static void arraySum(boolean inline, int iterations, final ArrayList<Integer> data) {
        long start;
        long result = 0;
        long totalElapsedTime = 0;

        for (int i = 0; i < iterations; ++i) {
            start = System.nanoTime();
            if (inline) {
                result = data.stream().mapToInt(e -> e).sum();
            } else {
                result = getIntStream(data).sum();
            }
            totalElapsedTime += getElapsedTime(start);
        }
        printElapsedTime(totalElapsedTime/iterations, result);
    }

    private static long getElapsedTime(long start) {
        return TimeUnit.NANOSECONDS.toNanos(System.nanoTime() - start);
    }

    private static void printElapsedTime(long elapsedTime, long result) {
        System.out.println(String.format("%d per iteration (%d)", elapsedTime, result));
    }

    private static IntStream getIntStream(final ArrayList<Integer> data) {
        return data.stream().mapToInt(e -> e);
    }

    private static int getSum(final ArrayList<Integer> data) {
        return data.stream().mapToInt(e -> e).sum();
    }

    private static ArrayList<Integer> integerListWithRandomValues(final int size) {
        final var result = new ArrayList<Integer>();
        final var r = new Random();

        for (int i = 0; i < size; ++i) {
            result.add(r.nextInt());
        }

        return result;
    }
}

Once I switched to the getIntStream() static method (after trying other permutations), the speed matched the inline execution time.

Code becomes 6 times slower after doing extract method refactoring

2 Answers2