Java 8 nested loops with streams & performance

Question

In order to practise the Java 8 streams I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a,b < 100) and takes ~0.135s on my Core i5 760.

public static int digitSum(BigInteger x)
{
    int sum = 0;
    for(char c: x.toString().toCharArray()) {sum+=Integer.valueOf(c+"");}
    return sum;
}

@Test public void solve()
{
    int max = 0;
    for(int i=1;i<100;i++)
        for(int j=1;j<100;j++)
            max = Math.max(max,digitSum(BigInteger.valueOf(i).pow(j)));
    System.out.println(max);
}

My solution, which I expected to be faster because of the parallelism, actually took 0.25s (0.19s without the parallel()):

int max =   IntStream.range(1,100).parallel()
            .map(i -> IntStream.range(1, 100)
            .map(j->digitSum(BigInteger.valueOf(i).pow(j)))
            .max().getAsInt()).max().getAsInt();

My questions

did I do the conversion right or is there a better way to convert nested loops to stream calculations?
why is the stream variant so much slower than the old one?
why did the parallel() statement actually increase the time from 0.19s to 0.25s?

I know that microbenchmarks are fragile and parallelism is only worth it for big problems but for a CPU, even 0.1s is an eternity, right?

Update

I measure with the Junit 4 framework in Eclipse Kepler (it shows the time taken for executing a test).

My results for a,b<1000 instead of 100:

traditional loop 186s
sequential stream 193s
parallel stream 55s

Update 2

Replacing sum+=Integer.valueOf(c+""); with sum+= c - '0'; (thanks Peter!) shaved 10 whole seconds off the parallel method, bringing it to 45s. Didn't expect such a big performance impact!

Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much as it reduced the time only to 44.8s (yes, it adds a and b=0 but I think this won't impact the performance much):

// 4 chunks of 250 values each, one per core (range(0, 3) would only cover a < 750)
int max = IntStream.range(0, 4).parallel()
          .map(m -> IntStream.range(0, 250)
          .map(i -> IntStream.range(1, 1000)
          .map(j -> digitSum(BigInteger.valueOf(250*m+i).pow(j)))
          .max().getAsInt()).max().getAsInt()).max().getAsInt();

How do you measure? As you point out, without appropriate care, the results of a micro benchmark can be misleading. — assylias, Feb 23 '14 at 13:42
I would replace `sum+=Integer.valueOf(c+"");` with `sum+= c - '0';` as this will be much faster. — Peter Lawrey, Feb 23 '14 at 13:48
FWIW you can replace the loop in `digitSum` with a stream using the `CharSequence.chars()` method. It avoids allocating the char array. — Stuart Marks, Feb 23 '14 at 17:14
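
For illustration, here is a minimal sketch of what the two comments above suggest when combined: `chars()` streams the characters of the string without copying them into a `char[]`, and `c - '0'` turns each digit character into its value without creating a temporary `String`. The class name `DigitSumSketch` is made up for this example, and the method assumes a non-negative input (a leading minus sign would be counted as if it were a digit).

```java
import java.math.BigInteger;

public class DigitSumSketch {

    // Sums the decimal digits of a non-negative BigInteger.
    // String.chars() yields an IntStream of the characters, so no char[] is allocated.
    public static int digitSum(BigInteger x) {
        return x.toString().chars().map(c -> c - '0').sum();
    }

    public static void main(String[] args) {
        // 12345 has the digits 1, 2, 3, 4, 5
        System.out.println(digitSum(BigInteger.valueOf(12345)));
        // 2^10 = 1024 has the digits 1, 0, 2, 4
        System.out.println(digitSum(BigInteger.valueOf(2).pow(10)));
    }
}
```

Whether this is faster than the plain loop with `c - '0'` would have to be measured; it mainly removes the array allocation.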

assylias · Accepted Answer · 2014-02-23T14:15:49.283

I have created a quick and dirty micro benchmark based on your code. The results are:

loop: 3192
lambda: 3140
lambda parallel: 868

So the loop and lambda are equivalent and the parallel stream significantly improves the performance. I suspect your results are unreliable due to your benchmarking methodology.

public static void main(String[] args) {
    int sum = 0;

    //warmup
    for (int i = 0; i < 100; i++) {
        solve();
        solveLambda();
        solveLambdaParallel();
    }

    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solve();
        }
        long end = System.nanoTime();
        System.out.println("loop: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambda();
        }
        long end = System.nanoTime();
        System.out.println("lambda: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambdaParallel();
        }
        long end = System.nanoTime();
        System.out.println("lambda parallel : " + (end - start) / 1_000_000);
    }
    System.out.println(sum);
}

public static int digitSum(BigInteger x) {
    int sum = 0;
    for (char c : x.toString().toCharArray()) {
        sum += Integer.valueOf(c + "");
    }
    return sum;
}

public static int solve() {
    int max = 0;
    for (int i = 1; i < 100; i++) {
        for (int j = 1; j < 100; j++) {
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
        }
    }
    return max;
}

public static int solveLambda() {
    return  IntStream.range(1, 100)
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

public static int solveLambdaParallel() {
    return  IntStream.range(1, 100)
            .parallel()
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

I have also run it with jmh, which is more reliable than manual tests. The results are consistent with the above (microseconds per call):

Benchmark                                Mode   Mean        Units
c.a.p.SO21968918.solve                   avgt   32367.592   us/op
c.a.p.SO21968918.solveLambda             avgt   31423.123   us/op
c.a.p.SO21968918.solveLambdaParallel     avgt   8125.600    us/op
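
On the first question (whether there is another way to convert the nested loop), one possible alternative is to flatten the two ranges into a single stream with `flatMap` and take one `max()` at the end, instead of a `max()` per inner stream. The sketch below is not part of the benchmark above and makes no performance claim; the class name `FlatMapSketch`, the `limit` parameter and the `orElse(0)` fallback for an empty range are choices made for this example.

```java
import java.math.BigInteger;
import java.util.stream.IntStream;

public class FlatMapSketch {

    // Digit sum using c - '0', as suggested in the comments on the question.
    public static int digitSum(BigInteger x) {
        int sum = 0;
        for (char c : x.toString().toCharArray()) {
            sum += c - '0';
        }
        return sum;
    }

    // Baseline: the original nested loop, with the upper bound as a parameter.
    public static int solveLoop(int limit) {
        int max = 0;
        for (int i = 1; i < limit; i++) {
            for (int j = 1; j < limit; j++) {
                max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
            }
        }
        return max;
    }

    // Flattened variant: every (i, j) pair contributes one digit sum to a single IntStream.
    public static int solveFlatMap(int limit) {
        return IntStream.range(1, limit)
                .flatMap(i -> IntStream.range(1, limit)
                        .map(j -> digitSum(BigInteger.valueOf(i).pow(j))))
                .max()
                .orElse(0); // empty range: same result as the loop's initial max
    }

    public static void main(String[] args) {
        System.out.println("loop:    " + solveLoop(100));
        System.out.println("flatMap: " + solveFlatMap(100));
    }
}
```

Both methods should print the same value; which form reads better is largely a matter of taste.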

I would be interested to see what you get if you run the tests in reverse order. — Peter Lawrey, Feb 23 '14 at 13:55
@PeterLawrey Same results (lambda parallel : 836, lambda: 3124, loop: 3184) — assylias, Feb 23 '14 at 13:56
Interesting results! Maybe mine are also due to Intel Turbo Boost (automatic overclocking if only one core is used)? However I am not so sure if Junit is really that unreliable at timekeeping because I repeated it several times and always got similar results. — Konrad Höffner, Feb 23 '14 at 14:08
@kirdie junit is reliable, but doesn't warm up the code for you. Warmed-up code can be many times faster than the first time you run it. — Peter Lawrey, Feb 23 '14 at 16:20
@delive [no!](http://stackoverflow.com/questions/20375176/should-i-always-use-a-parallel-stream-when-possible) - it happens to be faster for this specific code but that is not always true. — assylias, Dec 21 '15 at 10:20
http://stackoverflow.com/questions/24027247/java-8-streams-serial-vs-parallel-performance — , Dec 21 '15 at 11:06

Peter Lawrey · Answer 2 · 2014-02-23T13:53:11.850

The problem you have is that you are looking at sub-optimal code. When you have code which might be heavily optimised, you are very dependent on whether the JVM is smart enough to optimise it. Loops have been around much longer and are better understood.

One big difference in your loop code is that your working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache friendly and you have very short-lived objects. In the stream() case you are building up collections, so there is more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.

why is the stream variant so much slower than the old one?

Loops are fairly well optimised, having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.

why did the parallel() statement actually increased the time from 0.19s to 0.25s?

Most likely you have a bottleneck on a shared resource. You create quite a bit of garbage, but this is usually fairly concurrent. Using more threads only guarantees you will have more overhead; it doesn't ensure you can take advantage of the extra CPU power you have.

Hm but looking at my code I don't see any shared resource, I just use the Java library or do I overlook something? — Konrad Höffner, Feb 23 '14 at 13:54
@kirdie You have hardware resources you are sharing. e.g. your L3 cache, possibly L1/L2 cache, your main memory, and the garbage collector may play a part. — Peter Lawrey, Feb 23 '14 at 16:18
