
Despite "warming up" via repetitions, and a bit of complexity added to the algorithm, parallelSetAll() seems consistently slower here. I'm not really trying to micro-benchmark here, just getting a rough feel for what's happening.

import java.util.*;
import java.time.*;

public class SlowParallelSetAll {
  static final int SIZE = 20_000_000;
  // Time one invocation of test, printing and returning elapsed milliseconds.
  static long timeIt(Runnable test) {
    Instant start = Instant.now();
    test.run();
    long millis = Duration.between(start, Instant.now()).toMillis();
    System.out.println(millis);
    return millis;
  }
  public static void main(String[] args) {
    int reps = 10;
    long[] la = new long[SIZE];
    for(int i = 0; i < reps; i++)
      timeIt(() -> Arrays.setAll(la, n -> n * n * 11 + n * 7));
    System.out.println("###");
    for(int i = 0; i < reps; i++)
      timeIt(() -> Arrays.parallelSetAll(la, n -> n * n * 11 + n * 7));
  }
}
/* Output:
38
37
35
34
35
34
35
34
34
35
###
52
42
43
44
46
46
44
44
43
43
*/

The generator (the lambda expression) depends only on the index value n, so each element can be computed independently; it therefore seems like it should be easily parallelizable.
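
To illustrate the independence, here is a hand-rolled equivalent of the parallel fill, a minimal sketch using a parallel IntStream over the indices (the class scaffolding is mine, added just to make it runnable; this mirrors what the parallelSetAll documentation describes, though the JDK's internals may differ):

import java.util.stream.IntStream;

public class ManualParallelFill {
  static final int SIZE = 20_000_000;
  public static void main(String[] args) {
    long[] la = new long[SIZE];
    // Each slot is a pure function of its own index, so the work splits
    // cleanly across threads: no shared mutable state beyond disjoint
    // array slots.
    IntStream.range(0, la.length)
             .parallel()
             .forEach(n -> la[n] = n * n * 11 + n * 7);
    System.out.println(la[SIZE - 1]);  // use the result so the fill isn't dead code
  }
}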

Moving the calls around, alternating the two approaches, etc. does produce different results, but it also seems like there's something more here than micro-benchmarking noise. For one thing, I was expecting bigger time differences between the normal and parallel versions. Also, it seems like one might stumble into situations where the parallel version looks like the right choice but the normal version would actually be more appropriate. Basically, I'm looking for some insight on this, including whether there's an easy way to show that what I'm seeing is purely a micro-benchmarking phenomenon.

For what it's worth, here it is rewritten to use System.nanoTime() and to alternate the two tests. These results seem reasonable, although the whole micro-benchmarking issue is daunting:

import java.util.*;
import java.util.concurrent.*;

public class SlowParallelSetAll {
  static final int SIZE = 20_000_000;
  // Time one invocation of test, printing and returning elapsed milliseconds.
  static long timeIt(Runnable test) {
    long start = System.nanoTime();
    test.run();
    long delta = System.nanoTime() - start;
    long millis = TimeUnit.NANOSECONDS.toMillis(delta);
    System.out.println(millis);
    return millis;
  }
  public static void main(String[] args) {
    int reps = 10;
    long[] la = new long[SIZE];
    for(int i = 0; i < reps; i++) {
      timeIt(() -> Arrays.parallelSetAll(la, n -> n * n * 11 + n * 7));
      timeIt(() -> Arrays.setAll(la, n -> n * n * 11 + n * 7));
    }
  }
}
/* Output:
41
74
41
73
41
67
40
67
40
67
41
67
41
67
40
67
40
67
40
67
*/
user1677663
  • How much slower? Try reversing the order of the calls and executing them alternately, then reverse that order. You'll often get different results. – JimmyJames Jan 21 '16 at 16:48
  • What kind of system are you executing this on? If you only have one processor, for example, you'd expect that a parallel algorithm would always be slower than the serial version since you pay the parallelism overhead without getting any benefit. – jacobm Jan 21 '16 at 17:02
  • 1
    Measuring the timing with an Instant is probably not such a good idea. You should use nanotime. Or even better, use a proper benchmarking libray, such as jmh. – assylias Jan 21 '16 at 17:09
  • Sorry about that. 4 cores, 8 hardware threads. Re Instant vs. nanoTime: can you point me somewhere that shows why one is better than the other? Thanks. – user1677663 Jan 21 '16 at 18:04
  • 1
    @user1677663 `Instant.now()` uses `System.curretTimeMillis()` under the hood: http://stackoverflow.com/a/1776053/829571 – assylias Jan 21 '16 at 18:22
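
To make that concrete, here is a throwaway snippet (my own, not from the thread) that counts how often each clock actually advances in a tight loop. On Java 8, Instant.now() typically ticks once per millisecond, while System.nanoTime() advances on nearly every call:

import java.time.Instant;

public class ClockGranularity {
  public static void main(String[] args) {
    Instant prev = Instant.now();
    long prevNano = System.nanoTime();
    int instantTicks = 0, nanoTicks = 0;
    for(int i = 0; i < 1_000_000; i++) {
      Instant now = Instant.now();
      if(!now.equals(prev)) { instantTicks++; prev = now; }
      long nano = System.nanoTime();
      if(nano != prevNano) { nanoTicks++; prevNano = nano; }
    }
    // Expect far fewer Instant ticks than nanoTime ticks on Java 8,
    // reflecting the millisecond-granularity clock behind Instant.now().
    System.out.println("Instant ticks:  " + instantTicks);
    System.out.println("nanoTime ticks: " + nanoTicks);
  }
}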

1 Answer


Using JMH, I get the following results (size is the length of the array). The score is the runtime per call, in microseconds (smaller = better).

Benchmark                     (size)  Mode  Cnt       Score       Error  Units
SO34929316.parallelSetAll          1  avgt   20       0.077 ±     0.003  us/op
SO34929316.parallelSetAll       1000  avgt   20       9.935 ±     0.478  us/op
SO34929316.parallelSetAll     100000  avgt   20      56.024 ±     7.008  us/op
SO34929316.parallelSetAll    1000000  avgt   20     518.495 ±    75.331  us/op
SO34929316.parallelSetAll   10000000  avgt   20    5640.574 ±   139.324  us/op
SO34929316.setAll                  1  avgt   20       0.016 ±     0.001  us/op
SO34929316.setAll               1000  avgt   20       1.257 ±     0.023  us/op
SO34929316.setAll             100000  avgt   20     116.760 ±     3.116  us/op
SO34929316.setAll            1000000  avgt   20    1168.868 ±    42.153  us/op
SO34929316.setAll           10000000  avgt   20   12347.399 ±   766.931  us/op

As you would expect, parallelSetAll is faster for larger arrays and slower for smaller arrays. However, the speedup factor is only about 2x according to this test (run on Win 8.1 / Xeon 2630 v2 / 6 cores (x2)).
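
One possible (speculative) contributor to the modest speedup: the generator itself is so cheap that the fill is dominated by memory writes. Back of the envelope from the numbers above: at size 10,000,000 the array is 10,000,000 × 8 bytes = 80 MB; setAll takes ~12.3 ms, i.e. roughly 6.5 GB/s of writes, while parallelSetAll at ~5.6 ms works out to roughly 14 GB/s. If that is approaching the machine's sustainable memory bandwidth, adding more cores cannot buy much more.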


For reference, the code:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)  // matches the us/op figures above
public class SO34929316 {

  @Param({"1", "1000", "100000", "1000000", "10000000"}) int size;
  long[] array;

  @Setup(Level.Iteration)
  public void setup(){
    array = new long[size];
  }

  @Benchmark
  public void setAll(Blackhole bh) {
    Arrays.setAll(array, n -> n * n * 11 + n * 7);
    bh.consume(array);
  }

  @Benchmark
  public void parallelSetAll(Blackhole bh) {
    Arrays.parallelSetAll(array, n -> n * n * 11 + n * 7);
    bh.consume(array);
  }
}
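
In case it's useful, this is roughly how the benchmark can be launched from code (standard JMH runner boilerplate, not part of the benchmark itself; it assumes jmh-core and the JMH annotation processor are on the classpath):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
  public static void main(String[] args) throws RunnerException {
    // Run every @Benchmark method in SO34929316, across all @Param sizes;
    // results print to stdout.
    Options opt = new OptionsBuilder()
        .include(SO34929316.class.getSimpleName())
        .build();
    new Runner(opt).run();
  }
}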
assylias
  • 1
    When showing parallel benchmarks, it is a good idea to also say what system you are running on (OS, chip, cores). – Brian Goetz Jan 21 '16 at 17:34
  • @assylias thanks much for this; it definitely helps my thinking on the issue. – user1677663 Jan 21 '16 at 18:10
  • 1
    Did you check the CPU load during the benchmarking? It would be interesting to compare the speedup with the actual number of busy codes. On my machine, it looks like not using all cores… – Holger Jan 21 '16 at 19:11
  • 1
    @Holger Same here - but the load gets higher with larger sizes (~50/60% load for 100,000 but ~80/85% for 1m and 10m) - I suspect that, due to the tasks being very short, a lot of time is spent scheduling vs. actually doing something useful... – assylias Jan 21 '16 at 19:17