
Recently, I was running some scalability experiments using Java Fork/Join. I used the non-default constructor ForkJoinPool(int parallelism), passing the desired parallelism (number of workers) as the constructor argument.

Specifically, I used the following piece of code:

public static void main(String[] args) throws InterruptedException {
    ForkJoinPool pool = new ForkJoinPool(Integer.parseInt(args[0]));
    pool.invoke(new ParallelLoopTask());
}

static class ParallelLoopTask extends RecursiveAction {

    final int n = 1000;

    @Override
    protected void compute() {
        RecursiveAction[] T = new RecursiveAction[n];
        for(int p = 0; p < n; p++){
            T[p] = new DummyTask();
            T[p].fork();
        }
        for(int p = 0; p < n; p++){
            T[p].join();
        }
        /*
        //The problem does not occur when tasks are joined in the reverse order, i.e.
        for(int p = n-1; p >= 0; p--){
            T[p].join();
        }
        */
    }
}


public static class DummyTask extends RecursiveAction {
    // performs some dummy work

    final int N = 10000000;

    // each task updates its own field, so the hot data stays in (distributed)
    // per-core cache and memory-bus contention between workers is avoided
    double val = 1;

    @Override
    protected void compute() {
        for(int j = 0; j < N; j++){
            if(val < 11){
                val *= 1.1;
            }else{
                val = 1;
            }
        }
    }
}

I got these results on a processor with 4 physical and 8 logical cores, using Java 8 (jre1.8.0_45):

T1: 11730
T2: 2381 (speedup: 4,93)
T4: 2463 (speedup: 4,76)
T8: 2418 (speedup: 4,85)

Whereas with Java 7 (jre1.7.0), I get:

T1: 11938
T2: 11843 (speedup: 1,01)
T4: 5133 (speedup: 2,33)
T8: 2607 (speedup: 4,58)

(where TP is the execution time in ms, using parallelism level P)

Both results surprise me. The latter I can understand: the join causes one worker (the one executing the loop) to block, because it fails to recognize that, while waiting, it could process other pending dummy tasks from its local queue. The former, however, has me puzzled.
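Incidentally, the commented-out reverse-order join is essentially what ForkJoinTask.invokeAll does, at least in the JDK 7/8 implementations: the most recently forked tasks sit at the top of the worker's local deque and are joined first, so the joining worker can pop and run them itself instead of stalling. A sketch of compute() rewritten that way (illustrative, not my original code; assumes java.util.List/ArrayList and ForkJoinTask are imported):

@Override
protected void compute() {
    List<DummyTask> tasks = new ArrayList<>();
    for (int p = 0; p < n; p++) {
        tasks.add(new DummyTask());
    }
    // forks all tasks and joins them on our behalf, in an order that lets
    // the calling worker keep executing tasks from its own deque
    ForkJoinTask.invokeAll(tasks);
}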

BTW: when counting the number of started but not yet completed dummy tasks, I found that up to 24 such tasks existed at some point in time in a pool with parallelism 2...?
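(For illustration, such a count can be taken with instrumentation roughly like this sketch inside DummyTask; the field names and high-water-mark logic are illustrative, not my exact code, and accumulateAndGet needs Java 8:)

// requires java.util.concurrent.atomic.AtomicInteger
static final AtomicInteger active = new AtomicInteger();    // tasks currently inside compute()
static final AtomicInteger maxActive = new AtomicInteger(); // high-water mark

@Override
protected void compute() {
    int now = active.incrementAndGet();
    maxActive.accumulateAndGet(now, Math::max);
    try {
        // ... dummy work as above ...
    } finally {
        active.decrementAndGet();
    }
}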

EDIT:

I benchmarked the application above using JMH (jdk1.8.0_45) with options -bm avgt -f 1 (i.e. 1 fork, 20+20 iterations). The results are below.
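(For reference, a minimal JMH harness consistent with those options; the class name, parameterization and lifecycle methods are illustrative assumptions, not my exact benchmark code, and it relies on the classes shown above:)

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 20)
@Measurement(iterations = 20)
public class ParallelLoopBenchmark {

    // parallelism level P, matching the runs reported below
    @Param({"1", "2", "4", "8"})
    public int parallelism;

    ForkJoinPool pool;

    @Setup(Level.Trial)
    public void setUp() {
        pool = new ForkJoinPool(parallelism);
    }

    @TearDown(Level.Trial)
    public void tearDown() {
        pool.shutdown();
    }

    @Benchmark
    public void parallelLoop() {
        // a fresh task per invocation: ForkJoinTasks are not reusable once completed
        pool.invoke(new ParallelLoopTask());
    }
}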

T1: 11,664

11,664 ±(99.9%) 0,044 s/op [Average]
(min, avg, max) = (11,597, 11,664, 11,810), stdev = 0,050
CI (99.9%): [11,620, 11,708] (assumes normal distribution)

T2: 4,134 (speedup: 2,82)

4,134 ±(99.9%) 0,787 s/op [Average]
(min, avg, max) = (3,045, 4,134, 5,376), stdev = 0,906
CI (99.9%): [3,348, 4,921] (assumes normal distribution)

T4: 2,972 (speedup: 3,92)

2,972 ±(99.9%) 0,212 s/op [Average]
(min, avg, max) = (2,375, 2,972, 3,200), stdev = 0,245
CI (99.9%): [2,759, 3,184] (assumes normal distribution)

T8: 2,845 (speedup: 4,10)

2,845 ±(99.9%) 0,306 s/op [Average]
(min, avg, max) = (2,277, 2,845, 3,310), stdev = 0,352
CI (99.9%): [2,540, 3,151] (assumes normal distribution)

At first sight, these scalability results look closer to what one would expect, i.e. T1 > T2 > T4 ≈ T8. However, what still bugs me is the following:

  1. The difference for T2 between Java 7 and Java 8. One explanation would be that in Java 8 the worker executing the parallel loop does not go idle, but instead finds other work to perform.
  2. The super-linear speedup (≈3x) with 2 workers. Also note that T2 seems to increase with every iteration (see the P=2 log below; the same holds, to a smaller extent, for P=4 and P=8). The times in the first warmup iterations are similar to those mentioned above. Maybe the warmup period should be longer, but still, isn't it strange that the execution time increases? I'd rather expect it to decrease.
  3. Finally, I still find it curious that there are far more started-but-not-yet-completed dummy tasks than worker threads.

The JMH log for P=2:

Run progress: 0,00% complete, ETA 00:00:40
Fork: 1 of 1
Warmup Iteration   1: 2,365 s/op
Warmup Iteration   2: 2,341 s/op
Warmup Iteration   3: 2,393 s/op
Warmup Iteration   4: 2,323 s/op
Warmup Iteration   5: 2,925 s/op
Warmup Iteration   6: 3,040 s/op
Warmup Iteration   7: 2,304 s/op
Warmup Iteration   8: 2,347 s/op
Warmup Iteration   9: 2,939 s/op
Warmup Iteration  10: 3,083 s/op
Warmup Iteration  11: 3,004 s/op
Warmup Iteration  12: 2,327 s/op
Warmup Iteration  13: 3,083 s/op
Warmup Iteration  14: 3,229 s/op
Warmup Iteration  15: 3,076 s/op
Warmup Iteration  16: 2,325 s/op
Warmup Iteration  17: 2,993 s/op
Warmup Iteration  18: 3,112 s/op
Warmup Iteration  19: 3,074 s/op
Warmup Iteration  20: 2,354 s/op
Iteration   1: 3,045 s/op
Iteration   2: 3,094 s/op
Iteration   3: 3,113 s/op
Iteration   4: 3,057 s/op
Iteration   5: 3,050 s/op
Iteration   6: 3,106 s/op
Iteration   7: 3,080 s/op
Iteration   8: 3,370 s/op
Iteration   9: 4,482 s/op
Iteration  10: 4,325 s/op
Iteration  11: 5,002 s/op
Iteration  12: 4,980 s/op
Iteration  13: 5,121 s/op
Iteration  14: 4,310 s/op
Iteration  15: 5,146 s/op
Iteration  16: 5,376 s/op
Iteration  17: 4,810 s/op
Iteration  18: 4,320 s/op
Iteration  19: 5,249 s/op
Iteration  20: 4,654 s/op
– Codiloo
  • possible duplicate of [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – Raedwald Jul 07 '15 at 15:08
  • This question is not about how to perform benchmarks in Java. It is about a curious observation using Java FJ. It might eventually turn out to be caused by poor benchmarking, but currently that is not a given. Also, the code shown is a simplified version. I'm testing an actual application (different numbers, very similar observations), so it's not a micro-benchmark. – Codiloo Jul 07 '15 at 15:19
  • Except we do not know how you performed your benchmark, and there is the suspicion that, in fact, you did not do so correctly. – Raedwald Jul 07 '15 at 15:24
  • I definitely didn't do it correctly. Nonetheless, this doesn't mean the observation is wrong. The question might be missing information (which I'll rectify asap), but it is not a duplicate of the question you reference above. – Codiloo Jul 07 '15 at 15:28
  • In fact numerous questions asking about strange benchmarking results have been closed as duplicates of that _canonical question_ because they were not performed correctly. – Raedwald Jul 07 '15 at 15:30

1 Answer


There is nothing in your example showing how you did this benchmark. It looks like you just took a millisecond timestamp at the beginning and end of the run. That is not accurate. I suggest you take a look at this SO answer and re-post your timings. BTW, the JMH benchmark harness is going to be the standard in Java 9, so that is what you should be using.
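(Presumably something like the following, which measures a single cold run; I'm guessing, since the question doesn't show the timing code:)

long start = System.currentTimeMillis();
pool.invoke(new ParallelLoopTask());
// single-shot wall-clock measurement: no warmup, no statistics
System.out.println((System.currentTimeMillis() - start) + " ms");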

EDIT:

You admit that the scalability results are now what you expected, but say you're still not happy with them. Now it's time to look inside the code.

There are serious problems with this framework. I've been writing a critique about it since 2010. As I point out here, join doesn't work. The author has tried various means to get around the problem, but the problem persists.

Increase your run time to about a minute (n=100000000), or put some heavy computation in compute(). Now profile the application in VisualVM or another profiler. This will show you stalling threads, excessive threads, etc.
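(Assuming that refers to the inner-loop bound N in DummyTask rather than the task count n, it's a one-line change:)

// illustrative: more per-task work so a run lasts long enough to profile
final int N = 100000000; // was 10000000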

If that doesn't help answer your questions, then you should look at the code flow with a debugger. Profiling/code analysis is the only way you are going to get satisfactory answers.

– edharned
  • The timings mentioned above are rather naive wall-clock times (measured as the difference in time right before and after execution). Also, no warmup was performed. I realize these are poor benchmarking practices. Nonetheless, the time differences illustrated in my question are rather large and can be made arbitrarily large by increasing n and N (the work). I'll look into JMH and update the question as soon as I find the time. – Codiloo Jul 07 '15 at 15:11
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – Gimby Jul 07 '15 at 15:30
  • @Gimby Without a good benchmark it isn't worth the effort to go through the code (F/J Pool, F/J Task, etc.) line by line looking for why. See the link I supplied and the comments on the original question. – edharned Jul 07 '15 at 18:43
  • @edharned I don't see how that is any kind of argument against the standard review-vote comment you are replying to. It's not an answer; at best a comment on the question. – Gimby Jul 07 '15 at 21:42
  • @Gimby OK, that's your opinion. The OP says the time differences are large, and I say you cannot judge small/large without doing a warm-up (compiling the code, etc.). If he were doing a simple compare of A to B, then wall-clock time is reasonable. But here he's doing multiple threads (which add overhead themselves) within the same structure. IMHO that requires a proper benchmark for me to look at the internal code. – edharned Jul 07 '15 at 22:21
  • Today I had some time and tried to get JMH working in Maven, but I got stuck... Could you have a look at [this](http://stackoverflow.com/questions/31292059/strange-output-when-using-jmh?noredirect=1#comment50574993_31292059)? – Codiloo Jul 08 '15 at 18:54