
Our application requires all worker threads to synchronize at a defined point. For this we use a CyclicBarrier, but it does not seem to scale well. With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)

EDIT: Synchronization happens very frequently, on the order of 100k to 1M times.

If synchronization of many threads is "hard", would it help to build a synchronization tree? Thread 1 waits for threads 2 and 3, which in turn wait for 4+5 and 6+7, respectively, etc.; after finishing, threads 2 and 3 wait for thread 1, threads 4 and 5 wait for thread 2, etc.

1
| \
2   3
|\  |\
4 5 6 7

Would such a setup reduce synchronization overhead? I'd appreciate any advice.
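For illustration, here is a minimal sketch of the tree I have in mind, built from small per-node `CyclicBarrier`s (the class name and the fixed 7-thread layout are just for this example; it covers a single synchronization round only, and a reusable version would also need the downward release phase described above):

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class BarrierTree {
    static final AtomicInteger arrived = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        // One small barrier per inner node of the tree:
        // the root gathers threads 1, 2, 3; the two subtrees
        // gather threads 2, 4, 5 and threads 3, 6, 7.
        final CyclicBarrier root = new CyclicBarrier(3);
        final CyclicBarrier left = new CyclicBarrier(3);
        final CyclicBarrier right = new CyclicBarrier(3);

        Thread[] threads = new Thread[7];
        for (int i = 1; i <= 7; i++) {
            final int id = i;
            threads[i - 1] = new Thread(new Runnable() {
                public void run() {
                    try {
                        arrived.incrementAndGet();
                        // Inner nodes first gather their subtree,
                        // then synchronize with their parent.
                        if (id == 2 || id == 4 || id == 5) left.await();
                        if (id == 3 || id == 6 || id == 7) right.await();
                        if (id <= 3) root.await();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads[i - 1].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("all " + arrived.get() + " threads synchronized");
    }
}
```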

See also this featured question: What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

krlmlr

4 Answers


With more than eight threads, the synchronization overhead seems to outweigh the benefits of multithreading. (However, I cannot support this with measurement data.)

Honestly, there's your problem right there. Figure out a performance benchmark and prove that this is the problem, or risk spending hours or days solving the entirely wrong problem.
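For example, a rough sketch of such a measurement might look like this (class and method names are made up, and for serious results you'd want warmup iterations and a proper benchmarking harness rather than a raw loop):

```java
import java.util.concurrent.CyclicBarrier;

public class BarrierBench {
    // Times 'rounds' barrier synchronizations across 'threads' threads,
    // with no work between rounds, so it measures pure sync overhead.
    public static long timeRounds(final int threads, final int rounds)
            throws InterruptedException {
        final CyclicBarrier barrier = new CyclicBarrier(threads);
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int r = 0; r < rounds; r++) {
                            barrier.await();
                        }
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int threads : new int[] {2, 4, 8}) {
            long nanos = timeRounds(threads, 10000);
            System.out.println(threads + " threads: "
                    + nanos / 10000 + " ns/round");
        }
    }
}
```

Comparing ns/round at 2, 4, and 8 threads would at least tell you whether the barrier itself is where the time goes.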

Steven Schlansker

You are thinking about the problem in a subtly wrong way that tends to lead to very bad coding. You don't want to wait for threads, you want to wait for work to be completed.

Probably the most efficient way is a shared, waitable counter. When you make new work, increment the counter and signal the counter. When you complete work, decrement the counter. If there is no work to do, wait on the counter. If you drop the counter to zero, check if you can make new work.
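A minimal sketch of such a counter, using plain intrinsic locking (class and method names are made up; in Java 7, `java.util.concurrent.Phaser` provides essentially this out of the box):

```java
// Counts outstanding units of work. The coordinator adds units,
// workers call done() as they finish, and awaitZero() blocks until
// the whole batch is complete.
public class WorkCounter {
    private int outstanding = 0;

    public synchronized void add(int units) {
        outstanding += units;
    }

    public synchronized void done() {
        if (--outstanding == 0) {
            notifyAll(); // wake anyone waiting for the batch to finish
        }
    }

    public synchronized void awaitZero() throws InterruptedException {
        while (outstanding != 0) {
            wait();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WorkCounter counter = new WorkCounter();
        counter.add(2);   // split the problem into 2 units
        counter.done();   // workers report completion
        counter.done();
        counter.awaitZero();
        System.out.println("batch complete");
    }
}
```

When the counter drops to zero, whoever was waiting can merge the results and split the next batch, exactly as described in the comments below.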

David Schwartz
  • No, this is not a producer-consumer problem. A big problem is split into small parts that are handled in parallel. Occasionally, all subproblems have to be merged and reshuffled. How does the shared waitable counter help? – krlmlr Sep 27 '12 at 22:46
  • When you split the problem in 8 parts, you set the counter to 8 (there are now 8 units of work in progress). As you finish each part, you decrement the counter. When the counter hits zero, you split the next part of the problem. – David Schwartz Sep 27 '12 at 22:47
  • Fair enough. But isn't the `CyclicBarrier` supposed to do exactly this? I've looked at the code, it's quite non-trivial. Still (or perhaps just because of that?), performance is weak. – krlmlr Sep 27 '12 at 22:56
  • 1
    `CyclicBarrier` synchronizes *threads*. You actually want to wait for work to be completed. You don't, in principle, care about what threads happens to complete it. Re-read my first paragraph and see if you can get your mind to switch to a different way of looking at the problem. (If you are waiting for a package to arrive at your door, don't wait for the postman who delivers it, wait for the package. And don't make the postman wait until you notice that the package is there. Once he delivers the package, he can go on -- at least in principle, there may be nothing for him to do.) – David Schwartz Sep 27 '12 at 23:00
  • I think I get your idea. So, synchronizing *work packages* is much cheaper than synchronizing *threads*? Why? – krlmlr Sep 27 '12 at 23:07
  • @user946850 because signaling the completion of a work package does not mean terminating the thread/s. Read carefully what David has posted. Collect all the thread primers/tutorias/books/bookmarked sites that mention the word 'Join' on the first page and burn/delete them now, before the infection spreads. – Martin James Sep 27 '12 at 23:18
  • @MartinJames: I understand that I don't need to *terminate* a thread to wait for the completion of a *task* it is working on. But how does `CyclicBarrier` terminate the thread? – krlmlr Sep 27 '12 at 23:25
  • @user946850, Yeah, you're pretty much right here. This answer kind of tanks with the second paragraph. It is true that in essence you can think of the work being done as separate from the worker which does it. However, it doesn't make sense to replace a CyclicBarrier with a counter and some wait/notify messages. Actually, I'd guess the CyclicBarrier would be much less error prone. – Tim Bender Sep 28 '12 at 00:27
  • @TimBender: A CyclicBarrier forces him to wait for threads when he really wants to wait for work. It compels the broken design. – David Schwartz Sep 28 '12 at 03:04

If I understand correctly, you're trying to break your solution up into parts and solve them separately, but concurrently, right? Then have your current thread wait for those tasks? You want to use something like a fork/join pattern.

List<CustomThread> threads = new ArrayList<CustomThread>();
for (Something something : somethings) {
    threads.add(new CustomThread(something));
}
for (CustomThread thread : threads) {
    thread.start();
}
for (CustomThread thread : threads) {
    thread.join(); // Blocks until thread is complete
}
List<Result> results = new ArrayList<Result>();
for (CustomThread thread : threads) {
    results.add(thread.getResult());
}
// do something with results.

In Java 7, there's even further support via a fork/join pool. See ForkJoinPool and its trail, and use Google to find one of many other tutorials.

You can recurse on this concept to get the tree you want, just have the threads you create generate more threads in the exact same way.
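As an illustration of the Java 7 fork/join pattern mentioned above, here is a sketch that sums an array by splitting it recursively (the class name, threshold, and array contents are made up for the example):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1000;
    private final int[] data;
    private final int from, to;

    public SumTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                     // run left half asynchronously
        long rightSum = right.compute(); // do the right half ourselves
        return left.join() + rightSum;   // wait for the left half, combine
    }

    public static void main(String[] args) {
        int[] data = new int[100000];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        long sum = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
        System.out.println("sum = " + sum);
    }
}
```

The pool's worker threads are reused across the whole recursion, so the tree of tasks never maps to a tree of freshly created threads.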


Edit: I was under the impression that you wouldn't be creating that many threads, so this is better for your scenario. The example won't be horribly short, but it goes along the same vein as the discussion you're having in the other answer, that you can wait on jobs, not threads.

First, you need a Callable for your sub-jobs that takes an Input and returns a Result:

public class SubJob implements Callable<Result> {
    private final Input input;

    public SubJob(Input input) {
        this.input = input;
    }

    public Result call() {
        // Actually process input here and return a result
        return JobWorker.processInput(input);
    }
}

Then to use it, create an ExecutorService with a fixed-size thread pool. This will limit the number of jobs you're running concurrently so you don't accidentally thread-bomb your system. Here's your main job:

public class MainJob extends Thread {

    // Adjust the pool to the appropriate number of concurrent
    // threads you want running at the same time
    private static final ExecutorService pool = Executors.newFixedThreadPool(30);
    private final List<Input> inputs;

    public MainJob(List<Input> inputs) {
        super("MainJob");
        this.inputs = new ArrayList<Input>(inputs);
    }

    public void run() {
        CompletionService<Result> compService = new ExecutorCompletionService<Result>(pool);
        List<Result> results = new ArrayList<Result>();
        int submittedJobs = inputs.size();
        for (Input input : inputs) {
            // Starts the job when a thread is available
            compService.submit(new SubJob(input));
        }
        try {
            for (int i = 0; i < submittedJobs; i++) {
                // Blocks until a job is completed
                results.add(compService.take().get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
        // Do something with results
    }
}

This will allow you to reuse threads instead of generating a bunch of new ones every time you want to run a job. The completion service will do the blocking while it waits for jobs to complete. Also note that the results list will be in order of completion.

You can also use Executors.newCachedThreadPool, which creates a pool with no upper limit (like using Integer.MAX_VALUE). It will reuse threads if one is available and create a new one if all the threads in the pool are running a job. This may be desirable later if you start encountering deadlocks (because there are so many jobs in the fixed thread pool waiting that sub jobs can't run and complete). This will at least limit the number of threads you're creating/destroying.

Lastly, you'll need to shut down the ExecutorService manually, perhaps via a shutdown hook, or the threads that it contains will not allow the JVM to terminate.
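An orderly shutdown once all work has been submitted might look like this (the class name and the 10-second timeout are arbitrary choices for the sketch; a shutdown hook only fires on System.exit or an external signal, so an explicit shutdown like this is usually needed as well):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolShutdown {
    public static boolean runAndShutdown() throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(new Runnable() {
            public void run() {
                System.out.println("job done");
            }
        });
        pool.shutdown(); // stop accepting new jobs; running jobs finish
        // Wait for running jobs, force-cancel any stragglers
        if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
            pool.shutdownNow();
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated cleanly: " + runAndShutdown());
    }
}
```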

Hope that helps/makes sense.

Brian
  • Yes, except that `start` and `join` happen very frequently. And we're on Java 6. What are the options here? – krlmlr Sep 27 '12 at 22:58
  • Maybe you could test the performance of combining an `ExecutorService` with `ExecutorCompletionService`, that way you can reuse threads and control the number of concurrent jobs? If you want, I'll add some code for that to my answer, but I've never load tested this before. I'd be curious to know if it's any faster. – Brian Sep 27 '12 at 23:03
  • Any option where start and join do not happen frequently would be a better one. Please, burn those books! – Martin James Sep 27 '12 at 23:19
  • @MartinJames Fair enough :) I thought the number of sub jobs wouldn't be horrible enough to warrant a thread pool, but I stand corrected. – Brian Sep 27 '12 at 23:37
  • Thank you, this makes sense to me. Can the `submit` queue overflow so that a deadlock condition occurs? – krlmlr Sep 27 '12 at 23:45
  • No, the queue is unbounded (I think it defaults to LinkedBlockingQueue). The deadlock occurs if the pool is too full to run sub-sub-jobs that sub-jobs are waiting on. – Brian Sep 28 '12 at 00:17
  • Just to clarify, if you have a pool of size 30 and you have submitted 30 sub jobs, then those subjobs all submit their own jobs and wait on them, the sub-sub-jobs won't have open threads to run on, so the jobs that created them never finish and the system freezes up. – Brian Sep 28 '12 at 00:37

If you have a generation task (like the example of processing columns of a matrix) then you may be stuck with a CyclicBarrier. That is to say, if every single piece of work for generation 1 must be done in order to process any work for generation 2, then the best you can do is to wait for that condition to be met.

If there are thousands of tasks in each generation, then it may be better to submit all of those tasks to an ExecutorService (ExecutorService.invokeAll) and simply wait for the results to return before proceeding to the next step. The advantage of doing this is eliminating context switching and wasted time/memory from allocating hundreds of threads when the number of physical CPUs is bounded.
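A sketch of that generation-per-invokeAll idea (class name and the placeholder task body are made up): submit one generation's tasks, block until all of them are done, then build the next generation from the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Generations {
    public static List<Integer> runGeneration(ExecutorService pool,
            List<Integer> inputs) throws InterruptedException, ExecutionException {
        List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
        for (final Integer input : inputs) {
            tasks.add(new Callable<Integer>() {
                public Integer call() {
                    return input * 2; // placeholder for the real work
                }
            });
        }
        // invokeAll blocks until every task of this generation is complete
        List<Future<Integer>> futures = pool.invokeAll(tasks);
        List<Integer> results = new ArrayList<Integer>();
        for (Future<Integer> f : futures) {
            results.add(f.get());
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Integer> gen = Arrays.asList(1, 2, 3);
        for (int i = 0; i < 2; i++) {
            // merging/reshuffling between generations would go here
            gen = runGeneration(pool, gen);
        }
        System.out.println(gen);
        pool.shutdown();
    }
}
```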

If your tasks are not generational but instead more of a tree-like structure in which only a subset need to be complete before the next step can occur on that subset, then you might want to consider a ForkJoinPool, and you don't need Java 7 to do that. You can get a reference implementation for Java 6 from the JSR 166 project (the jsr166y package), where the ForkJoinPool library code was developed.

I also have another answer which provides a rough implementation in Java 6:

public class Fib implements Callable<Integer> {
    int n;
    Executor exec;

    Fib(final int n, final Executor exec) {
        this.n = n;
        this.exec = exec;
    }

    /**
     * {@inheritDoc}
     */
    @Override
    public Integer call() throws Exception {
        if (n == 0 || n == 1) {
            return n;
        }

        //Divide the problem
        final Fib n1 = new Fib(n - 1, exec);
        final Fib n2 = new Fib(n - 2, exec);

        //FutureTask only allows run to complete once
        final FutureTask<Integer> n2Task = new FutureTask<Integer>(n2);
        //Ask the Executor for help
        exec.execute(n2Task);

        //Do half the work ourselves
        final int partialResult = n1.call();

        //Do the other half of the work if the Executor hasn't
        n2Task.run();

        //Return the combined result
        return partialResult + n2Task.get();
    }

}    

Keep in mind that if you have divided the tasks up too much and the unit of work being done by each thread is too small, there will be negative performance impacts. For example, the above code is a terribly slow way to solve Fibonacci.

Tim Bender
  • Note that following the pattern outlined in the `call` method above prevents deadlocks regardless of the size of the pool of threads in the provided Executor. – Tim Bender Sep 28 '12 at 00:55