
I needed to optimize this piece of code today. It finds which of the first million starting numbers produces the longest sequence under a given function:

public static void main(String[] args) {
    int mostLen = 0;
    int mostInt = 0;
    long currTime = System.currentTimeMillis();
    for(int j=2; j<=1000000; j++) {
        long i = j;
        int len = 0;
        // walk the sequence until it reaches 1, counting the steps
        while((i=next(i)) != 1) {
            len++;
        }
        if(len > mostLen) {
            mostLen = len;
            mostInt = j;
        }
    }
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}


static long next(long i) {
    if(i%2==0) {
        return i/2;
    } else {
        return i*3+1;
    }
}

My mistake was to try to introduce multithreading:

void doSearch() throws ExecutionException, InterruptedException {
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    long currTime = System.currentTimeMillis();
    List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
    for (int j = 2; j <= 1000000; j++) {
        MyCallable worker = new MyCallable();
        worker.setBean(new ValueBean(j, 0));
        Future<ValueBean> f = executor.submit(worker);
        list.add(f);
    }
    // note: this measures only task submission, not completion
    System.out.println(System.currentTimeMillis() - currTime);

    int mostLen = 0;
    int mostInt = 0;
    for (Future<ValueBean> f : list) {
        final int len = f.get().getLen();
        if (len > mostLen) {
            mostLen = len;
            mostInt = f.get().getNum();
        }
    }
    executor.shutdown();
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

public class MyCallable implements Callable<ValueBean> {
    public ValueBean bean;

    public void setBean(ValueBean bean) {
        this.bean = bean;
    }

    public ValueBean call() throws Exception {
        long i = bean.getNum();
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return new ValueBean(bean.getNum(), len);
    }
}

public class ValueBean {
    int num;
    int len;

    public ValueBean(int num, int len) {
        this.num = num;
        this.len = len;
    }

    public int getNum() {
        return num;
    }

    public int getLen() {
        return len;
    }
}

long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

Unfortunately, the multithreaded version ran 5 times slower than the single-threaded one on 4 processors (cores).

Then I tried a somewhat cruder approach:

static int mostLen = 0;
static int mostInt = 0;

// synchronized: many worker threads call this concurrently
synchronized static void updateIfMore(int len, int intgr) {
    if (len > mostLen) {
        mostLen = len;
        mostInt = intgr;
    }
}

public static void main(String[] args) throws InterruptedException {
    long currTime = System.currentTimeMillis();
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    for (int i = 2; i <= 1000000; i++) {
        final int j = i;
        executor.execute(new Runnable() {
            public void run() {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                updateIfMore(len, j);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}


static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

and it ran much faster, but it was still slower than the single-threaded approach.

I hope it's not because I screwed up the way I'm doing multithreading, but rather that this particular calculation/algorithm is not a good fit for parallel computation. If I make the calculation more processor-intensive by replacing the method next with:

long next(long i) {
    // burn some CPU so that each call costs noticeably more
    Random r = new Random();
    for(int j=0; j<10; j++) {
        r.nextLong();
    }
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}

both multithreaded versions execute more than twice as fast as the single-threaded version on a 4-core machine.

So clearly there must be some threshold you can use to determine whether it is worth introducing multithreading, and my question is:

What is the basic rule that would help decide if a given calculation is intensive enough to be optimized by running it in parallel (without spending the effort to actually implement it)?

– Oleg Mikheev
  • This is only tangentially related to the question, but the algorithm in question is related to the [Collatz conjecture](http://en.wikipedia.org/wiki/Collatz_conjecture). It's more famous in geekdom thanks to [this](http://xkcd.com/710/) and [this](http://store.xkcd.com/xkcd/#CollatzConjecture). – Michael McGowan May 22 '12 at 05:27
  • I *highly* recommend the book [Java Concurrency in Practice](http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601) by Brian Goetz. – David Harkness May 22 '12 at 05:40

4 Answers


The key to implementing multithreading efficiently is making sure the overhead is not too high. There are no fixed rules, as the costs depend heavily on your hardware.

Starting and stopping threads has a high cost. You already use an executor service, which reduces these costs considerably because it reuses a pool of worker threads to execute your Runnables. However, each Runnable still comes with some overhead. Reducing the number of runnables, and increasing the amount of work each one does, will improve performance, but you still want enough runnables for the executor service to distribute them efficiently over the worker threads.

You have chosen to create one runnable for each starting value, so you end up creating 1,000,000 runnables. You would probably get much better results if you let each Runnable process a batch of, say, 1,000 start values, which means you only need 1,000 runnables, greatly reducing the overhead.
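
For illustration, a minimal sketch of that batching idea might look like this (the class name BatchCallable is mine, not from the answer; it assumes the executor, the static next method, and the ValueBean class from the question are in scope):

class BatchCallable implements Callable<ValueBean> {
    private final int from;
    private final int to; // exclusive

    BatchCallable(int from, int to) {
        this.from = from;
        this.to = to;
    }

    public ValueBean call() {
        // find the best result within this batch locally,
        // so only one small object crosses the thread boundary
        int bestLen = 0;
        int bestNum = from;
        for (int j = from; j < to; j++) {
            long i = j;
            int len = 0;
            while ((i = next(i)) != 1) {
                len++;
            }
            if (len > bestLen) {
                bestLen = len;
                bestNum = j;
            }
        }
        return new ValueBean(bestNum, bestLen);
    }
}

// submit 1,000 batches instead of 1,000,000 single-value tasks
List<Future<ValueBean>> futures = new ArrayList<Future<ValueBean>>();
for (int start = 2; start <= 1000000; start += 1000) {
    futures.add(executor.submit(
            new BatchCallable(start, Math.min(start + 1000, 1000001))));
}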

– Eelke
  • +1 for using batches as 1,000,000 tasks has a high overhead with too low a payoff (reducing the "lost productivity" due to threads having nothing to do). – David Harkness May 22 '12 at 05:38

"Will the performance gain be greater than the cost of context switching and thread creation?"

That is a very OS-, language-, and hardware-dependent cost; there are existing discussions of the cost in Java that give some numbers and some pointers on how to calculate it.
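
If you want a concrete number for your own setup, one rough approach (my sketch, not from the discussions mentioned above) is to submit a large number of no-op tasks and divide the elapsed time by the task count:

import java.util.concurrent.*;

public class TaskOverhead {
    public static void main(String[] args) throws InterruptedException {
        final int tasks = 1000000;
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            executor.execute(new Runnable() {
                public void run() {
                    // no work at all: anything measured is dispatch overhead
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(60, TimeUnit.SECONDS);
        System.out.println("~" + (System.nanoTime() - start) / tasks
                + " ns of overhead per task");
    }
}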

You also want one thread per CPU, or fewer, for CPU-intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.

– Daniel Pittman
  • +1 for one thread per CPU for CPU-heavy tasks, though you typically want one per CPU for the work plus one (the main thread) for coordination. – David Harkness May 22 '12 at 05:39
  • Also, see [this answer](http://stackoverflow.com/a/1980858/285873) for how to find the number of CPU cores available and other useful bits. – David Harkness May 22 '12 at 05:41

I think there is another component to this that you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later results depend on earlier ones. The dependence can be strong, in the sense of "I need the first value to compute the second value"; in that case the task is completely serial, and later values cannot be computed without waiting for earlier computations. It can also be weaker, in the sense of "if I had the first value, I could compute the second value faster"; in that case the cost of parallelization is that some work may be duplicated.

This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if the previous results are already at hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 just two iterations ago; if you saved that value of len, you can compute len(4) = 1 + len(2).

Using an array to store previously computed values of len, and a little bit of twiddling in the next method, you can complete the task more than 50x faster.
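
The answer does not include code, but a minimal sketch of the array-cache idea might look like this (names and details are mine; lengths[j] stores the question's len for start value j, and lengths[1] is seeded with -1 so that len(2) = 1 + lengths[1] = 0, matching the question's way of counting):

static final int LIMIT = 1000000;
// lengths[j] caches len for start value j, filled in increasing order of j
static final int[] lengths = new int[LIMIT + 1];

public static void main(String[] args) {
    long currTime = System.currentTimeMillis();
    int mostLen = 0;
    int mostInt = 0;
    lengths[1] = -1; // so that len(2) = 1 + lengths[1] = 0
    for (int j = 2; j <= LIMIT; j++) {
        long i = j;
        int len = 0;
        // walk only until the sequence drops below j;
        // every value below j is already in the cache
        while (i >= j) {
            i = next(i);
            len++;
        }
        len += lengths[(int) i]; // safe cast: here i < j <= LIMIT
        lengths[j] = len;
        if (len > mostLen) {
            mostLen = len;
            mostInt = j;
        }
    }
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

static long next(long i) {
    return (i % 2 == 0) ? i / 2 : i * 3 + 1;
}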

– Mike Zboray
  • Yep this runs 8 times faster than the 1000-batched multithreaded one! I wonder if I can multithread this one – Oleg Mikheev May 23 '12 at 02:11
  • @OlegMikheev It might be possible. I would look into `ConcurrentHashMap` so that I could build the cache without having to worry about locking. Although I think the array implementation is pretty fast because as soon as `i < j` you know it's in the cache, but a hash lookup could be a lot slower. If you can exploit additional mathematical properties of the `next` function, it is easy to show for a limit n that the j with maximum length must satisfy j > n / 2. This helps the multithreaded solution, but not the caching solution. Also a simple array cache can't scale to a limit > ~42,000,000. – Mike Zboray May 24 '12 at 06:27

Estimate the amount of work a thread can do without interacting with other threads (directly or via common data). If that piece of work completes in 1 microsecond or less, the overhead is too high and multithreading is of no use. If it takes 1 millisecond or more, multithreading should work well. If it is in between, experimental testing is required.
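
As a rough way to apply that rule of thumb without writing any threading code (my sketch; it reuses the question's static next method), time a sample of the work single-threaded and compare the average against those thresholds:

int samples = 10000;
long start = System.nanoTime();
for (int j = 2; j < 2 + samples; j++) {
    long i = j;
    while ((i = next(i)) != 1) {
        // walk the sequence, just like one unit of the real work
    }
}
long avgNanos = (System.nanoTime() - start) / samples;
System.out.println("avg " + avgNanos + " ns per start value");
// under ~1,000 ns per unit: batch units together or stay single-threaded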

– Alexei Kaigorodov