
First, and once more, thanks to all who have already answered my question. I am not a very experienced programmer and this is my first experience with multithreading.

I have put together an example that behaves much like my problem. I hope it makes the case easier to discuss.

import java.util.ArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class ThreadMeasuring {
    private static final int TASK_TIME = 1; // milliseconds

    private static class Batch implements Runnable {
        CountDownLatch countDown;

        public Batch(CountDownLatch countDown) {
            this.countDown = countDown;
        }

        @Override
        public void run() {
            // busy-wait for TASK_TIME milliseconds (1 ms = 1e6 ns)
            long t0 = System.nanoTime();
            long t = 0;
            while (t < TASK_TIME * 1e6) { t = System.nanoTime() - t0; }

            if (countDown != null) countDown.countDown();
        }
    }

    public static void main(String[] args) {
        ThreadFactory threadFactory = new ThreadFactory() {
            int counter = 1;

            @Override
            public Thread newThread(Runnable r) {
                return new Thread(r, "Executor thread " + (counter++));
            }
        };

        // The total work to be divided into tasks is fixed (problem dependent),
        // so increasing ntasks shortens each task proportionally. 4 is an
        // arbitrary example. These tasks are executed thousands of times, inside
        // a loop, alternating with serial processing that needs their results
        // and prepares the next ones.
        int ntasks = 4;
        int nthreads = 2;
        int ncores = Runtime.getRuntime().availableProcessors();
        if (nthreads < ncores) ncores = nthreads;

        Batch serial = new Batch(null);
        long serialTime = System.nanoTime();
        serial.run();
        serialTime = System.nanoTime() - serialTime;

        ExecutorService executor = Executors.newFixedThreadPool(nthreads, threadFactory);
        CountDownLatch countDown = new CountDownLatch(ntasks);

        ArrayList<Batch> batches = new ArrayList<Batch>();
        for (int i = 0; i < ntasks; i++) {
            batches.add(new Batch(countDown));
        }

        long start = System.nanoTime();
        for (Batch r : batches) {
            executor.execute(r);
        }

        // wait for all tasks to finish
        try {
            countDown.await();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        long tmeasured = System.nanoTime() - start;

        System.out.println("Task time= " + TASK_TIME + " ms");
        System.out.println("Number of tasks= " + ntasks);
        System.out.println("Number of threads= " + nthreads);
        System.out.println("Number of cores= " + ncores);
        System.out.println("Measured time= " + tmeasured);
        System.out.println("Theoretical serial time= " + TASK_TIME * 1000000 * ntasks);
        System.out.println("Theoretical parallel time= " + (TASK_TIME * 1000000 * ntasks) / ncores);
        System.out.println("Speedup= " + (serialTime * ntasks) / (double) tmeasured);

        executor.shutdown();
    }
}

Instead of doing the calculations, each batch just busy-waits for a given time. The program calculates the speedup, which in theory should always be 2 here, but can drop below 1 (an actual slowdown) if `TASK_TIME` is small.

My real calculations take at most 1 ms, and are usually faster. For a 1 ms task I find a small speedup of around 30%, but in practice, with my program, I see a slowdown.

The structure of this code is very similar to my program, so if you could help me to optimise the thread handling I would be very grateful.
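
One direction I am considering (an untested sketch, not something I have measured; the loop bounds and task body are placeholders): `ExecutorService.invokeAll` blocks until every task in the collection has finished, so it could replace the `CountDownLatch` bookkeeping while the pool is created only once and reused across iterations.

ExecutorService executor = Executors.newFixedThreadPool(nthreads);

List<Callable<Void>> group = new ArrayList<Callable<Void>>();
for (int i = 0; i < ntasks; i++) {
    group.add(new Callable<Void>() {
        public Void call() {
            // placeholder: solve one batch of linear systems here
            return null;
        }
    });
}

// the enclosing method must handle InterruptedException from invokeAll
for (int iter = 0; iter < 1000; iter++) { // placeholder for the main loop
    executor.invokeAll(group);            // returns when all tasks of the group are done
    // serial part: use the results, prepare the next group
}
executor.shutdown();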

Kind regards.

Below, the original question:

Hi.

I would like to use multithreading in my program, since I believe it could increase its efficiency considerably. Most of its running time is spent on independent calculations.

My program has thousands of independent calculations (several linear systems to solve), but they only happen at the same time in small groups of a few dozen or so. Each of these groups takes a few milliseconds to run. After one of these groups of calculations, the program has to run sequentially for a little while, and then I have to solve the linear systems again.

In other words, these independent linear systems sit inside a loop that iterates thousands of times, alternating with sequential calculations that depend on the previous results. My idea to speed the program up is to compute these independent calculations in parallel threads, dividing each group into as many batches of independent calculation as I have processors available. So, in principle, there would be no queuing at all.

I tried using FixedThreadPool and CachedThreadPool, and the result was even slower than serial processing. It seems to take too much time creating new threads each time I need to solve the batches.

Is there a better way to handle this problem? The pools I've used seem meant for cases where each task runs longer, rather than for thousands of very short-lived tasks...

Thanks! Best Regards!

ursoouindio
  • Would it be possible to post some code? If you use a fixed thread pool then it doesn't create threads over and over again (they are reused). – Jeff Foster Feb 15 '11 at 17:13
  • What platform are you running this on? Big difference between a multi-core server and a 5-year-old Blackberry. – MusiGenesis Feb 15 '11 at 17:17
  • @ursoouindio I have proposed a Producer/Consumer pattern with a blocking queue, please check out my answer for more details. – Kiril Feb 15 '11 at 17:35
  • @Jeff: I would have to create a similar example, since my code depends on several classes. I believe the thread creation problem is due to the alternation of parallel and sequential parts. Actually, the parallel part covers just a small part of the program's code. – ursoouindio Feb 15 '11 at 17:56
  • @MusiGenesis, I'm using Ubuntu Linux 10.04 on two machines: a Core 2 Duo T7250 and a Core 2 Quad Q6600. – ursoouindio Feb 15 '11 at 17:57

6 Answers


Thread pools don't create new threads over and over. That's why they're pools.

How many threads were you using and how many CPUs/cores do you have? What is the system load like (normally, when you execute them serially, and when you execute with the pool)? Is synchronization or any kind of locking involved?

Is the algorithm for the parallel execution exactly the same as the serial one? (Your description seems to suggest that the serial version was reusing some results from the previous iteration.)
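
For instance, a quick standalone check (a minimal sketch, not your code) that a fixed pool reuses its threads: with a pool of 2, the ten tasks below should print only two distinct thread names.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolReuseCheck {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 10; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    // every task runs on one of the same two pooled threads
                    System.out.println(Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
    }
}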

Konrad Garus
  • CachedThreadPools will stop unused threads which have been idle for 60 seconds, but as @Konrad suggests, a fixed thread pool keeps the threads it has created (whether used or not) until it is terminated. – Peter Lawrey Feb 15 '11 at 17:15
  • @Peter that is correct. I did not consider it worth mentioning as OP mentioned batches of calculations. I reckoned a batch is executed as fast as possible, so no way to be idle for 60s. – Konrad Garus Feb 15 '11 at 17:17
  • Thanks for the reply, @Konrad. Yes, it's not the pool's fault but probably how I manage the threads inside my code. Thousands of times I call `executor.execute(r)`, for each parallel batch r (and executor was declared only once: `executor = Executors.newFixedThreadPool(numberOfCores);`). As I understood it, they become new threads (please correct me if I'm wrong). I'm running Ubuntu 10.04 on two machines, one a dual core and the other a quad core. I use the number of threads = the number of cores. – ursoouindio Feb 15 '11 at 17:33
  • About the system load: testing right now, I got both cores at around 70% with multithreading and just one at around 90% on a serial run. I did 3 test runs; they vary a little. There is no need for synchronisation or locking of any kind, except that at the end of a group of batches the multithreading has to hold until the next opportunity, and the sequential serial calculation needs the batches completed. – ursoouindio Feb 15 '11 at 17:34
  • How about using a higher number of threads? Like numberOfProcessors+2? Are you instantiating thousands of `Runnables` or `Callables` for it? – Konrad Garus Feb 15 '11 at 17:37
  • I tested using more than the number of cores, but at the time it didn't show improvements. I do not instantiate all the Runnables at once; I do this inside the main loop. – ursoouindio Feb 15 '11 at 17:40
  • Can you tell us more about the nature of the algorithm? Does it use any IO? Is any kind of locking or shared resources involved? How much thread-local memory are you using? – Konrad Garus Feb 15 '11 at 17:43
  • Also, if your tasks are extremely small, the overhead of multithreading etc. can outweigh the benefits. How many batches are there and how long does each batch take? – Konrad Garus Feb 15 '11 at 17:50
  • @Konrad, it does not use IO or shared resources; it's about the integration of a set of algebraic-differential equations. I am trying to exploit exactly the benefits of its independent linear systems. I just give the initial conditions and the program runs by itself. – ursoouindio Feb 15 '11 at 18:02
  • About the overhead, maybe that's the case, although I expected it to increase efficiency. About the number of batches, it depends on some factors, but right now around 190 linear systems were called 6850 times, to be split over 2 or 4 cores, for 1,312,328 linear solves in total. – ursoouindio Feb 15 '11 at 18:11
  • In the debugging environment it took 50 ms to solve one of these systems. – ursoouindio Feb 15 '11 at 18:16
  • Sorry, the time per linear system is much faster, about 0.5 to 1 ms. – ursoouindio Feb 16 '11 at 13:04

I'm not sure how you perform the calculations, but if you're breaking them up into small groups, then your application might be ripe for the Producer/Consumer pattern.

Additionally, you might be interested in using a BlockingQueue. The calculation consumers will block until there is something in the queue and the block occurs on the take() call.

// Batch is the task from the question, extended with a latch accessor.
private static class Batch implements Runnable {
    CountDownLatch countDown;

    public Batch(CountDownLatch countDown) {
        this.countDown = countDown;
    }

    CountDownLatch getLatch() {
        return countDown;
    }

    @Override
    public void run() {
        long t0 = System.nanoTime();
        long t = 0;
        while (t < TASK_TIME * 1e6) { t = System.nanoTime() - t0; }

        if (countDown != null) countDown.countDown();
    }
}

class CalcProducer implements Runnable {
    private final BlockingQueue<Batch> queue;
    private static final int ntasks = 4; // placeholder, as in the question

    CalcProducer(BlockingQueue<Batch> q) { queue = q; }

    public void run() {
        try {
            while (true) {
                CountDownLatch latch = new CountDownLatch(ntasks);
                for (int i = 0; i < ntasks; i++) {
                    queue.put(produce(latch));
                }
                // no need to wait for the latch here; only consumers wait
            }
        } catch (InterruptedException ex) { /* ... handle ... */ }
    }

    Batch produce(CountDownLatch latch) {
        return new Batch(latch);
    }
}

class CalcConsumer implements Runnable {
    private final BlockingQueue<Batch> queue;

    CalcConsumer(BlockingQueue<Batch> q) { queue = q; }

    public void run() {
        try {
            while (true) { consume(queue.take()); }
        } catch (InterruptedException ex) { /* ... handle ... */ }
    }

    void consume(Batch batch) throws InterruptedException {
        batch.run();
        batch.getLatch().await();
    }
}

class Setup {
    public static void main(String[] args) {
        BlockingQueue<Batch> q = new LinkedBlockingQueue<Batch>();
        int numConsumers = 4;

        CalcProducer p = new CalcProducer(q);
        Thread producerThread = new Thread(p);
        producerThread.start();

        Thread[] consumerThreads = new Thread[numConsumers];

        for (int i = 0; i < numConsumers; i++) {
            consumerThreads[i] = new Thread(new CalcConsumer(q));
            consumerThreads[i].start();
        }
    }
}

Sorry if there are any syntax errors; I've been chomping away at C# code and sometimes I forget the proper Java syntax, but the general idea is there.

Kiril
  • +1 A much more elaborate (and probably much more correct) implementation than my answer above. – corsiKa Feb 15 '11 at 18:00
  • Thanks, @Lirik! I will take the time to understand your idea. – ursoouindio Feb 15 '11 at 18:57
  • @ursoouindio let me know if something seems confusing... the basic idea is similar to that of a fast food restaurant: you have a line (queue) of customers and you have multiple cashiers. The cashiers are idle until somebody queues up, then the next available cashier calls a customer off the queue and serves them. The customers are the producers and the cashiers are the consumers. – Kiril Feb 15 '11 at 19:11
  • I follow that, @Lirik. I just missed the CalcGroup class, but I believe it's not the important part of the idea. I have updated the question with some code; can you help me apply your idea to it? (if you think it would be effective) – ursoouindio Feb 16 '11 at 11:45
  • @ursoouindio I didn't know that you called the group of calculations a `Batch` (fine name btw) so I used a dummy name `CalcGroup`, so `CalcGroup` gets replaced by `Batch`. I've updated my answer with that in mind: note that while this would work sufficiently well, waiting on a `CountDownLatch` would not be the optimal way to use this pattern. – Kiril Feb 16 '11 at 16:43
  • @Lirik, thanks for the help, but I'm afraid I cannot benefit from queuing, since I always arrange to have the same number of batches as available processing cores. Splitting my calculation duty into more batches actually slowed down my program. But I'll keep this idea in mind, thank you very much. – ursoouindio Feb 17 '11 at 11:12

If you have a problem which does not scale to multiple cores, you need to change your program or you have a problem which is not as parallel as you think. I suspect you have some other type of bug, but cannot say based on the information given.

This test code might help.

Time per million tasks 765 ms

code

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SubmitOverheadTest {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService es = Executors.newFixedThreadPool(4);
        Runnable task = new Runnable() {
            @Override
            public void run() {
                // do nothing: only the submission overhead is measured
            }
        };
        long start = System.nanoTime();
        for (int i = 0; i < 1000 * 1000; i++) {
            es.submit(task);
        }
        es.shutdown();
        es.awaitTermination(10, TimeUnit.SECONDS);
        long time = System.nanoTime() - start;
        System.out.println("Time per million tasks " + time / 1000 / 1000 + " ms");
    }
}

EDIT: Say you have a loop which serially does this.

for(int i=0;i<1000*1000;i++)
    doWork(i);

You might assume that changing to a loop like this would be faster, but the problem is that the overhead can be greater than the gain.

for (int i = 0; i < 1000 * 1000; i++) {
    final int i2 = i;
    ex.execute(new Runnable() {
        public void run() {
            doWork(i2);
        }
    });
}

So you need to create batches of work (at least one per thread) so there are enough tasks to keep all the threads busy, but not so many tasks that your threads are spending time in overhead.

final int batchSize = 10 * 1000;
for (int i = 0; i < 1000 * 1000; i += batchSize) {
    final int i2 = i;
    ex.execute(new Runnable() {
        public void run() {
            for (int i3 = i2; i3 < i2 + batchSize; i3++)
                doWork(i3);
        }
    });
}

EDIT2: Running a test which copies data between threads:

for (int i = 0; i < 20; i++) {
    ExecutorService es = Executors.newFixedThreadPool(1);
    final double[] d = new double[4 * 1024];
    Arrays.fill(d, 1);
    final double[] d2 = new double[4 * 1024];
    es.submit(new Runnable() {
        @Override
        public void run() {
            // nothing.
        }
    }).get();
    long start = System.nanoTime();
    es.submit(new Runnable() {
        @Override
        public void run() {
            synchronized (d) {
                System.arraycopy(d, 0, d2, 0, d.length);
            }
        }
    });
    es.shutdown();
    es.awaitTermination(10, TimeUnit.SECONDS);
    // touch the values in d2 so the copy cannot be optimised away
    for (double x : d2) ;
    long time = System.nanoTime() - start;
    System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}

It starts badly but warms up to ~50 us:

Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
 ... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
Peter Lawrey
  • @Peter, what sort of other bug do you think it could be? If I can't find a solution based on the descriptive information I gave, I will try to come up with a representative example. – ursoouindio Feb 15 '11 at 18:27
  • Result for `ExecutorService es = Executors.newFixedThreadPool(1);`: `Time per million tasks 1435 ms` – ursoouindio Feb 15 '11 at 18:47
  • Result for `ExecutorService es = Executors.newFixedThreadPool(4);`: `Time per million tasks 1561 ms` – ursoouindio Feb 15 '11 at 18:48
  • * on a quad core, Ubuntu Linux 10.04. Tried 3 times and took the smallest value. – ursoouindio Feb 15 '11 at 18:48
  • These are exactly the kind of results I'm getting: the multithreading is a bit slower than the serial version. – ursoouindio Feb 15 '11 at 18:49
  • So the overhead of passing a task to another thread is about 1.5 us on your machine. If your task takes less than 1.5 us, it's not worth passing to another thread. What you can do is group tasks together. Have each task do 1000x as much as it does now; then the overhead will be trivial in comparison. – Peter Lawrey Feb 15 '11 at 19:37
  • I read "us" as micro seconds, correct me if it I'm wrong. I'm getting convinced on the victory of overhead. I just can't add tasks in larger groups, I've already gathered the greater possible independent groups, in a similar fashion as you demonstrated. They are relatively few but they occur thousands of times. – ursoouindio Feb 15 '11 at 20:02
  • I read "us" as micro-seconds as I don't have a mu on my keyboard. ;) How is it that you have tasks which occur thousands of times but you can't group them together. If the tasks are too short lived, you are better off just processing them in the current thread. It will be faster. – Peter Lawrey Feb 15 '11 at 20:25
  • Sorry, maybe I'm missing some point or I'm not being clear. It happens that I'm not very experienced with this. I added some code to the question; maybe it can enlighten the case. – ursoouindio Feb 16 '11 at 11:40
  • That's a very small example, and as you say it is still faster, but you are seeing the tasks running slower. Since you are sleeping, you should be able to try 10 threads and 1000 tasks. – Peter Lawrey Feb 16 '11 at 12:20
  • I tried raising the number of threads, with no effect, but the number of tasks is defined by my program, since I need that whole batch of tasks done before following through with the rest (serial) of my program, and I will have the next batch of tasks only after the serial part. – ursoouindio Feb 16 '11 at 12:39
  • How long do these tasks take to execute? (short lived tasks can have more overhead than they help) Do they lock on shared resources? (Too much access to shared data can be slower than using a single thread to do the same thing) – Peter Lawrey Feb 16 '11 at 12:42
  • They do not access anything outside the program, nor do they need any synchronisation. Usually, each task takes 0.5 to 1 ms, but could be even faster if I'm dealing with a smaller case. – ursoouindio Feb 16 '11 at 12:48
  • I just tried raising the number of tasks here (the total computation remains the same, so doubling the number of tasks means tasks of half the work) and the program got slower... – ursoouindio Feb 16 '11 at 13:21
  • 0.5 to 1 ms is plenty of time to avoid the overhead of starting a task. I find it odd that halving the size of the task made it run even slower. There must be something each task is doing to start or finish which it doesn't have to do as often when there are fewer tasks. – Peter Lawrey Feb 16 '11 at 13:34
  • The tasks themselves are straightforward. At first, I have to solve N linear systems, then I create batches (tasks) of N/ntasks (let the division be exact, for simplicity) linear systems. Inside the run() of each batch is a loop that goes over N/ntasks systems, and after it the batch calls `countDown.countDown();`. – ursoouindio Feb 16 '11 at 13:47
  • Actually the performance is almost the same with more batches, not that much slower. But definitely not faster and still slower than the serial version. – ursoouindio Feb 16 '11 at 14:08
  • It appears your tasks could be accessing a lot of data. When this happens, your additional threads have to pull data from the context of the current thread and perhaps pass the results back again. Each double/long can take 30 ns. How much data would need to pass between threads? – Peter Lawrey Feb 16 '11 at 14:22
  • the way they are implemented, each Batch receives a pair of double[N][N2]. N is the total amount of work (160 for instance) and N2 is problem dependent but is around 10. Each batch has a loop like `for(int i=start; i – ursoouindio Feb 16 '11 at 14:40
  • To execute the batches, two other calls are needed right before the `executor.execute(batch)`: the first passes the pair of double[][] to the batch and the second sets the countDown. – ursoouindio Feb 16 '11 at 15:05
  • Each double[] is equivalent to 12 doubles (two for the overhead/object headers); 160 of those would take ~2K doubles, and if the data needs to be passed each way, that's about 4K doubles, which could take 120 micro-seconds. This is significant, but not the cause of the problem as far as I can see. It might be double this on your CPU. I think you need to cut down the problem and see how smaller samples behave, so you can see where the delay is coming from. Try passing a large double[] which is clone()d and passed back. – Peter Lawrey Feb 16 '11 at 16:39
  • Something I have noticed: if you have a small number of tasks, they take about 1 ms, possibly because things haven't warmed up. – Peter Lawrey Feb 16 '11 at 16:46
  • Actually, there are two arrays, so it will eat up 240 us for a 0.5 ms batch, which sure is significant. I will make some tests here, maybe passing smaller arrays to each batch. Thanks for this help! – ursoouindio Feb 16 '11 at 16:56
  • Well, probably they aren't warmed up. I don't even know what that is. I will look around for it. – ursoouindio Feb 16 '11 at 16:58
  • The smaller batches should be faster, but the total time it takes should be the same. Perhaps you could do some of the work on the current thread while you are waiting. Try the test I posted in EDIT2. – Peter Lawrey Feb 16 '11 at 17:00
  • When you run code for the first time it is interpreted. After a short time, some code is optimised to native code and after many iterations/calls it will be completely compiled to native code. Obviously this can make a big difference to performance. – Peter Lawrey Feb 16 '11 at 17:05
  • Interesting behaviour. I was thinking of something I could do to warm up my problem, but maybe that's just the JVM's job... What should affect this warm-up effect? The number of batches? Actually, I see that using more batches than available processing cores (queuing them) does not help. But every batch still receives the same amount of data (double[N][n2]), independently of which double[]s it will use. – ursoouindio Feb 16 '11 at 17:24
  • For a CPU intensive process the optimal number of threads is often the number of available cores. – Peter Lawrey Feb 16 '11 at 17:57

From what I've read: "thousands of independent calculations... happen at the same time... would take some milliseconds to run", it seems to me that your problem is perfect for GPU programming.

And I think that answers your question. GPU programming is becoming more and more popular, and there are Java bindings for CUDA & OpenCL. If it is possible for you to use it, I say go for it.

zkunov

Here's a pseudocode outline of what I'm thinking:

// Calculation, CalculationResult and MainCalculator stand in for your own
// classes; calcs should be a thread-safe java.util.Queue implementation.
class WorkerThread extends Thread {

    Queue<Calculation> calcs;
    MainCalculator mainCalc;

    public void run() {
        while (true) {
            // busy waiting; context switching probably won't be so bad
            while (calcs.isEmpty()) {
                try {
                    sleep(500);
                } catch (InterruptedException e) {
                    return;
                }
            }
            Calculation calc = calcs.poll(); // poll() gets and removes the head
            CalculationResult result = calc.calc();
            mainCalc.returnResultFor(calc, result);
        }
    }
}

Another option, if you're calling external programs: don't put them in a loop that runs them one at a time, or they won't execute in parallel. You can put them in a loop that PROCESSES them one at a time, but not one that execs them one at a time.

Process calc1 = Runtime.getRuntime().exec("myCalc paramA1 paramA2 paramA3");
Process calc2 = Runtime.getRuntime().exec("myCalc paramB1 paramB2 paramB3");
Process calc3 = Runtime.getRuntime().exec("myCalc paramC1 paramC2 paramC3");
Process calc4 = Runtime.getRuntime().exec("myCalc paramD1 paramD2 paramD3");

// note: if the processes print a lot of output, read their streams
// before waitFor(), or the output buffers may fill and block them
calc1.waitFor();
calc2.waitFor();
calc3.waitFor();
calc4.waitFor();

InputStream is1 = calc1.getInputStream();
InputStreamReader isr1 = new InputStreamReader(is1);
BufferedReader br1 = new BufferedReader(isr1);
String resultStr1 = br1.readLine();

InputStream is2 = calc2.getInputStream();
InputStreamReader isr2 = new InputStreamReader(is2);
BufferedReader br2 = new BufferedReader(isr2);
String resultStr2 = br2.readLine();

InputStream is3 = calc3.getInputStream();
InputStreamReader isr3 = new InputStreamReader(is3);
BufferedReader br3 = new BufferedReader(isr3);
String resultStr3 = br3.readLine();

InputStream is4 = calc4.getInputStream();
InputStreamReader isr4 = new InputStreamReader(is4);
BufferedReader br4 = new BufferedReader(isr4);
String resultStr4 = br4.readLine();
corsiKa
  • Thanks, @glowcoder! I am not using queues in my program: if I have 4 cores, I just create 4 batches as 4 Runnables, solve them, and move on to the sequential part of the code until it needs the independent calculations again. – ursoouindio Feb 15 '11 at 17:47
  • That's what I'm saying. If you add a Queue to your thread, you can use the same thread and just give it calculations to solve when they're available. Then your mainCalc can just wait to get back the calculation results. – corsiKa Feb 15 '11 at 17:58
  • ok, I think I see your point. Inside my code, the independent calculations can be called by C code (via a JNI implementation). I'll have to think a little to apply this idea, but at this time a big refactoring is not in my plans. And another thing: I can't `extend Thread`, since the parallel routines extend my own base classes... – ursoouindio Feb 15 '11 at 18:42
  • About JNI: You could also use an external process instead of using a Thread. I'll edit this notion into the post. About extending Thread: You can always `implements Runnable` and do `Thread worker = new Thread(new Worker())`. – corsiKa Feb 15 '11 at 20:19
  • Actually, the JNI code runs with the main loop; it calls several Java methods, some of which I'm trying to parallelize. – ursoouindio Feb 16 '11 at 11:43

Hmm, CachedThreadPool seems to be made just for your case. It does not recreate threads if you reuse them soon enough, and if you wait a whole minute before you use a new thread, the overhead of thread creation is comparatively negligible.
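
For instance (a minimal sketch reusing the `Batch` class from the question; the iteration and batch counts are placeholders): create the pool once, outside the main loop, so its threads are reused across iterations.

// the enclosing method must handle InterruptedException from await()
ExecutorService pool = Executors.newCachedThreadPool();
for (int iter = 0; iter < 10000; iter++) {
    CountDownLatch done = new CountDownLatch(4);
    for (int i = 0; i < 4; i++) {
        pool.execute(new Batch(done)); // Batch as defined in the question
    }
    done.await(); // the serial part of the iteration runs after this
}
pool.shutdown();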

But you can't expect parallel execution to speed up your calculations unless you can also access the data in parallel. If you employ extensive locking, many synchronized methods, etc., you'll spend more on overhead than you gain from parallel processing. Check that your data can be efficiently processed in parallel and that you don't have non-obvious synchronizations lurking in the code.

Also, CPUs process data efficiently when it fits fully into cache. If each thread's data set is bigger than half the cache, the two threads will compete for cache and issue many RAM reads, while one thread, even though it only employs one core, may perform better because it avoids RAM reads in the tight loop it executes. Check this, too.

9000
  • thanks for the answer! In my case, it will never need to wait 60 seconds to enter the parallel part; it is the most frequent part of my program. Do you think CachedThreadPool is better, even if I plan to always use the same number of parallel threads? – ursoouindio Feb 15 '11 at 18:20
  • And about locking and synchronizing, that is not my case, since my program has a simpler structure in this sense. And each thread needs just double arrays. They can hold up to 2000 elements in each batch. – ursoouindio Feb 15 '11 at 18:23