14

I am writing a multi-threaded application in Java in order to improve performance over the sequential version. It is a parallel version of the dynamic programming solution to the 0/1 knapsack problem. I have an Intel Core 2 Duo with both Ubuntu and Windows 7 Professional on different partitions. I am running in Ubuntu.

My problem is that the parallel version actually takes longer than the sequential version. I am thinking this may be because the threads are all being mapped to the same kernel thread or that they are being allocated to the same core. Is there a way I could ensure that each Java thread maps to a separate core?

I have read other posts about this problem but nothing seems to help.

Here is the end of main() and all of run() for the KnapsackThread class (which extends Thread). Notice that they way I use slice and extra to calculate myLowBound and myHiBound ensure that each thread will not overlap in domain of the dynProgMatrix. Therefore there will be no race conditions.

    dynProgMatrix = new int[totalItems+1][capacity+1];
    for (int w = 0; w<= capacity; w++)
        dynProgMatrix[0][w] = 0;
    for(int i=0; i<=totalItems; i++)
        dynProgMatrix[i][0] = 0;
    slice = Math.max(1,
            (int) Math.floor((double)(dynProgMatrix[0].length)/threads.length));
    extra = (dynProgMatrix[0].length) % threads.length;

    barrier = new CyclicBarrier(threads.length);
    for (int i = 0; i <  threads.length; i++){
        threads[i] = new KnapsackThread(Integer.toString(i));
    }
    for (int i = 0; i < threads.length; i++){
        threads[i].start();
    }

    for (int i = 0; i < threads.length; i++){
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

public void run(){
    int myRank = Integer.parseInt(this.getName());

    int myLowBound;
    int myHiBound;

    if (myRank < extra){
        myLowBound = myRank * (slice + 1);
        myHiBound = myLowBound + slice;
    }
    else{
        myLowBound = myRank * slice + extra;
        myHiBound = myLowBound + slice - 1;
    }

    if(myHiBound > capacity){
        myHiBound = capacity;
    }

    for(int i = 1; i <= totalItems; i++){
        for (int w = myLowBound; w <= myHiBound; w++){

            if (allItems[i].weight <= w){
               if (allItems[i].profit + dynProgMatrix[i-1][w-allItems[i].weight]
                        > dynProgMatrix[i-1][w])
                {
                    dynProgMatrix[i][w] = allItems[i].profit +
                                      dynProgMatrix[i-1][w- allItems[i].weight];
                }
                else{
                    dynProgMatrix[i][w] = dynProgMatrix[i-1][w];
                }
            }
            else{
                dynProgMatrix[i][w] = dynProgMatrix[i-1][w];
            }
        }
        // now place a barrier to sync up the threads
        try {
            barrier.await(); 
        } catch (InterruptedException ex) { 
            ex.printStackTrace();
            return;
        } catch (BrokenBarrierException ex) { 
            ex.printStackTrace(); 
            return;
        }
    }
}

Update:

I have written another version of the knapsack that uses brute force. This version has very little synchronization because I only need to update a bestSoFar variable at the end of a single thread's execution. Therefore, each thread pretty much should execute completely in parallel except for that small critical section at the end.

I ran this versus the sequential brute force and still it takes longer. I don't see any other explanation than that my threads are being run sequentially, either because they are being mapped to the same core or to the same native thread.

Does anybody have any insight?

Michael Myers
  • 188,989
  • 46
  • 291
  • 292
KBP
  • 1,500
  • 2
  • 14
  • 14
  • 2
    Welcome to the world of parallel computing! It is very likely that your threads are in fact being mapped to different cores, and it is also likely that your program would in fact be faster (although still slower than the sequential version) if they were on the same core. Where did you get the parallel knapsack algorithm from? Is it designed to reduce shared-memory communication (including locking) as much as possible? – Pascal Cuoq Dec 13 '09 at 10:14
  • It might be usefull if you post a link to your KnapsackThread code, and say something about the amount of threads you're using. More than 4-8 threads may be a problem on a core duo,and synchronized blocks can bring down any code :) – extraneon Dec 13 '09 at 10:27
  • also what VM are you using, sun on windows and openjdk on ubuntu, or sun on both? – ewanm89 Dec 13 '09 at 12:21
  • I wrote my version from a sequential version by another person. It does not lock access to the shared 2d array, I do however ensure that each thread only writes to different indecies of the 2d array to avoid race conditions. I am using one thread per core which i do by: private static KnapsackThread threads[] = new KnapsackThread[ Runtime.getRuntime().availableProcessors() ]; } – KBP Dec 13 '09 at 15:59
  • I do not know what VM I am using besides 1.6. I ran the java -version in the terminal: java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) Server VM (build 14.2-b01, mixed mode) – KBP Dec 13 '09 at 16:01
  • Watch out using `Runtime.availablePrcoessors`. I hear it will return double the number of cores on some Intel chips (HyperThreading enabled, I do believe). While this does (supposedly) represent the actual number of physical "Pipelines" this processor has. Because Intel chips have two pipelines per core. It can be misleading as running twice the number of threads as your real number of cores might not be optimal performance in certain situations. AMD scores nicely in this area, IMHO. They make 8 core processors and `Runtime.availablePrcoessors` works as expected. – Pimp Trizkit May 03 '13 at 15:22
  • (I ment - AMD makes afforable 8 core processors) – Pimp Trizkit May 03 '13 at 15:33
  • I think you are missing one important thing to keep in mind while debugging problems like this. JVM is just a program running on your machine and thus it's process competes with other processes on the machine for CPU time. So not all threads you schedule might be able to run at the same time and context switching between processes might be required more often if you have more threads. – Igor Čordaš Jul 08 '14 at 14:19

3 Answers3

22

I doubt that it will be due to using the same core for all threads. The scheduling is up to the OS, but you should be able to see what's going on if you bring up the performance manager for the OS - it will typically show how busy each core is.

Possible reasons for it taking longer:

  • Lots of synchronization (either necessary or unnecessary)
  • The tasks taking such a short time that thread creation is taking a significant proportion of the time
  • Context switching, if you're creating too many threads - for CPU intensive tasks, create as many as threads as you have cores.
Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • Great answer, as usual. But I suspect we can rule out your 2nd bullet point: Looking at the coding, apparently a small number of Threads is being created, and these run -once- for a goodly long time. Synchronization looks like a good bet. – Carl Smotricz Dec 13 '09 at 11:06
  • 2
    @Carl: We haven't really seen enough code to know. We don't know how big the `threads` array is, unless I've missed something. – Jon Skeet Dec 13 '09 at 11:09
  • 1
    (`Runtime.availablePrcoessors` will give the number of hardware threads.) – Tom Hawtin - tackline Dec 13 '09 at 12:03
  • Runtine.availableProcessors gives me the number of cores right? not threads – KBP Dec 13 '09 at 16:16
  • 1
    As the method name already hints, it returns the number of available processors (logical processing units, CPU cores as you want), not threads. – BalusC Dec 13 '09 at 17:52
6

I was having the same problem for a while. I had a CPU-hungry program that I divided in 2 threads (double core CPU), but one beautifull day, while processing some more data, it just stopped using both cores. I just raised the heap mem size (-Xmx1536m in my case), and it worked fine again.

wattostudios
  • 8,666
  • 13
  • 43
  • 57
arm3nio
  • 61
  • 1
  • 1
  • Wow, thanks I was doing benchmarks and was wondering why the benchmarks are using only 2 cores from 7. Increasing the heap space solved the problem for me! – seb Jul 19 '13 at 13:08
1

I suggest you take a look at how long it takes for each of your worker threads before they terminate. Perhaps one of the threads has a much more difficult task. If that's the case, then the overhead caused by synchronization and so on will easily eat up what you've gained from threading.

Buhb
  • 7,088
  • 3
  • 24
  • 38