Optimal number of threads

Question

I was experimented with multithreading and parallel programming. According to my current knowledge, for intensive cpu computations, the optimal thread number is equal to the number of cores, otherwise u'll have an overhead regarding context switch between threads. I am not involving hyperthreading in the game where you can double the logical processors. I tried to play with a parallel quicksort on a 4 core cpu with disabled hyperthreading.

This is a result set that made me a bit curious.

I am a bit confused reading numerous of answers why this can happen.

I can understand that 8 threads have no significant performance, but why they compute faster than 4 threads ?

CPU Cores : 4

10 tests ran with 1 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 3579
10 tests ran with 2 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 2458
10 tests ran with 4 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 1230
10 tests ran with 8 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 1030
10 tests ran with 16 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 1026
10 tests ran with 32 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 1036
10 tests ran with 64 thread(s), on an array of size : 30000000. Sorting process took on avg for each test : 1094

JMH results on 30million sample array with 10 iterations

Benchmark            Mode  Cnt     Score     Error  Units
Sample.run16Threads  avgt   10  1057.336 ±  38.845  ms/op
Sample.run1Thread    avgt   10  3385.159 ±  15.619  ms/op
Sample.run2Threads   avgt   10  1809.542 ±  23.794  ms/op
Sample.run32Threads  avgt   10  1133.388 ±  86.976  ms/op
Sample.run4Threads   avgt   10  1120.386 ± 113.688  ms/op
Sample.run64Threads  avgt   10  1546.910 ± 488.938  ms/op
Sample.run8Threads   avgt   10  1031.181 ±  52.650  ms/op

public class ParallelQuickSort extends RecursiveAction {
    private final int left;
    private final int right;
    private final int[] arr;

    public ParallelQuickSort(int left, int right, int[] arr) {
        this.left = left;
        this.right = right;
        this.arr = arr;
    }

    private int partition(int[] arr, long pivot) {
        int leftPtr = left - 1;
        int rightPtr = right;

        while ( true ) {
            while ( arr[++leftPtr] < pivot ) ;
            while ( rightPtr > 0 && arr[--rightPtr] > pivot ) ;

            if ( leftPtr >= rightPtr ) {
                break;
            } else {
                swap( arr, leftPtr, rightPtr );
            }
        }
        swap( arr, leftPtr, right );
        return leftPtr;
    }

    private void swap(int[] arr, int x, int y) {
        int temp = arr[x];
        arr[x] = arr[y];
        arr[y] = temp;
    }

    @Override
    protected void compute() {
        if ( right - left <= 0 ) {
            return;
        }
        long pivot = arr[right];
        int partition = partition( arr, pivot );
        ForkJoinTask.invokeAll( createSubTasks( partition ) );
    }

    private List<ParallelQuickSort> createSubTasks(int partition) {
        return List.of(
                new ParallelQuickSort( left, partition - 1, arr ),
                new ParallelQuickSort( partition + 1, right, arr )
        );
    }
}

public class Main {
    public static void main(String[] args) {
        SecureRandom secureRandom = new SecureRandom();
        int[] arr = IntStream.range( 0, 30_000_000 ).map( i -> secureRandom.nextInt( 200_000 ) ).toArray();
        int tests = 10;

        System.out.println( "CPU Cores : " + Runtime.getRuntime().availableProcessors() + "\n" );
        for ( int threads = 1; threads <= 64; threads *= 2 ) {
            long time = 0;
            for ( int i = 0; i < tests; ++i ) {
                int[] sample = Arrays.copyOfRange( arr, 0, arr.length );
                ForkJoinPool forkJoinPool = new ForkJoinPool(threads);
                long start = System.currentTimeMillis();
                forkJoinPool.invoke( new ParallelQuickSort( 0, arr.length - 1, sample ) );
                long end = System.currentTimeMillis();
                forkJoinPool.shutdown();
                time += end - start;
                //System.out.println( isSorted( sample ) );
            }
            System.out.println( tests + " tests ran with " + threads +
                    " thread(s), on an array of size : " + arr.length + ". Sorting process took on avg for each test : " + time / tests );
        }
    }

    private static boolean isSorted(int[] arr) {
        boolean isSorted = true;
        for ( int i = 0; i < arr.length - 1; ++i ) {
            if ( arr[i] > arr[i + 1] ) {
                isSorted = false;
                break;
            }
        }
        return isSorted;
    }
}

"Overloading" the system with more threads than cores can have benefits (this is actually what hyperthreading is partially about). I recommend reading [Latency Numbers every programmer should know (`gist.github.com`)](https://gist.github.com/l2dy/924a4650422ae762d26f97596a01fb0a). — Turing85, Dec 31 '22 at 18:00
Does [this](https://stackoverflow.com/questions/1718465/optimal-number-of-threads-per-core) answer your question? — chptr-one, Dec 31 '22 at 18:02
@DuncG The invoke method on the fork join pool submits a task to the pool, which then starts running on a separate thread, calls the compute method on that thread, and then waits for a result. — Alexander Argyriou, Dec 31 '22 at 18:14
@chptr-one i found this on your attached uestion: 1 thread per core is not the optimum. It needs to be slightly more, preferably twice that since this will allow another thread to run if a thread is temporarily blocked. Even if only on memory. This is more importnat if you have systems (P4,I7, Sun Rock etc) that feature SMT/HT) i think this covers me, thank you. — Alexander Argyriou, Dec 31 '22 at 18:15
You are _definitely_ not measuring performance correctly or in useful ways. Read https://stackoverflow.com/q/504103/869736. — Louis Wasserman, Dec 31 '22 at 19:09
@LouisWasserman thnx for noticing. I used jmh. 8 threads remain faster. 4 threads Result "org.example.Sample.run": 145.647 ±(99.9%) 26.328 ms/op [Average] (min, avg, max) = (120.948, 145.647, 181.042), stdev = 17.414 CI (99.9%): [119.319, 171.974] (assumes normal distribution) 8 threads Result "org.example.Sample.run": 110.237 ±(99.9%) 4.193 ms/op [Average] (min, avg, max) = (107.040, 110.237, 115.932), stdev = 2.774 CI (99.9%): [106.044, 114.431] (assumes normal distribution) — Alexander Argyriou, Dec 31 '22 at 20:43

Optimal number of threads

0 Answers0