
I'm playing with quicksort parallelization in F# using Task-based parallelism.

I cannot get the parallel code to run faster than the sequential version. The 'quicksortParallel' function takes a depth argument that decides whether the recursive calls at that depth/level run sequentially or in parallel; passing a negative depth runs the whole sort sequentially. The sequential run takes around 9 seconds to sort 2 million numbers. If I pass a non-negative depth below 4, the time stays almost the same, and for depth values above 4 the run time starts increasing again because the cost of parallelizing exceeds the gains from parallelizing the code.

What I do not understand is why I don't see a performance gain for depth values 0 to 4. I am running it on an Intel i9 CPU with 16 logical cores. How can I parallelize this?

open System
open System.Threading.Tasks
module myMod =
    let genRandomNums count =
        let rnd = System.Random()
        List.init count (fun _ -> rnd.Next())

    let rec quicksortParallel depth aList =
        match aList with
        | [] -> []
        | firstElement :: restOfList ->
            let smaller, larger =
                List.partition (fun number -> number < firstElement) restOfList
            if depth < 0 then
                let left  = quicksortParallel depth smaller
                let right = quicksortParallel depth larger
                left @ (firstElement :: right)
            else
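                // run both recursive calls as thread-pool tasks, then block until both finish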
                let left  = Task.Run(fun () -> quicksortParallel (depth-1) smaller)
                let right = Task.Run(fun () -> quicksortParallel (depth-1) larger)
                Task.WaitAll(left, right)
                left.Result @ (firstElement :: right.Result)
    
    let sampleNumbers = genRandomNums 2000000
    
    let stopWatch = System.Diagnostics.Stopwatch.StartNew()
    //let sortedSnums = quicksortParallel -1 sampleNumbers //this runs the quicksort sequentially
    let sortedSnums = quicksortParallel 4 sampleNumbers
    stopWatch.Stop()

    printfn "time taken %A millseconds\n" stopWatch.Elapsed.TotalMilliseconds
    printfn "time taken %A seconds\n" stopWatch.Elapsed.TotalSeconds
    printfn "time taken %A minutes\n" stopWatch.Elapsed.TotalMinutes
    printfn "time taken %A hours\n" stopWatch.Elapsed.TotalHours

The equivalent code in C# (without in-place partitioning) runs faster when parallelized:

class Program
    {
        static List<int> genRandomNums(int count)
        {
            var rnd = new System.Random();
            IEnumerable<int> enumerable = Enumerable.Range(0, count)
                .Select(i => new Tuple<int, int>(rnd.Next(int.MaxValue), i))
                                     //.OrderBy(i => i.Item1)
                                     .Select(i => i.Item1);
            return enumerable.ToList();
        }

        static List<T> QuickSort<T>(List<T> values, int depth)
           where T : IComparable
        {
            if (values.Count == 0)
            {
                return new List<T>();
            }

            //get the first element       
            T firstElement = values[0];

            //get the smaller and larger elements       
            var smallerElements = new List<T>();
            var largerElements = new List<T>();
            for (int i = 1; i < values.Count; i++)  // i starts at 1       
            {                                       // not 0!          
                var elem = values[i];
                if (elem.CompareTo(firstElement) < 0)
                {
                    smallerElements.Add(elem);
                }
                else
                {
                    largerElements.Add(elem);
                }
            }

            //return the result       
            var result = new List<T>();
            if (depth < 0)
            {
                List<T> smallList = QuickSort(smallerElements.ToList(), depth);
                result.AddRange(smallList);
                result.Add(firstElement);
                List<T> bigList = QuickSort(largerElements.ToList(), depth);
                result.AddRange(bigList);
                return result;
            }
            else
            {
                Task<List<T>> smallTask = Task.Run(() => { return QuickSort(smallerElements.ToList(), depth - 1); });
                Task<List<T>> bigTask = Task.Run(() => { return QuickSort(largerElements.ToList(), depth - 1); });


                List<Task<List<T>>> tasks = new List<Task<List<T>>>();
                tasks.Add(smallTask);
                tasks.Add(bigTask);
                Task.WaitAll(tasks.ToArray());

                List<T> smallList = smallTask.Result;
                result.AddRange(smallList);

                result.Add(firstElement);

                List<T> bigList = bigTask.Result;
                result.AddRange(bigList);
                return result;
            }
        }

        static void Main(string[] args)
        {
            var sampleNumbers = genRandomNums(50000000);

            int depth = 4;//set it to a negative value to run serially
            var stopWatch = System.Diagnostics.Stopwatch.StartNew();
            List<int> sortedList = QuickSort<int>(sampleNumbers, depth);
            stopWatch.Stop();

            Console.WriteLine("time taken {0} seconds\n", stopWatch.Elapsed.TotalSeconds);
            Console.WriteLine("time taken {0} minutes\n", stopWatch.Elapsed.TotalMinutes);
        }
    }

A correct implementation of quicksort in F#, one that partitions in place, does run faster when Task-parallelized:

open System
open System.Threading.Tasks

module myMod =
    
    let genRandomNums_arr count =
        let rnd = System.Random()
        Array.init count (fun _ -> rnd.Next(System.Int32.MaxValue))
    
    let swap (aArray: int array) indexA indexB = 
        let temp = aArray.[indexA]
        Array.set aArray indexA (aArray.[indexB])
        Array.set aArray indexB (temp)

    let partition (aArray: int array) first last =
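        // Lomuto partition: the last element is the pivot; 'wallindex' marks the
        // boundary between elements smaller than the pivot and everything else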
        let pivot = aArray.[last]
        let mutable wallindex = first
        let mutable currentindex = first
        while currentindex < last do  
            if aArray.[currentindex] < pivot then
                swap aArray wallindex currentindex
                wallindex <- wallindex + 1

            currentindex <- currentindex + 1    

        swap aArray wallindex last
        wallindex

    let rec quicksortParallelInPlace (aArray: int array) first last depth =
        if ((last - first) >= 1) then
            let pivotposition = partition aArray first last
            if depth < 0 then
                quicksortParallelInPlace aArray first (pivotposition - 1) depth
                quicksortParallelInPlace aArray (pivotposition + 1) last depth
            else
                let left  = Task.Run(fun () -> quicksortParallelInPlace aArray first (pivotposition - 1) (depth-1))
                let right = Task.Run(fun () -> quicksortParallelInPlace aArray (pivotposition + 1) last (depth-1))
                Task.WaitAll(left, right)
                        

    let quickSortInPlace (aArray: int array) depth =
        quicksortParallelInPlace aArray 0 (aArray.Length - 1) depth

    let sampleNumbers_arr = genRandomNums_arr 50000000    
    //printfn "un-sorted list %A" sampleNumbers_arr 

    let stopWatch1 = System.Diagnostics.Stopwatch.StartNew()
    //pass a negative depth to run the quicksort sequentially
    quickSortInPlace sampleNumbers_arr 4
    stopWatch1.Stop()

    //printfn "un-sorted list %A" sampleNumbers_arr

    printfn "time taken %A millseconds\n" stopWatch1.Elapsed.TotalMilliseconds
    printfn "time taken %A seconds\n" stopWatch1.Elapsed.TotalSeconds
    printfn "time taken %A minutes\n" stopWatch1.Elapsed.TotalMinutes
    printfn "time taken %A hours\n" stopWatch1.Elapsed.TotalHours        
umbersar
  • I wonder if for example, you call `Task.Run()` on, say, highest depth (depth=4), is it going to occupy the CPU until both its tasks are returned? I'm asking since obviously the leaf datasets (size=125000 and depth=0) are all supposed to run on 16 CPUs, but what if the upper depths still occupy them? – mangusta Aug 28 '20 at 04:27
  • ah the leaf datasets are actually sized 62500 and their count is 32. there will be 2 tasks running on each core but that's the best case. I wonder how the task is scheduled if all cores are already occupied. In the worst case, there will be 15 tasks running each on its core and 17 (1+16) tasks running on the same core. have you tried setting initial depth to 3 instead of 4? – mangusta Aug 28 '20 at 06:46
  • or probably the list is very close to sorted hence the size of both halves is so much different from each other that the parallelization doesn't actually help. it makes sense to check the array values too – mangusta Aug 28 '20 at 06:58
  • yeah, I have played with depths of various sizes but no luck. Your other point, that the list might be very close to sorted at the start, is possible, but the chances are very low. And given that this is the observed behavior over multiple runs, I would say the probability of that being the case is close to 0. – umbersar Aug 28 '20 at 10:30
  • are you sure that the tasks are scheduled across the cores? it seems that "Task.Run" allocates a thread from thread pool to run a specific task, it doesn't mean running it on another core though. Not sure about F#, but for example Golang has this function: https://golang.org/pkg/runtime/#GOMAXPROCS for setting the actual # of cores that will execute the app. I wonder if there is a similar function in F# – mangusta Aug 28 '20 at 11:47
  • these posts would be useful: https://stackoverflow.com/questions/15029869/how-to-limit-the-cores-available-in-an-f-application-is-it-sufficient-to-chang and also C#'s "Parallel.ForEach" alternative – mangusta Aug 28 '20 at 11:56
  • You're using lists for sorting, don't do that. Lists are singly linked recursive data structures. You'll get a much faster algorithm with arrays. Likely, a one time conversion from list to array and back will be faster than the chopping you're doing here. – Abel Aug 28 '20 at 12:48
  • @Abel Ok. Thanks. But the thing I do not understand is that even if using lists is inefficient, the parallel code should still have run faster than the serial code (both versions use lists). – umbersar Sep 03 '20 at 05:59
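
Following up on Abel's comment, here is a minimal sketch of the one-time list/array round trip he describes (the function name is mine; it assumes it sits in the same scope as quicksortParallelInPlace from the question's update):

// sketch of Abel's suggestion: convert once, sort the array in place, convert back
let quicksortViaArray depth (aList: int list) =
    let arr = List.toArray aList     // one O(n) copy out of the list
    quicksortParallelInPlace arr 0 (arr.Length - 1) depth
    Array.toList arr                 // one O(n) copy back to a list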

1 Answer

I suspect the culprit for the low performance is actually List.partition. See this. You might do better by computing the indices of the partition and working with those than by copying partitions around.
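
A quick way to check this claim (my own timing harness, not part of the original answer; results will vary by machine):

open System

// sketch: time List.partition by itself on 2M random ints to see how much
// of the sequential sort it accounts for; the pivot value below is arbitrary
let rnd = Random()
let nums = List.init 2000000 (fun _ -> rnd.Next())

let sw = System.Diagnostics.Stopwatch.StartNew()
let smaller, larger = List.partition (fun n -> n < 1000000000) nums
sw.Stop()
printfn "one List.partition over 2M elements: %f ms (split %d / %d)"
    sw.Elapsed.TotalMilliseconds smaller.Length larger.Length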

Nick
  • Yes, doing an in-place partition is the 'correct' solution to the quicksort problem, and that is what the original algorithm describes. It should also lead to drastically lower memory consumption. But the point here is that even in an inefficient immutable solution to quicksort (like the one implemented), we should still see reduced runtime from running parallel tasks. But that is not what is happening here. – umbersar Sep 02 '20 at 00:17
  • You will not, because partitioning is eating most of the time, and the rest easily becomes negligible due to the side effects of storing the new partitions in the cache. Cache can have a dramatic impact. See [this answer](https://stackoverflow.com/a/53056257/383426). – Nick Sep 02 '20 at 06:38
  • Well, I am also doing partitioning in the C# version (shared above in the question), but there the parallel version runs faster. So that does not explain why it would not for F#. – umbersar Sep 03 '20 at 06:03
  • I believe you are comparing apples to oranges. F#'s List is not the same as .NET's List. You may want to test the performance of these separately. – Nick Sep 03 '20 at 13:30
  • @umbersar the F# alias for System.Collections.Generic.List is ResizeArray. F# lists are pretty inefficient compared to arrays in several respects. Given their immutable nature something like mergesort might make more sense. ResizeArray (C# lists) supports random addressing, which is critical for quicksort. The pseudo-quicksort here isn't really quicksort. – phoog Sep 04 '20 at 00:13