
First, this is not about an array whose subsequences happen to be in some order before the sort starts; it is about an array with a special structure.

I'm writing a simple method that sorts data. Until now I have used Array.Sort, but PLINQ's OrderBy outperforms standard Array.Sort on large arrays.

So I decided to write my own multithreaded sort implementation. The idea is simple: split the array into partitions, sort each partition in parallel, then merge all the results into one array.

I'm now done with partitioning and sorting:

public class PartitionSorter
{
    public static void Sort(int[] arr)
    {
        var ranges = Range.FromArray(arr);
        var allDone = new ManualResetEventSlim(false, ranges.Length*2);
        int completed = 0;
        foreach (var range in ranges)
        {
            ThreadPool.QueueUserWorkItem(r =>
            {
                var rr = (Range) r;
                Array.Sort(arr, rr.StartIndex, rr.Length);
                if (Interlocked.Increment(ref completed) == ranges.Length)
                    allDone.Set();
            }, range);
        }
        allDone.Wait();
    }
}

public class Range
{
    public int StartIndex { get; }
    public int Length { get; }

    public Range(int startIndex, int length)
    {
        StartIndex = startIndex;
        Length = length;
    }

    public static Range[] FromArray<T>(T[] source)
    {
        int processorCount = Environment.ProcessorCount;
        int partitionLength = (int) (source.Length/(double) processorCount);
        var result = new Range[processorCount];
        int start = 0;
        for (int i = 0; i < result.Length - 1; i++)
        {
            result[i] = new Range(start, partitionLength);
            start += partitionLength;
        }
        result[result.Length - 1] = new Range(start, source.Length - start);
        return result;
    }
}

As a result I get an array with a special structure, for example

[1 3 5 | 2 4 7 | 6 8 9]

Now, how can I use this information to finish the sort? Insertion sort and the others don't exploit the fact that the data within each block is already sorted and only needs to be merged. I tried to apply some algorithms from merge sort, but failed.
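For what it's worth, the missing merge step can be sketched as repeated pairwise merging of adjacent sorted blocks through a scratch buffer (the `RunMerger` name and the boundary-array representation are mine, not from the code above):

```csharp
using System;

public static class RunMerger
{
    // Merges two adjacent sorted runs [lo, mid) and [mid, hi) from src into dst.
    static void MergeRuns(int[] src, int[] dst, int lo, int mid, int hi)
    {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            dst[k++] = src[i] <= src[j] ? src[i++] : src[j++];
        while (i < mid) dst[k++] = src[i++];
        while (j < hi)  dst[k++] = src[j++];
    }

    // Repeatedly merges pairs of adjacent runs until one run remains.
    // boundaries holds the start index of each sorted block, plus arr.Length.
    public static void MergeAll(int[] arr, int[] boundaries)
    {
        var src = arr;
        var dst = new int[arr.Length];
        int count = boundaries.Length - 1; // number of runs
        while (count > 1)
        {
            var next = new int[count / 2 + count % 2 + 1];
            int n = 0;
            for (int r = 0; r + 1 < count; r += 2)
            {
                MergeRuns(src, dst, boundaries[r], boundaries[r + 1], boundaries[r + 2]);
                next[n++] = boundaries[r];
            }
            if (count % 2 == 1) // odd run out: copy it through unchanged
            {
                Array.Copy(src, boundaries[count - 1], dst, boundaries[count - 1],
                           boundaries[count] - boundaries[count - 1]);
                next[n++] = boundaries[count - 1];
            }
            next[n] = boundaries[count];
            boundaries = next;
            count = n;
            var tmp = src; src = dst; dst = tmp; // swap buffers between passes
        }
        if (!ReferenceEquals(src, arr))
            Array.Copy(src, arr, arr.Length);
    }
}
```

Here the example `[1 3 5 | 2 4 7 | 6 8 9]` corresponds to `boundaries = {0, 3, 6, 9}`. Each pass is O(n), and with k blocks there are about log₂ k passes.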

terbubbs
Alex Zhukovskiy
    Since you are essentially doing a Merge-Sort you should continue in that direction! Why did you fail to implement the Merge-Sort? – MrPaulch Feb 25 '16 at 14:22
  • @MrPaulch Because an in-place merge sort is hard to implement. Before this I used a quicksort, which works well in place, but its performance is worse than a naive single-threaded `Array.Sort` because of random memory access. – Alex Zhukovskiy Feb 25 '16 at 14:23
  • Then you should adapt your algorithm to [Quicksort](https://en.wikipedia.org/wiki/Quicksort) - `Array.Sort` actually uses Introsort, which is a hybrid of *Quicksort* and *Heapsort* - I once implemented a highly specialized Quicksort algorithm that outperformed `Array.Sort` since it knew the kind of data it had to sort. – MrPaulch Feb 25 '16 at 14:30
  • @MrPaulch That's exactly the case, and I will continue in this direction, because multithreading works very badly with quicksort (because of memory overlap and so on). I mean we should sort the partitions and *then* merge, because if we swap these steps we lose all the benefits of multithreading. I just asked for advice in case there is some very efficient algorithm for this. There is a gain even if you call Array.Sort after all these manipulations, because it has some optimisations for partially sorted data. But our a priori knowledge can help us build an algorithm that is much faster. – Alex Zhukovskiy Feb 25 '16 at 14:34
  • [Check out this parallel quicksort](http://stackoverflow.com/a/1897484/106159) – Matthew Watson Feb 25 '16 at 14:40
  • 1
    https://en.wikipedia.org/wiki/Timsort is normally quite fast on partially sorted collections... *The algorithm finds subsets of the data that are already ordered, and uses that knowledge to sort the remainder more efficiently* – xanatos Feb 25 '16 at 14:41
  • Btw. I did use unsafe code for my qsort to pin the memory location of the array, which made some but no significant difference to the performance. It was single-threaded, though. I think you could do a hybrid: quicksorting small partitions and merge-sorting the last few partitions. – MrPaulch Feb 25 '16 at 14:42
  • @MatthewWatson It was the first hit on Google, so I had already tried it, but unfortunately it was a bit slower than the sequential one. Here are the results: http://i.imgur.com/5QYdgki.png – Alex Zhukovskiy Feb 25 '16 at 14:45
  • Yeah, the problem with most parallel implementations (like that one) is that they don't limit the number of threads to the number of processor cores, so then the context switching overhead destroys any performance gains. – Matthew Watson Feb 25 '16 at 14:47
  • @MatthewWatson That's not true: the ThreadPool always uses several threads and doesn't use more threads than required. I agree there is some gain in doing it manually rather than creating additional ThreadPool tasks, but I don't think it's very significant. And in this very implementation I create only as many work items as required, and even then I don't create threads manually; I use the thread pool. So I don't see why you think this implementation has problems. If you do, I will read your arguments with pleasure, because it will improve me and my code. – Alex Zhukovskiy Feb 25 '16 at 14:50
  • @AlexZhukovskiy The thread pool will use far more threads than there are processor cores. Try this: `int workers, io; ThreadPool.GetMaxThreads(out workers, out io); Console.WriteLine(workers);` But when scheduling threads, it will start adding a delay between starting new threads once it gets past a certain number. – Matthew Watson Feb 25 '16 at 15:39
  • @MatthewWatson That's not the actual value, it's only the theoretical limit of threads that can be used. In most cases the ThreadPool ***never*** uses them all simultaneously. – Alex Zhukovskiy Feb 26 '16 at 07:01
  • @AlexZhukovskiy Nevertheless, the point is that more threads than processor cores will be used, which is one of the reasons the multithreaded implementation can be slower than the sequential implementation. You can, of course, see how many threads are used by printing the managed thread ID in each thread. – Matthew Watson Feb 26 '16 at 07:15
  • @MatthewWatson OK, let me reformulate my position. There are several threads (finalizers, GC) that are running but don't take much processor time, so we can easily presume their impact on performance is insignificant. We can say the same about other processes. We never know whether all cores are occupied by another process, but we assume they are not; otherwise we shouldn't implement any multithreaded processing at all, for fear that someone else is doing the same. – Alex Zhukovskiy Feb 26 '16 at 07:27
  • 1
    The thing to remember that having more threads than cores can be useful if one or more of the threads is blocked for some reason. However, if all the threads are active then having more threads than cores will always slow things down due to the extra context switching. I'm not saying you should never do it; I'm just explaining why the multithreaded implementation can be slower than the sequential one. – Matthew Watson Feb 26 '16 at 08:52

1 Answer


I've done some testing with a parallel Quicksort implementation.

I tested the following code with a RELEASE build on 64-bit Windows 10, compiled with C# 6 (Visual Studio 2015) against .NET 4.6.1, and run outside any debugger.

My processor is a quad core with hyperthreading (which is certainly going to help any parallel implementation!).

The array size is 20,000,000 (so a fairly large array).

I got these results:

LINQ OrderBy()  took 00:00:14.1328090
PLINQ OrderBy() took 00:00:04.4484305
Array.Sort()    took 00:00:02.3695607
Sequential      took 00:00:02.7274400
Parallel        took 00:00:00.7874578

PLINQ OrderBy() is much faster than LINQ OrderBy(), but slower than Array.Sort().

QuicksortSequential() is around the same speed as Array.Sort()

But the interesting thing here is that QuicksortParallelOptimised() is noticeably faster on my system - so it's definitely an efficient way of sorting if you have enough processor cores.

Here's the full compilable console app. Remember to run it in RELEASE mode - if you run it in DEBUG mode the timing results will be woefully incorrect.

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace Demo
{
    class Program
    {
        static void Main()
        {
            int n = 20000000;
            int[] a = new int[n];
            var rng = new Random(937525);

            for (int i = 0; i < n; ++i)
                a[i] = rng.Next();

            var b = a.ToArray();
            var d = a.ToArray();

            var sw = new Stopwatch();

            sw.Restart();
            var c = a.OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
            Console.WriteLine("LINQ OrderBy() took " + sw.Elapsed);

            sw.Restart();
            var e = a.AsParallel().OrderBy(x => x).ToArray(); // Need ToArray(), otherwise it does nothing.
            Console.WriteLine("PLINQ OrderBy() took " + sw.Elapsed);

            sw.Restart();
            Array.Sort(d);
            Console.WriteLine("Array.Sort() took " + sw.Elapsed);

            sw.Restart();
            QuicksortSequential(a, 0, a.Length-1);
            Console.WriteLine("Sequential took " + sw.Elapsed);

            sw.Restart();
            QuicksortParallelOptimised(b, 0, b.Length-1);
            Console.WriteLine("Parallel took " + sw.Elapsed);

            // Verify that our sort implementation is actually correct!

            Trace.Assert(a.SequenceEqual(c));
            Trace.Assert(b.SequenceEqual(c));
        }

        static void QuicksortSequential<T>(T[] arr, int left, int right)
        where T : IComparable<T>
        {
            if (right > left)
            {
                int pivot = Partition(arr, left, right);
                QuicksortSequential(arr, left, pivot - 1);
                QuicksortSequential(arr, pivot + 1, right);
            }
        }

        static void QuicksortParallelOptimised<T>(T[] arr, int left, int right)
        where T : IComparable<T>
        {
            const int SEQUENTIAL_THRESHOLD = 2048;
            if (right > left)
            {
                if (right - left < SEQUENTIAL_THRESHOLD)
                {
                    QuicksortSequential(arr, left, right);
                }
                else
                {
                    int pivot = Partition(arr, left, right);
                    Parallel.Invoke(
                        () => QuicksortParallelOptimised(arr, left, pivot - 1),
                        () => QuicksortParallelOptimised(arr, pivot + 1, right));
                }
            }
        }

        static int Partition<T>(T[] arr, int low, int high) where T : IComparable<T>
        {
            int pivotPos = (high + low) / 2;
            T pivot = arr[pivotPos];
            Swap(arr, low, pivotPos);

            int left = low;
            for (int i = low + 1; i <= high; i++)
            {
                if (arr[i].CompareTo(pivot) < 0)
                {
                    left++;
                    Swap(arr, i, left);
                }
            }

            Swap(arr, low, left);
            return left;
        }

        static void Swap<T>(T[] arr, int i, int j)
        {
            T tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
        }
    }
}
Matthew Watson
  • I tried this code yesterday, and AFAIR this pivot is very bad when the array is already sorted (for example, on `new int[20*1000*1000]`). – Alex Zhukovskiy Feb 26 '16 at 11:03
  • Only a naive pivot implementation gets a valid result without `cutting` off just one element per call. This is why I switched to merge sort, which has no worst case. Well, now the problem is the pivot; I'll try to solve it myself. Thanks. – Alex Zhukovskiy Feb 26 '16 at 11:27
  • @AlexZhukovskiy Yes, it looks like the crucial remaining thing is a good choice of pivot. – Matthew Watson Feb 26 '16 at 11:32
  • I'm not sure I understand what you're talking about. For an array of zeroes this pivot returns the left bound on every call, so there will be N calls of Partition, which is slow enough. Of course on real, distinct data this pivot works fine, but I wanted to use this sort as a generic sorting method, and it's not good when in some cases performance degrades to a level below even bubble sort. I was looking for something that could handle this special case. Just comment out the code where you use `rng` (lines 3-6) and you will see it yourself. – Alex Zhukovskiy Feb 26 '16 at 11:42
  • For `n=20000` I get [the following result](http://i.imgur.com/fFTsbdX.png). For `n=200000` I get a StackOverflowException, and I'm not surprised, because there are exactly N recursive calls. – Alex Zhukovskiy Feb 26 '16 at 11:50
  • @AlexZhukovskiy I was just agreeing with you that to make this code work effectively in the already-sorted case, it needs a different algorithm for selecting the pivot. Perhaps even just choosing the centre value. – Matthew Watson Feb 26 '16 at 11:50
  • All right, I just misread. Thanks for the comparison; I'm starting to search for a good pivot algorithm. AFAIK this algorithm was invented at a Google hackathon, so I was surprised that it's so inefficient in some cases. – Alex Zhukovskiy Feb 26 '16 at 11:52
  • 1
    @AlexZhukovskiy I expect you've already seen this, but just in case you haven't: https://en.wikipedia.org/wiki/Quicksort#Choice_of_pivot – Matthew Watson Feb 26 '16 at 11:54
  • When we have an array of zeroes, it's not important how we choose an element within the range, because they will all be equal. The problem is that we move all elements greater *or equal*, which means all the zeroes get interchanged with one another. So the problem is not which element we choose (they are all equal) but what we do afterwards. I want to understand ***how*** we can avoid the situation where we pivot around the first element. We select the middle (or median, or random; where we start doesn't matter), and then move it to the first element. – Alex Zhukovskiy Feb 26 '16 at 12:04
  • Thanks for the link. I found that this is the known **Dutch national flag** problem, and I found a solution: a partition that divides the sequence into three parts. I will implement it. Thank you for the answer, it was very helpful (and the comments were too). – Alex Zhukovskiy Feb 26 '16 at 12:11
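A sketch of that three-way (Dutch national flag) partition, with names of my own choosing rather than code from the thread (kept to pre-C# 7 syntax, matching the answer's toolchain):

```csharp
public static class ThreeWayPartitioner
{
    // Dutch national flag partition: rearranges arr[low..high] into
    // [< pivot | == pivot | > pivot]. Returns lt and sets gt so that
    // arr[lt..gt] holds every element equal to the pivot.
    public static int Partition(int[] arr, int low, int high, out int gt)
    {
        int pivot = arr[low + (high - low) / 2];
        int lt = low, i = low;
        gt = high;
        while (i <= gt)
        {
            if (arr[i] < pivot) Swap(arr, lt++, i++);      // grow the < block
            else if (arr[i] > pivot) Swap(arr, i, gt--);   // grow the > block
            else i++;                                      // equal: leave in place
        }
        return lt;
    }

    static void Swap(int[] arr, int i, int j)
    {
        int tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }
}
```

A quicksort built on this recurses only into `arr[low..lt-1]` and `arr[gt+1..high]`, so an all-zero array is handled in a single pass instead of N recursive calls.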