
I am currently working on a recursively parallel Quicksort extension method for the List class. The code below uses the most basic thread-distribution criterion I've considered, because it is the simplest to explain conceptually: it spawns new threads down to a depth equal to the base-2 logarithm of the number of detected processors, and proceeds sequentially from there. Each CPU should therefore get one thread with a (roughly) equal, large share of the data to process, avoiding excessive overhead. The basic sequential algorithm is provided for comparison.

public static class Quicksort
{
    /// <summary>
    /// Helper class to hold information about when to parallelize
    /// </summary>
    /// <attribute name="maxThreads">Maximum number of supported threads</attribute>
    /// <attribute name="threadDepth">The depth to which new threads should
    ///                               automatically be made</attribute>
    private class ThreadInfo
    {
        internal int maxThreads;
        internal int threadDepth;

        public ThreadInfo(int length)   // length is currently unused
        {
            maxThreads = Environment.ProcessorCount;
            threadDepth = (int)Math.Log(maxThreads, 2);
        }

    }

    /// <summary>
    /// Helper function to perform the partitioning step of quicksort
    /// </summary>
    /// <param name="list">The list to partition</param>
    /// <param name="start">The starting index</param>
    /// <param name="end">The ending index</param>
    /// <returns>The final index of the pivot</returns>
    public static int Partition<T>(this List<T> list, int start, int end) where T: IComparable
    {
        // Middle element as pivot; this form also avoids int overflow.
        int middle = start + (end - start) / 2;

        // Swap pivot and first item.
        var temp = list[start];
        list[start] = list[middle];
        list[middle] = temp;
        var pivot = list[start];

        var swapPtr = start;

        for (var cursor = start + 1; cursor <= end; cursor++)
        {
            if (list[cursor].CompareTo(pivot) < 0)
            {
                // Swap cursor element and designated swap element
                temp = list[cursor];
                list[cursor] = list[++swapPtr];
                list[swapPtr] = temp;
            }
        }

        // Swap pivot with final lower item
        temp = list[start];
        list[start] = list[swapPtr];
        list[swapPtr] = temp;

        return swapPtr;
    }

    /// <summary>
    /// Method to begin parallel quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortParallel<T>(this List<T> list) where T : IComparable
    {
        if (list.Count < 2048)
            list.QuicksortSequential();
        else
        {
            var info = new ThreadInfo(list.Count);
            list.QuicksortRecurseP(0, list.Count - 1, 0, info);
        }
    }

    /// <summary>
    /// Method to implement parallel quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    /// <param name="depth">The current recursive depth</param>
    /// <param name="info">Structure holding decision-making info for threads</param>
    private static void QuicksortRecurseP<T>(this List<T> list, int start, int end, int depth,
                                             ThreadInfo info)
                                             where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        if (depth < info.threadDepth)
        {
            var t = Task.Run(() =>
            {
                list.QuicksortRecurseP(start, middle - 1, depth + 1, info);
            });

            list.QuicksortRecurseP(middle + 1, end, depth + 1, info);
            t.Wait();
        }
        else
        {
            list.QuicksortRecurseS(start, middle - 1);
            list.QuicksortRecurseS(middle + 1, end);
        }
    }

    /// <summary>
    /// Method to begin sequential quicksort algorithm on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    public static void QuicksortSequential<T>(this List<T> list) where T : IComparable
    {
        list.QuicksortRecurseS(0, list.Count - 1);
    }

    /// <summary>
    /// Method to implement sequential quicksort recursion on a Comparable list
    /// </summary>
    /// <param name="list">The list to sort</param>
    /// <param name="start">The starting index of the partition</param>
    /// <param name="end">The ending index of the partition (inclusive)</param>
    private static void QuicksortRecurseS<T>(this List<T> list, int start, int end) where T : IComparable
    {
        if (start >= end)
            return;

        int middle = list.Partition(start, end);

        // Now recursively sort the (approximate) halves.
        list.QuicksortRecurseS(start, middle - 1);
        list.QuicksortRecurseS(middle + 1, end);
    }
}

As far as I understand, this methodology should incur a one-time startup cost and then sort the rest of the data significantly faster than the sequential method. However, the parallel method takes significantly more time than the sequential method, and the gap actually widens as the load increases. Benchmarked on a list of ten million items on a 4-core CPU, the sequential method averages around 18 seconds to run to completion, while the parallel method takes upwards of 26 seconds. Increasing the allowed thread depth rapidly exacerbates the problem.
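For reference, a minimal harness along these lines reproduces those timings (the list size and RNG seed are just the values I happened to use; it assumes the extension methods above are in scope):

```
using System;
using System.Collections.Generic;
using System.Diagnostics;

class Benchmark
{
    // Build two identical random lists so both sorts see the same input.
    static List<int> RandomList(int n, int seed)
    {
        var rng = new Random(seed);
        var list = new List<int>(n);
        for (int i = 0; i < n; i++)
            list.Add(rng.Next());
        return list;
    }

    static void Main()
    {
        const int n = 10_000_000;
        var seq = RandomList(n, 42);
        var par = RandomList(n, 42);

        var sw = Stopwatch.StartNew();
        seq.QuicksortSequential();
        Console.WriteLine($"Sequential: {sw.Elapsed}");

        sw.Restart();
        par.QuicksortParallel();
        Console.WriteLine($"Parallel:   {sw.Elapsed}");
    }
}
```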

Any help in finding the performance hog is appreciated. Thanks!

  • List is not thread safe and you have no `lock` statements, so I'm a little surprised it works at all. On the other hand you could maybe [get away with using a plain array](http://stackoverflow.com/questions/1460634/are-c-sharp-arrays-thread-safe), which would also perform a bit better. I don't see any reason to use a `List` because the array never grows or shrinks. – John Wu Feb 22 '17 at 23:48
  • It works because the List is broken into disjoint partitions, so two threads are never working on the same portion at the same time. The reason a List is used is due to specification - it has to be an extension function for List, and converting it to an array and back felt excessive. I can try that though. – J. Kleinfeld Feb 23 '17 at 00:03

1 Answer
The problem is CPU cache conflict, also known as "false sharing."

Unless the memory address of the pivot point happens to fall on a cache-line boundary, one of the threads will get a lock on the L1 or L2 cache and the other one will have to wait. This can make performance even worse than a serial solution. The problem is described well in this article:

...where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. [1,2] This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.

(snip)

In most cases, the parallel code actually ran slower than the sequential code, and in no case did we get any better than a 42% speedup no matter how many cores we threw at the problem.

Grasping at straws.... you may be able to do better if you separate the list into two objects, and pin them in memory (or even just pad them within a struct) so they are far enough apart that they won't have a cache conflict.
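To illustrate the padding idea concretely (a sketch only; the struct and field names are mine, and the 64-byte line size is a common but not guaranteed value):

```
using System.Runtime.InteropServices;

// Two fields written by different threads, forced onto different
// cache lines by explicit offsets so their writes cannot falsely share.
[StructLayout(LayoutKind.Explicit, Size = 128)]
struct PaddedPair
{
    [FieldOffset(0)]  public long Left;   // written only by thread A
    [FieldOffset(64)] public long Right;  // written only by thread B
}
```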

John Wu
  • Thank you for this answer, it makes a lot of sense. I tried the suggestion in your previous comment (operating on arrays instead of Lists) and it improved the parallel performance significantly. Do you know if the List vs. array distinction is somehow related to this problem? – J. Kleinfeld Feb 23 '17 at 00:50
  • I imagine the code inside the List class shares some private member variables, while access to an array just requires a pointer and an offset. Maybe there is also cache contention within List ...? – John Wu Feb 23 '17 at 00:58
  • @J.Kleinfeld Lists do indeed use arrays internally; however, a list's initial capacity is generally `4`, and it keeps reallocating into a double-sized array, which can degrade performance, and it can store the array in different areas of memory, so there are a lot of misses. Arrays are stored much more efficiently. – VMAtm Feb 23 '17 at 01:34
  • @VMAtm this is completely irrelevant to sorting. – Mike Nakis Feb 23 '17 at 06:24
  • @JohnWu this is a very exotic answer to a very simple problem. The array is partitioned, and each thread works on a separate partition, which means that two threads will simultaneously access adjacent memory locations on extremely rare occasions, as in, when one thread is writing one of the last few elements of one partition while another thread happens to be writing one of the first few elements of the next partition. – Mike Nakis Feb 23 '17 at 06:30
  • You are right in that the problem is probably cache contention, but probably only because List keeps incrementing its internal `version` member. See https://referencesource.microsoft.com/#mscorlib/system/collections/generic/list.cs – Mike Nakis Feb 23 '17 at 06:36
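Following up on the comments: a sketch of the copy-to-array workaround that reportedly improved the parallel timings (hedged; `Array.Sort` stands in here for the recursive parallel sort, which would need an array overload):

```
using System;
using System.Collections.Generic;

public static class QuicksortArrayWorkaround
{
    // Copy the List's contents into a plain array, sort that, and copy
    // back. The array indexer is a raw memory access with no shared
    // 'version' field, so worker threads no longer contend on the
    // List's internal state.
    public static void QuicksortParallelViaArray<T>(this List<T> list)
        where T : IComparable
    {
        var items = list.ToArray();   // one O(n) copy out

        // Placeholder: run the same recursive partition/sort logic
        // on 'items' instead of the List.
        Array.Sort(items);

        for (int i = 0; i < items.Length; i++)
            list[i] = items[i];       // one O(n) copy back
    }
}
```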