
I was attempting to accelerate one of the kernels in my code, which at its essence boils down to a sort (I am running this on a CPU with several cores). I learned from this post (STL algorithms and concurrent programming) that some of these algorithms can be accelerated, for example using OpenMP (see below).

I got quite a decent speedup using __gnu_parallel::sort, e.g.

__gnu_parallel::sort(std::begin(X), std::end(X), [](X a, X b){ return a.member > b.member;});
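
For reference, a self-contained version of that call looks roughly like this (the struct and member names are just placeholders for my actual data; built with g++ and -fopenmp):

#include <parallel/algorithm>   // GNU libstdc++ parallel mode algorithms
#include <vector>

struct Item { int member; };    // placeholder for my actual element type

int main()
{
    std::vector<Item> X(1000000);
    // ... fill X with real data ...
    // build with: g++ -O2 -fopenmp example.cpp
    __gnu_parallel::sort(std::begin(X), std::end(X),
                         [](const Item &a, const Item &b){ return a.member > b.member; });
}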

It turns out that std::list is a much better container for my data, but it does not seem to have a parallel/multithreaded implementation of sort.

The post linked above is dated 2010, so I was wondering what the most current wisdom on this is.

kakrafoon
  • how about having the number of threads equal to the number of CPUs you have, splitting the list into equal chunks and letting each thread sort its chunk? (A rough sketch of this idea follows these comments.) – David Haim Feb 14 '16 at 10:11
    *"What is the cur"* -- Did you get distracted while writing that sentence and forget to finish it? – Benjamin Lindley Feb 14 '16 at 10:13
  • @DavidHaim - you'll still need to sort the result again after the threads all finish though, unless of course the list is already sorted. Intuitively, it feels to me like this approach would be worse than a naive single-threaded sort, since you've already had to sort the chunks too. Certainly some algos handle already sorted data better than others, but I think this would at best be cache-miss hell. – enhzflep Feb 14 '16 at 10:16
  • Sorting `list` is always problematic; are you sure you cannot use `deque` instead? Take a look here: http://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html for some comparisons. – Michał Walenciak Feb 14 '16 at 10:17
  • Yes, I am sure we can parallelize the problem accordingly, but I was just looking for a drop-in solution without having to do *too* much work. As it is, we can easily divide the problem up (I know this to be the case as the problem is embarrassingly parallel) and use MPI if we want to do some such thing. – kakrafoon Feb 15 '16 at 08:33
  • In 2017 we are getting parallel algorithms ([N4407](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4407.html) amongst others) if you can wait that long... :-) Currently however you will have to roll your own. I have an implementation on my [github](https://github.com/grahamreeds/parallel_partition) page you can use as a basis for your own. – graham.reeds Feb 15 '16 at 08:45
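
A rough sketch of the chunk-and-merge idea from David Haim's comment (not from the original discussion; the helper name, the chunk splitting and the single-threaded final merges are all just illustrative):

#include <list>
#include <thread>
#include <vector>
#include <iterator>

// Split the list into one chunk per thread, sort each chunk concurrently
// with list::sort, then merge the sorted chunks back into the original list.
template <typename T>
void chunked_list_sort(std::list<T> &l,
                       unsigned nthreads = std::thread::hardware_concurrency())
{
    if (nthreads < 2 || l.size() < 2) {             // nothing to parallelize
        l.sort();
        return;
    }
    std::vector<std::list<T>> chunks(nthreads);
    size_t chunk_size = l.size() / nthreads;
    for (unsigned i = 0; i + 1 < nthreads; i++) {   // splice out equal chunks (no copies)
        auto it = l.begin();
        std::advance(it, chunk_size);
        chunks[i].splice(chunks[i].begin(), l, l.begin(), it);
    }
    chunks[nthreads - 1].splice(chunks[nthreads - 1].begin(), l);   // remainder

    std::vector<std::thread> workers;               // sort each chunk in its own thread
    for (auto &c : chunks)
        workers.emplace_back([&c]{ c.sort(); });
    for (auto &t : workers)
        t.join();

    for (auto &c : chunks)                          // final merges run on one thread
        l.merge(c);
}

As enhzflep's comment points out, the final merges still run on a single thread and touch every node again, so any gain is limited to the per-chunk sort phase.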

2 Answers


In the case of Microsoft (prior to Visual Studio 2015), std::list::sort uses an array of lists, where array[i] is either an empty list or a list of size 2 to the power i (1, 2, 4, 8, ...). Nodes are taken from the original list one at a time and merged into the array, then the array entries are merged to form a single sorted list. Assuming compare overhead isn't excessive, this is a memory bound process, and multi-threading won't help: the overhead of scanning the list just to split it up nearly doubles the number of memory read operations. Here is example code for this type of list sort, with the restriction that the compare is < rather than <=, so the operands have to be reversed to keep the sort stable.

#include <stddef.h>                     /* for NULL */

#define NUMLISTS 32                     /* number of lists */

typedef struct NODE_{                   /* minimal node layout assumed for this example */
    struct NODE_ * next;
    int data;
}NODE;

NODE * MergeLists(NODE *pSrc1, NODE *pSrc2);    /* forward declaration */

NODE * SortList(NODE *pList)
{
{
NODE * aList[NUMLISTS];                 /* array of lists */
NODE * pNode;
NODE * pNext;
int i;
    if(pList == NULL)                   /* check for empty list */
        return NULL;
    for(i = 0; i < NUMLISTS; i++)       /* zero array */
        aList[i] = NULL;
    pNode = pList;                      /* merge nodes into array */
    while(pNode != NULL){
        pNext = pNode->next;
        pNode->next = NULL;
        for(i = 0; (i < NUMLISTS) && (aList[i] != NULL); i++){
            pNode = MergeLists(aList[i], pNode);
            aList[i] = NULL;
        }
        if(i == NUMLISTS)
            i--;
        aList[i] = pNode;
        pNode = pNext;
    }
    pNode = NULL;                       /* merge array into one list */
    for(i = 0; i < NUMLISTS; i++)
        pNode = MergeLists(aList[i], pNode);
    return pNode;
}

NODE * MergeLists(NODE *pSrc1, NODE *pSrc2)
{
NODE *pDst = NULL;                      /* destination head ptr */
NODE **ppDst = &pDst;                   /* ptr to head or prev->next */
    if(pSrc1 == NULL)
        return pSrc2;
    if(pSrc2 == NULL)
        return pSrc1;
    while(1){
        if(pSrc2->data < pSrc1->data){  /* if src2 < src1 */
            *ppDst = pSrc2;
            pSrc2 = *(ppDst = &pSrc2->next);
            if(pSrc2 == NULL){
                *ppDst = pSrc1;
                break;
            }
        } else {                        /* src1 <= src2 */
            *ppDst = pSrc1;
            pSrc1 = *(ppDst = &pSrc1->next);
            if(pSrc1 == NULL){
                *ppDst = pSrc2;
                break;
            }
        }
    }
    return pDst;
}
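
A small driver for the code above might look like this (just a usage sketch, using the minimal node layout declared above):

#include <stdio.h>

int main(void)
{
    NODE nodes[8];
    int values[8] = {5, 3, 8, 1, 9, 2, 7, 4};
    NODE *pSorted;
    int i;
    for(i = 0; i < 8; i++){                 /* build an unsorted list */
        nodes[i].data = values[i];
        nodes[i].next = (i+1 < 8) ? &nodes[i+1] : NULL;
    }
    pSorted = SortList(&nodes[0]);          /* sort it */
    for( ; pSorted != NULL; pSorted = pSorted->next)
        printf("%d ", pSorted->data);
    printf("\n");
    return 0;
}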

update - Visual Studio 2015 and later switched to using iterators instead of lists for the merge sort, which eliminates allocation issues such as having no default allocator, and since the merging is done via splice() all on the same list, it provides exception safety: if a user compare throws an exception, the list is reordered, but all the nodes are still there (assuming splice itself never throws). VS2015 also switched to a top down merge sort, although a bottom up merge sort based on iterators could have been used. I'm not sure why the switch to top down was made, as it is about 40% slower for a large list (well beyond cache size) with randomly scattered nodes.

Example iterator based code follows. Each iterator in the array points to the first node of a run of size 2 to the power i, or it is equal to list.end() to indicate an empty run. The end of a run is the first prior non-"empty" entry in the array, or a local iterator (all runs in the array are adjacent). All merges involve adjacent runs. Besides the list itself, the merge function takes three iterators: one to the first node of the left run, one to the first node of the right run (which is also the end of the left run), and one to the end of the right run (which may be the first node of the following run, or list.end()).

#include <list>

template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
                    typename std::list<T>::iterator li,
                    typename std::list<T>::iterator ri,
                    typename std::list<T>::iterator ei);

// iterator array size
#define ASZ 32

template <typename T>
void SortList(std::list<T> &ll)
{
    if (ll.size() < 2)                  // return if nothing to do
        return;
    typename std::list<T>::iterator ai[ASZ];    // array of iterators
    typename std::list<T>::iterator li;         // left   iterator
    typename std::list<T>::iterator ri;         // right  iterator
    typename std::list<T>::iterator ei;         // end    iterator
    size_t i;
    for (i = 0; i < ASZ; i++)           // "empty" array
        ai[i] = ll.end();
    // merge nodes into array
    for (ei = ll.begin(); ei != ll.end();) {
        ri = ei++;
        for (i = 0; (i < ASZ) && ai[i] != ll.end(); i++) {
            ri = Merge(ll, ai[i], ri, ei);
            ai[i] = ll.end();
        }
        if(i == ASZ)
            i--;
        ai[i] = ri;
    }
    // merge array into single list
    ei = ll.end();                              
    for(i = 0; (i < ASZ) && ai[i] == ei; i++);
    ri = ai[i++];
    while(1){
        for( ; (i < ASZ) && ai[i] == ei; i++);
        if (i == ASZ)
            break;
        li = ai[i++];
        ri = Merge(ll, li, ri, ei);
    }
}

template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
                             typename std::list<T>::iterator li,
                             typename std::list<T>::iterator ri,
                             typename std::list<T>::iterator ei)
{
    typename std::list<T>::iterator ni;         // first node of the merged run
    ni = (*ri < *li) ? ri : li;                 // the smaller of the two heads becomes the new first node
    while(1){
        if(*ri < *li){
            ll.splice(li, ll, ri++);
            if(ri == ei)
                return ni;
        } else {
            if(++li == ri)
                return ni;
        }
    }
}
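
For example, a minimal driver for the template version (just a usage sketch, not part of the VS2015 implementation):

#include <cstdlib>
#include <iostream>
#include <list>

int main()
{
    std::list<int> ll;
    for (int i = 0; i < 16; i++)        // build an unsorted list
        ll.push_back(std::rand() % 100);
    SortList(ll);                       // bottom up iterator based merge sort
    for (int x : ll)
        std::cout << x << ' ';
    std::cout << '\n';
}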
rcgldr
  • That explains it. I understand that more cores do not translate to more speedup due to memory bandwidth contention, and that memcpy is taboo. So it seems that the STL list sort stays serial unless the user implements a parallel version himself. – kakrafoon Feb 15 '16 at 08:37
  • @user2875665 - the issue is that a linked list is serial in nature. If a container with random access could be used instead, the data could be split up so that most of the early part of a sort algorithm would occur within the L1 / L2 cache of each core, reducing memory bandwidth overhead (a sketch of that workaround follows these comments). Assuming some type of merge sort, the advantage goes away once the size of the sorted runs significantly exceeds the local cache size, and the sort reverts to a memory bound process against a combination of the L3 cache (common to all cores on many processors) and main memory. – rcgldr Feb 15 '16 at 09:13
  • thanks for spelling out the issues with list (and problem being too large for merge to sit in cache) so clearly. – kakrafoon Feb 15 '16 at 10:10
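
To illustrate the comment above: if copying the data is acceptable, one workaround is to move the values into a random access container, sort that in parallel, and copy back. A rough sketch using __gnu_parallel::sort from the question (the helper name is just illustrative):

#include <algorithm>
#include <list>
#include <parallel/algorithm>
#include <vector>

// Copy the list into a vector, sort the vector in parallel, then write the
// sorted values back. Costs two extra passes of copying, but the sort itself
// runs on a cache-friendly random access container.
template <typename T>
void sort_via_vector(std::list<T> &l)
{
    std::vector<T> v(l.begin(), l.end());
    __gnu_parallel::sort(v.begin(), v.end());
    std::copy(v.begin(), v.end(), l.begin());
}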

Since C++17 you may consider the standard parallel algorithms, e.g. std::sort(std::execution::par, ...).
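
For example (a minimal sketch; note that std::sort requires random access iterators, so for a std::list the data would typically be copied into a vector first):

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> v = {5, 3, 9, 1, 7};
    std::sort(std::execution::par, v.begin(), v.end());    // parallel sort (C++17)
}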

Sergei Krivonos