
For as long as I can remember, the most popular approach to implementing std::list<>::sort() has been the classic Merge Sort algorithm implemented in bottom-up fashion (see also What makes the gcc std::list sort implementation so fast?).

I remember seeing someone aptly refer to this strategy as "onion chaining" approach.

At least that's the way it is in GCC's implementation of the C++ standard library (see, for example, here). And this is how it was in the old Dinkumware STL in MSVC's version of the standard library, as well as in all versions of MSVC all the way to VS2013.

However, the standard library supplied with VS2015 suddenly no longer follows this sorting strategy. The library shipped with VS2015 uses a rather straightforward recursive implementation of top-down Merge Sort. This strikes me as strange, since top-down approach requires access to the mid-point of the list in order to split it in half. Since std::list<> does not support random access, the only way to find that mid-point is to literally iterate through half of the list. Also, at the very beginning it is necessary to know the total number of elements in the list (which was not necessarily an O(1) operation before C++11).

Nevertheless, std::list<>::sort() in VS2015 does exactly that. Here's an excerpt from that implementation that locates the mid-point and performs the recursive calls:

...
iterator _Mid = _STD next(_First, _Size / 2);
_First = _Sort(_First, _Mid, _Pred, _Size / 2);
_Mid = _Sort(_Mid, _Last, _Pred, _Size - _Size / 2);
...

As you can see, they just nonchalantly use std::next to walk through the first half of the list and arrive at _Mid iterator.

What could be the reason behind this switch, I wonder? All I see is a seemingly obvious inefficiency of repetitive calls to std::next at each level of recursion. Naive logic says that this is slower. If they are willing to pay this kind of price, they probably expect to get something in return. What are they getting then? I don't immediately see this algorithm as having better cache behavior (compared to the original bottom-up approach). I don't immediately see it as behaving better on pre-sorted sequences.

Granted, since C++11 std::list<> is basically required to store its element count, which makes the above slightly more efficient, since we always know the element count in advance. But that still does not seem to be enough to justify the sequential scan on each level of recursion.

(Admittedly, I haven't tried to race the implementations against each other. Maybe there are some surprises there.)

AnT stands with Russia
    "which was not necessarily an O(1) operation before C++11" is irrelevant. They are writing it for their own implementation, which has O(1) `size()`. – T.C. Nov 16 '16 at 01:38
    @T.C.: Yes, but O(1) `size()` does not make much of a difference here. It is only useful once - at the very topmost level of recursion. Having O(1) `size()` alone is not enough to justify this algorithm. The primary issue I have with this is O(n) `std::next` at *each* level of recursion and that is not really related to O(1) `size()` at the very top. – AnT stands with Russia Nov 16 '16 at 01:46
  • You are perfectly right: this implementation cannot be optimal. Which only goes to prove that standard implementations cannot be assumed to be always optimal. Especially in standard C++ implementations there seems to be quite a bit of cruft. – cmaster - reinstate monica Nov 16 '16 at 09:35
    @cmaster: Your statement is just wrong. Note that the theoretical price of finding the midpoint is O(N), and we do it at O(log N) depths, so the total cost is O(N log N). Sorting was and is O(N log N) anyway, so the theoretical bound does not change. And for the practical performance, you need to account for real hardware. – MSalters Nov 16 '16 at 10:01
  • @MSalters The complexity is not changed, and I never said it was. However, by scanning twice up to the midpoint of the list, the algorithm loses a *constant factor* of time, making the algorithm suboptimal. If we were to go by complexity alone, we would have to use straight radix sort all the time because it's `O(n)`, which is a better complexity than the `O(n log(n))` that quicksort & co. achieve. Nevertheless, straight radix sort has such a high *constant factor* that it's slower than quicksort in all relevant cases, rendering straight radix sort useless. Never forget the constants! – cmaster - reinstate monica Nov 16 '16 at 10:29
    libc++'s `list::sort` uses the top-down approach too. – T.C. Nov 16 '16 at 10:43
  • @cmaster: Radix sort can't work for `std::list::sort` because that only requires `std::less`. As for the constant factor, that's making assumptions about the behavior of cache. The list walk seems to be the fastest way to get the first half of the list into cache. – MSalters Nov 16 '16 at 11:55
    @MSalters - list walk doesn't help if the nodes of a large list are randomly scattered throughout memory, in which case you get one node per cache line (see my answer for comparison of bottom up versus top down). – rcgldr Nov 16 '16 at 12:04
  • @MSalters Please read rcgldr's answer and then reevaluate your assertion that my "statement is just wrong". I lacked the proof that rcgldr unearthed (he definitely earned my +1 for that), but I can read code and assess its performance impact (which is what I did and what my statement was based on). So, please, be a little slower in the future when calling other people's statements false. – cmaster - reinstate monica Nov 16 '16 at 12:55
    Straight from the horse's mouth: _"I did that to avoid memory allocation and default constructing allocators."_ – [Stephan T. Lavavej](https://twitter.com/StephanTLavavej/status/798939566761656320) – sbi Nov 17 '16 at 11:15
  • @sbi - Thanks. I updated my answer. I didn't know who to contact other than P. J. Plauger since his is the only name shown in the copyrights for the STL files (at least all of them that I've read). – rcgldr Nov 18 '16 at 05:34
  • @AnT - The reason given was "to avoid memory allocation and default constructing allocators", which was accomplished by using iterators, and T.C. added to that "free basic exception safety", which was accomplished since the merging is done via splices that now only operate on the caller's list. However, there wasn't a need to switch to a top down strategy, since it turns out that a bottom up strategy also works with iterators (a small array of them) and the same merge via splice on the caller's list. I updated my answer to include example code for this (it's near the top of my answer now). – rcgldr Mar 05 '20 at 05:45
    @sbi - The switch to using iterators instead of an array of lists avoided allocation issues and also provided exception safety, but the switch to top down wasn't needed, as a bottom up approach can also be implemented using iterators and the same merge logic. I updated my answer with an explanation and example code to show how this could be done. This is a late comment, since I thought I had made this comment long ago. – rcgldr Mar 05 '20 at 11:33
    @rcgldr: You might tell that to STL, not me. I was only the messenger. – sbi Jul 11 '20 at 20:45
  • @sbi - Back at the time, I corresponded with P J Plauger of Dinkumware, the company that has/had been working with Microsoft on the templates that are based on 1994 or prior template code written by HP (look at the copyright at the end of the template source files). I do know that Dinkumware did not adopt the top down approach used in Visual Studio, but I don't know what they ended up doing. – rcgldr Jul 11 '20 at 22:21
  • @sbi - the last time I checked, Visual Studio 2019 was still using top down. One of the team members (I don't know who) claimed performance wasn't a goal for the templates (which is disputed by P J Plauger, and the prior HP team), and that for performance, it would be faster to copy the list to an array, sort the array, then create a sorted list from the array (seems like an excuse for a poor choice in top down). In my opinion, I think the people involved in the switch to iterators didn't realize that the prior bottom up approach could be adapted to work with iterators. – rcgldr Jul 11 '20 at 22:26
  • @rcgldr: You might tell that to STL, not me. I was only the messenger. – sbi Aug 09 '20 at 15:35
  • @sbi - I think I sent an email to someone at STL, but don't remember who. As commented above, P J Plauger, initially reverted back to the prior bottom up version and they were going to come up with an alternate fix. I don't know what happened there. I never got a response from STL. My impression is that it's low priority for Microsoft, and P J Plauger has an alternate solution by now. – rcgldr Aug 09 '20 at 16:46
    @rcgldr: "STL" refers to Stephan T. Lavavej, the guy who runs the std lib dev department at Microsoft, and who I was quoting. I don't think you can send an email "to someone at Stephan". – sbi Aug 20 '20 at 07:45
  • @sbi - I sent him a tweet on his twitter account back in April 24 2019. I never got a response. – rcgldr Aug 20 '20 at 16:55

2 Answers


Note: this answer has been updated to address the issues mentioned in the comments below and after the question. It makes the same switch from an array of lists to an array of iterators, while retaining the faster bottom-up merge sort algorithm and eliminating the small chance of stack overflow that recursion poses for the top-down merge sort algorithm.

Initially I assumed that Microsoft would not have switched to a less efficient top down merge sort when it switched to using iterators unless it was necessary, so I was looking for alternatives. It was only when I tried to analyze the issues (out of curiosity) that I realized that the original bottom up merge sort could be modified to work with iterators.

In @sbi's comment, he asked the author of the top down approach, Stephan T. Lavavej, why the change to iterators was made. Stephan's response was "to avoid memory allocation and default constructing allocators". VS2015 introduced non-default-constructible and stateful allocators, which presents an issue for the prior version's array of lists: each instance of a list allocates a dummy sentinel node, and a change would have been needed to handle the absence of a default allocator.

Lavavej's solution was to switch to using iterators to keep track of run boundaries within the original list instead of an internal array of lists. The merge logic was changed to take 3 iterator parameters: the start of the left run, the end of the left run (which is also the start of the right run), and the end of the right run. The merge process uses std::list::splice to move nodes within the original list during merge operations. This has the added benefit of being exception safe: if a caller's compare function throws an exception, the list will be re-ordered, but no data will be lost (assuming splice can't fail). With the prior scheme, some (or most) of the data would be in the internal array of lists if an exception occurred, and that data would be lost from the original list.

I changed bottom up merge sort to use an array of iterators instead of an array of lists, where array[i] is an iterator to the start of a sorted run with 2^i nodes, or it is empty (using std::list::end to indicate empty, since iterators can't be null). Similar to the top down approach, the array of iterators is only used to keep track of sorted run boundaries within the original linked list, with the same merge logic as top down that uses std::list::splice to move nodes within the original linked list.

A single scan of the list is done, building up sorted runs to the left of the current scan position according to the sorted run boundaries in the array, until all nodes are merged into the sorted runs. Then the sorted runs are merged, resulting in a sorted list.

For example, for a list with 7 nodes, after the scan:

2                       1           0          array index
run0->run0->run0->run0->run1->run1->run2->end

Then the 3 sorted runs are merged right to left via merge(left, right), so that the sort is stable.

If a linked list is large and the nodes are scattered, there will be a lot of cache misses, and top down will be about 40% to 50% slower than bottom up depending on the processor. Then again, if there's enough memory, it would usually be faster to move the list to an array or vector, sort the array or vector, then create a new list from the sorted array or vector.

Example C++ code:

template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
                    typename std::list<T>::iterator li,
                    typename std::list<T>::iterator ri,
                    typename std::list<T>::iterator ei);

// iterator array size
#define ASZ 32

template <typename T>
void SortList(std::list<T> &ll)
{
    if (ll.size() < 2)                  // return if nothing to do
        return;
    typename std::list<T>::iterator ai[ASZ]; // array of iterator (bgn lft)
    typename std::list<T>::iterator ri;      // right    iterator (end lft, bgn rgt)
    typename std::list<T>::iterator ei;      // end      iterator (end rgt)
    size_t i;
    for (i = 0; i < ASZ; i++)           // "clear" array
        ai[i] = ll.end();
    // merge nodes into array of runs
    for (ei = ll.begin(); ei != ll.end();) {
        ri = ei++;
        for (i = 0; (i < ASZ) && ai[i] != ll.end(); i++) {
            ri = Merge(ll, ai[i], ri, ei);
            ai[i] = ll.end();
        }
        if(i == ASZ)
            i--;
        ai[i] = ri;
    }
    // merge array of runs into single sorted list
    // ei = ll.end();                              
    for(i = 0; (i < ASZ) && ai[i] == ei; i++);
    ri = ai[i++];
    while(1){
        for( ; (i < ASZ) && ai[i] == ei; i++);
        if (i == ASZ)
            break;
        ri = Merge(ll, ai[i++], ri, ei);
    }
}

template <typename T>
typename std::list<T>::iterator Merge(std::list<T> &ll,
                             typename std::list<T>::iterator li,
                             typename std::list<T>::iterator ri,
                             typename std::list<T>::iterator ei)
{
    typename std::list<T>::iterator ni = (*ri < *li) ? ri : li;
    while(1){
        if(*ri < *li){
            ll.splice(li, ll, ri++);
            if(ri == ei)
                return ni;
        } else {
            if(++li == ri)
                return ni;
        }
    }
}

Example replacement code for VS2019's std::list::sort(), in the include file <list>. The merge logic was made into a separate internal function, since it's now used in two places. The call to _Sort from std::list::sort() is _Sort(begin(), end(), _Pred, this->_Mysize());, where _Pred is the compare function (defaults to std::less<>()).

private:
    template <class _Pr2>
    iterator _Merge(_Pr2 _Pred, iterator _First, iterator _Mid, iterator _Last){
        iterator _Newfirst = _First;
        for (bool _Initial_loop = true;;
            _Initial_loop       = false) { // [_First, _Mid) and [_Mid, _Last) are sorted and non-empty
            if (_DEBUG_LT_PRED(_Pred, *_Mid, *_First)) { // consume _Mid
                if (_Initial_loop) {
                    _Newfirst = _Mid; // update return value
                }
                splice(_First, *this, _Mid++);
                if (_Mid == _Last) {
                    return _Newfirst; // exhausted [_Mid, _Last); done
                }
            }
            else { // consume _First
                ++_First;
                if (_First == _Mid) {
                    return _Newfirst; // exhausted [_First, _Mid); done
                }
            }
        }
    }

    template <class _Pr2>
    void _Sort(iterator _First, iterator _Last, _Pr2 _Pred,
        size_type _Size) { // order [_First, _Last), using _Pred, return new first
                           // _Size must be distance from _First to _Last
        if (_Size < 2) {
            return;        // nothing to do
        }
        const size_t _ASZ = 32;         // array size
        iterator _Ai[_ASZ];             // array of   iterator to run (bgn lft)
        iterator _Mi;                   // middle     iterator to run (end lft, bgn rgt)
        iterator _Li;                   // last (end) iterator to run (end rgt)
        size_t _I;                      // index to _Ai
        for (_I = 0; _I < _ASZ; _I++)   // "empty" array
            _Ai[_I] = _Last;            //   _Ai[] == _Last => empty entry
        // merge nodes into array of runs
        for (_Li = _First; _Li != _Last;) {
            _Mi = _Li++;
            for (_I = 0; (_I < _ASZ) && _Ai[_I] != _Last; _I++) {
                _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I], _Mi, _Li);
                _Ai[_I] = _Last;
            }
            if (_I == _ASZ)
                _I--;
            _Ai[_I] = _Mi;
        }
        // merge array of runs into single sorted list
        for (_I = 0; _I < _ASZ && _Ai[_I] == _Last; _I++);
        _Mi = _Ai[_I++];
        while (1) {
            for (; _I < _ASZ && _Ai[_I] == _Last; _I++);
            if (_I == _ASZ)
                break;
            _Mi = _Merge(_Pass_fn(_Pred), _Ai[_I++], _Mi, _Last);
        }
    }

The remainder of this answer is historical, and only left for the historical comments, otherwise it is no longer relevant.


I was able to reproduce the issue (old sort fails to compile, new one works) based on a demo from @IgorTandetnik:

#include <iostream>
#include <list>
#include <memory>

template <typename T>
class MyAlloc : public std::allocator<T> {
public:
    MyAlloc(T) {}  // suppress default constructor
    
    template <typename U>
    MyAlloc(const MyAlloc<U>& other) : std::allocator<T>(other) {}
    
    template< class U > struct rebind { typedef MyAlloc<U> other; };
};

int main()
{
    std::list<int, MyAlloc<int>> l(MyAlloc<int>(0));
    l.push_back(3);
    l.push_back(0);
    l.push_back(2);
    l.push_back(1);
    l.sort();
    return 0;
}

I noticed this change back in July, 2016 and emailed P.J. Plauger about this change on August 1, 2016. A snippet of his reply:

Interestingly enough, our change log doesn't reflect this change. That probably means it was "suggested" by one of our larger customers and got by me on the code review. All I know now is that the change came in around the autumn of 2015. When I reviewed the code, the first thing that struck me was the line:

    iterator _Mid = _STD next(_First, _Size / 2);

which, of course, can take a very long time for a large list.

The code looks a bit more elegant than what I wrote in early 1995(!), but definitely has worse time complexity. That version was modeled after the approach by Stepanov, Lee, and Musser in the original STL. They are seldom found to be wrong in their choice of algorithms.

I'm now reverting to our latest known good version of the original code.

I don't know if P.J. Plauger's reversion to the original code dealt with the new allocator issue, or if or how Microsoft interacts with Dinkumware.

For a comparison of the top down versus bottom up methods, I created a linked list with 4 million elements, each consisting of one 64 bit unsigned integer. I assumed I would end up with a doubly linked list of nearly sequentially ordered nodes (even though they were dynamically allocated), filled them with random numbers, and sorted them. The nodes don't move, only the linkage is changed, but now traversing the list accesses the nodes in random order. I then filled those randomly ordered nodes with another set of random numbers and sorted them again. I compared the 2015 top down approach with the prior bottom up approach, modified to match the other changes made for 2015 (sort() now calls sort() with a predicate compare function, rather than having two separate functions). These are the results. Update: I added a node pointer based version, and also noted the time for simply creating a vector from the list, sorting the vector, and copying back.

sequential nodes: 2015 version 1.6 seconds, prior version 1.5  seconds
random nodes:     2015 version 4.0 seconds, prior version 2.8  seconds
random nodes:                  node pointer based version 2.6  seconds
random nodes:    create vector from list, sort, copy back 1.25 seconds

For sequential nodes, the prior version is only a bit faster, but for random nodes, the prior version is 30% faster, and the node pointer version 35% faster, and creating a vector from the list, sorting the vector, then copying back is 69% faster.

Below is the first replacement code for std::list::sort() I used to compare the prior bottom up with small array (_Binlist[]) method versus VS2015's top down approach. I wanted the comparison to be fair, so I modified a copy of <list>.

    void sort()
        {   // order sequence, using operator<
        sort(less<>());
        }

    template<class _Pr2>
        void sort(_Pr2 _Pred)
        {   // order sequence, using _Pred
        if (2 > this->_Mysize())
            return;
        const size_t _MAXBINS = 25;
        _Myt _Templist, _Binlist[_MAXBINS];
        while (!empty())
            {
            // _Templist = next element
            _Templist._Splice_same(_Templist.begin(), *this, begin(),
                ++begin(), 1);
            // merge with array of ever larger bins
            size_t _Bin;
            for (_Bin = 0; _Bin < _MAXBINS && !_Binlist[_Bin].empty();
                ++_Bin)
                _Templist.merge(_Binlist[_Bin], _Pred);
            // don't go past end of array
            if (_Bin == _MAXBINS)
                _Bin--;
            // update bin with merged list, empty _Templist
            _Binlist[_Bin].swap(_Templist);
            }
            // merge bins back into caller's list
            for (size_t _Bin = 0; _Bin < _MAXBINS; _Bin++)
                if(!_Binlist[_Bin].empty())
                    this->merge(_Binlist[_Bin], _Pred);
        }

I made some minor changes. The original code kept track of the actual maximum bin in a variable named _Maxbin, but the overhead in the final merge is small enough that I removed the code associated with _Maxbin. During the array build, the original code's inner loop merged into a _Binlist[] element, followed by a swap into _Templist, which seemed pointless. I changed the inner loop to just merge into _Templist, only swapping once an empty _Binlist[] element is found.

Below is a node pointer based replacement for std::list::sort() I used for yet another comparison. This eliminates allocation related issues. If the compare can throw and an exception occurs, all the nodes in the array and temp list (pNode) would have to be appended back to the original list, or alternatively a compare exception could be treated as a less-than compare.

    void sort()
        {   // order sequence, using operator<
        sort(less<>());
        }

    template<class _Pr2>
        void sort(_Pr2 _Pred)
        {   // order sequence, using _Pred
        const size_t _NUMBINS = 25;
        _Nodeptr aList[_NUMBINS];           // array of lists
        _Nodeptr pNode;
        _Nodeptr pNext;
        _Nodeptr pPrev;
        if (this->size() < 2)               // return if nothing to do
            return;
        this->_Myhead()->_Prev->_Next = 0;  // set last node ->_Next = 0
        pNode = this->_Myhead()->_Next;     // set ptr to start of list
        size_t i;
        for (i = 0; i < _NUMBINS; i++)      // zero array
            aList[i] = 0;
        while (pNode != 0)                  // merge nodes into array
            {
            pNext = pNode->_Next;
            pNode->_Next = 0;
            for (i = 0; (i < _NUMBINS) && (aList[i] != 0); i++)
                {
                pNode = _MergeN(_Pred, aList[i], pNode);
                aList[i] = 0;
                }
            if (i == _NUMBINS)
                i--;
            aList[i] = pNode;
            pNode = pNext;
            }
        pNode = 0;                          // merge array into one list
        for (i = 0; i < _NUMBINS; i++)
            pNode = _MergeN(_Pred, aList[i], pNode);
        this->_Myhead()->_Next = pNode;     // update sentinel node links
        pPrev = this->_Myhead();            //  and _Prev pointers
        while (pNode)
            {
            pNode->_Prev = pPrev;
            pPrev = pNode;
            pNode = pNode->_Next;
            }
        pPrev->_Next = this->_Myhead();
        this->_Myhead()->_Prev = pPrev;
        }

    template<class _Pr2>
        _Nodeptr _MergeN(_Pr2 &_Pred, _Nodeptr pSrc1, _Nodeptr pSrc2)
        {
        _Nodeptr pDst = 0;          // destination head ptr
        _Nodeptr *ppDst = &pDst;    // ptr to head or prev->_Next
        if (pSrc1 == 0)
            return pSrc2;
        if (pSrc2 == 0)
            return pSrc1;
        while (1)
            {
            if (_DEBUG_LT_PRED(_Pred, pSrc2->_Myval, pSrc1->_Myval))
                {
                *ppDst = pSrc2;
                pSrc2 = *(ppDst = &pSrc2->_Next);
                if (pSrc2 == 0)
                    {
                    *ppDst = pSrc1;
                    break;
                    }
                }
            else
                {
                *ppDst = pSrc1;
                pSrc1 = *(ppDst = &pSrc1->_Next);
                if (pSrc1 == 0)
                    {
                    *ppDst = pSrc2;
                    break;
                    }
                }
            }
        return pDst;
        }
rcgldr
  • This implementation suffers from the same problem as GCC's: it doesn't properly handle non-default-constructible allocators or stateful ones. In Dinkumware's case, it also causes dynamic allocation because their `list` has a dynamically allocated sentinel node. The problem isn't unfixable, of course. – T.C. Nov 16 '16 at 10:41
  • Dinkumware's sentinel node is allocated on the heap (or, by the allocator), not embedded in the list object itself. – T.C. Nov 16 '16 at 12:09
  • `_Templist` and `_Binlist` are default-constructed. They are not necessarily default constructible (because their allocator need not be). – T.C. Nov 16 '16 at 16:10
  • No. Why don't you write a trivial `AllocatorWithoutADefaultConstructor` and try it? You'll see what I mean really soon. – T.C. Nov 16 '16 at 17:03
  • You pass a copy of the allocator to the constructor taking an allocator argument, of course. You simply can't default construct such a list. – T.C. Nov 17 '16 at 04:10
  • "else a compile time error occurs" That's pretty much the whole point... The old implementation simply doesn't compile if the allocator doesn't have a default constructor. – T.C. Nov 18 '16 at 03:53
  • You can always pass `this->get_allocator()`, but that only fixes `1/ (3*(_MAXBINS + 1))` of the problem. See my answer. – T.C. Nov 18 '16 at 05:03
    "Comparator function defines a strict weak ordering" is part of the contract. "Comparator function does not throw" is not. – T.C. May 15 '18 at 08:53
  • How does it compare to making a vector of the node-pointers, sorting, and then re-linking the list in the new order? – Deduplicator Apr 24 '19 at 10:29
  • @Deduplicator - Assuming a large list with randomly scattered nodes (so a cache miss on every node), it would be much faster to move the data into a vector, sort the vector, then create a new list, due to cache locality issues. To be exception safe, the original list would need to be kept until the sort completed, requiring O(n) data space. – rcgldr Apr 24 '19 at 14:17
  • Well, that would depend on swapping items being relatively cheap. And it adds to the container's requirements on the element type: https://en.cppreference.com/w/cpp/container/list – Deduplicator Apr 24 '19 at 14:28
  • @Deduplicator - I may not be understanding your last comment. Merge sort on an array or vector of data doesn't do any swapping. The two input runs are read sequentially and merging writes sequentially, which goes pretty fast with 3 available cache lines. Sorting an array or vector of pointers could end up getting a cache miss on every access if the amount of data is much larger than cache size. – rcgldr Apr 24 '19 at 14:47
  • Well, there are very few requirements for `std::list::element_type`. Sorting the pointers and re-linking the nodes adds the fewest additional ones (just having a consistent order). Moving to contiguous space, sorting, and moving back needs much more. – Deduplicator Apr 24 '19 at 15:45
  • @Deduplicator - I wasn't concerned about space requirements, only the time. If space is an issue, then using a vector of pointers requires O(n) pointer space, and it doesn't solve the issue of a large and scattered list where there's a cache miss on every access to a node in the list. – rcgldr Apr 24 '19 at 16:09
  • No, I meant requirements on the types behavior, disregarding size. Yes, not adding copyability and/or moveability means one cannot reduce the problem of bad cache-locality, but one hardly ever gets everything. – Deduplicator Apr 24 '19 at 16:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192330/discussion-between-rcgldr-and-deduplicator). – rcgldr Apr 24 '19 at 16:15

@sbi asked Stephan T. Lavavej, MSVC's standard library maintainer, who responded:

I did that to avoid memory allocation and default constructing allocators.

To this I'll add "free basic exception safety".

To elaborate: the pre-VS2015 implementation suffers from several defects:

  • _Myt _Templist, _Binlist[_MAXBINS]; creates a bunch of intermediate lists (_Myt is simply a typedef for the current instantiation of list; a less confusing spelling for that is, well, list) to hold the nodes during sorting, but these lists are default constructed, which leads to a multitude of problems:
    1. If the allocator used is not default constructible (and there is no requirement that allocators be default constructible), this simply won't compile, because the default constructor of list will attempt to default construct its allocator.
    2. If the allocator used is stateful, then a default-constructed allocator may not compare equal to this->get_allocator(), which means that the later splices and merges are technically undefined behavior and may well break in debug builds. ("Technically", because the nodes are all merged back in the end, so you don't actually deallocate with the wrong allocator if the function successfully completes.)
    3. Dinkumware's list uses a dynamically allocated sentinel node, which means that the above will perform _MAXBINS + 1 dynamic allocations. I doubt that many people expect sort to potentially throw bad_alloc. If the allocator is stateful, then these sentinel nodes may not be even allocated from the right place (see #2).
  • The code is not exception safe. In particular, the comparison is allowed to throw, and if it throws while there are elements in the intermediate lists, those elements are simply destroyed with the lists during stack unwinding. Users of sort don't expect the list to be sorted if sort throws an exception, of course, but they probably also don't expect the elements to go missing.
    • This interacts very poorly with #2 above, because now it's not just technical undefined behavior: the destructor of those intermediate lists will be deallocating and destroying the nodes spliced into them with the wrong allocator.

Are those defects fixable? Probably. #1 and #2 can be fixed by passing get_allocator() to the constructor of the lists:

 _Myt _Templist(get_allocator());
 _Myt _Binlist[_MAXBINS] = { _Myt(get_allocator()), _Myt(get_allocator()), 
                             _Myt(get_allocator()),  /* ... repeat _MAXBINS times */ };

The exception safety problem can be fixed by surrounding the loop with a try-catch that splices all the nodes in the intermediate lists back into *this without regard to order if an exception is thrown.

Fixing #3 is harder, because that means not using list at all as the holder of nodes, which probably requires a decent amount of refactoring, but it's doable.

The question is: is it worth jumping through all these hoops to improve the performance of a container that has reduced performance by design? After all, someone who really cares about performance probably won't be using list in the first place.

T.C.
  • @rcgldr Neither stateful nor non-default-constructible allocators were a thing standard-wise before C++11. C++03 required allocators to be default constructible and permitted implementations to assume that they are stateless. I don't understand your question w/r/t sentinel nodes. Most `list` operations don't require construction of temporary `list`s. – T.C. Nov 18 '16 at 09:19
  • If they are using a C++03-style default-constructible stateless allocator, nothing changes. If they are using a stateful or non-default-constructible one, they should know what they are doing. – T.C. Nov 18 '16 at 10:46
  • @rcgldr Now try a comparator that throws on the 42nd invocation. – T.C. Nov 19 '16 at 01:24
  • @rcgldr I did discuss exception safety issues in my answer, didn't I? – T.C. Nov 19 '16 at 01:27
    Yes, nobody said it's unimplementable. The question is whether it's worth the effort. – T.C. Nov 21 '16 at 00:41
  • @rcgldr The specification for `list::sort` has had a provision providing for what happens in case of exceptions since C++98. So it was very much a concern for "this part of the STL". – T.C. Oct 19 '17 at 22:29
  • I updated my answer to show a stand alone version of bottom up merge sort that uses a small array of iterators, which should eliminate allocator and exception issues. – rcgldr Apr 23 '19 at 14:39
  • What part of "nobody said it's unimplementable" is hard to understand? – T.C. Apr 23 '19 at 14:54
  • Since my answer has been updated to use the same change from using lists to using iterators, I deleted my now obsolete comments. "Worth the effort" - once I considered using iterators, switching the prior bottom up merge sort from an array of lists to an array of iterators only took a few hours to analyze the issues, and get it working. I suspect the switch to iterators and top down took just as long or longer, unless there was an existing example to base VS2015's change on. I don't think this was an issue of "effort", but instead recognizing that bottom up could be changed to use iterators. – rcgldr Apr 04 '20 at 16:29
  • Continuing, the reason I didn't originally consider iterators was due to the VS2015 change to top down, leading me to believe there was some issue with trying to change the existing bottom up algorithm to use iterators. It was only when I tried to analyze the switch to iterators myself that I realized there was a solution. – rcgldr Apr 04 '20 at 16:52