9

Pandas 0.12.0

In the DataFrame below, why for example does it jumble the indexes? Look at the 4's, the indexes go from 1,15,6,7. What is the reasoning pandas is using to decide how to order, I would have suspected the indexes to remain sequential for an equal value.

mydf=pd.DataFrame(np.random.randint(1, 6, 20),columns=["stars"])
mydf.sort(['stars'], ascending=False)


     stars
19   5
14   5
1    4
15   4
6    4
7    4
4    3
12   3
18   3
8    2
2    2
9    2
10   2
11   2
13   2
16   2
5    1
3    1
17   1
0    1
foobarbecue
  • 6,780
  • 4
  • 28
  • 54
Brian Feeny
  • 441
  • 4
  • 14

2 Answers2

10

Actually, if you look into the source code of pandas DataFrame, you'll see that sort() is just a wrapper of sort_index() with different parameter, and, as @Jeff said in this question, sort_index() is prefered method to use.

The sort_index() method using numpy.argsort() with default kind=quicksort, if you're sorting only by one column. And quicksort() is not stable, that's why your index looks shuffled.

But you can pass kind parameter to sort_index() (one of 'mergesort', 'quicksort', 'heapsort'), so you can use stable sort ('mergesort') for your task:

>>> mydf.sort_index(by=['stars'], ascending=False, kind='mergesort')
    stars
17      5
11      5
6       5
1       5
19      4
18      4
15      4
14      4
7       4
5       4
2       4
10      3
8       3
4       3
16      2
12      2
9       2
3       2
13      1
0       1

sort_index() also using mergesort (or counting sort) if there're more that one column in by parameter, it's interesting, for example, you can do this:

>>> mydf.sort_index(by=['stars', 'stars'], ascending=False)
    stars
1       5
6       5
11      5
17      5
2       4
5       4
7       4
14      4
15      4
18      4
19      4
4       3
8       3
10      3
3       2
9       2
12      2
16      2
0       1
13      1

Now the sort is stable, but indexes are sorted ascending

Community
  • 1
  • 1
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
7

Pandas is using numpy's quicksort. Quicksort involves swapping positions of the items. It stops once they are in the requested order (which in this case, does not involve checking the indices because you didn't ask for that column to be checked). Quicksort is much more efficient than a naive sort algorithm such as bubble sort, which might be what you have in mind-- it would leave the individual numbers closer to their original order, but require more steps to do so.

foobarbecue
  • 6,780
  • 4
  • 28
  • 54
  • 1
    Thanks, do you know if this behavior has changed recently (the way the sort was implemented)? – Brian Feeny Oct 25 '13 at 04:54
  • 1
    Well, you can see the changes for yourself, but mostly they are minor and hardly any in 2013: https://github.com/pydata/pandas/blame/master/pandas/core/frame.py#L2591 If you switched versions you can go on github and look at the code differences between them, or review the changelist: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html. I doubt that sort algorithm has changed, ever. – foobarbecue Oct 25 '13 at 05:02
  • 1
    So just looking into this, there are some changes in the way pandas dataframe behaved between 0.11.0 and 0.12.0. When you call sort it calls sort_index. sort_index basically passes the work over to np.argsort(). In 0.11.0 it calls argsort with no specific kind of sort I can see, but in 0.12.0 it calls it with quicksort, using kind=quicksort. Its clear the authors were cautious of quicksort in earlier pandas. – Brian Feeny Oct 25 '13 at 16:39
  • Interesting. I saw a reference to lexsort in DataFrame.sort_index , and read that lexsort is a stable sort, and also read that mergesort is the only stable sort available in numpy, so I figured it was using mergesort. However, appears actually to use quicksort as you say so I've edited the answer. Note that adding kind=quicksort probably has no effect because quicksort is the default for both np.sort() and np.argsort() . Also, the fact that it's an indirect is kind of neither here nor there, probably shouldn't have had that in the answer. – foobarbecue Oct 25 '13 at 16:46
  • 1
    unfortunately you can't pass "kind" to df.sort() (perhaps you can to df.sort_index()), as the signature does not allow it. It would be nice if they allowed external parameters to pass through, so that np.argsort() could pick it up...... – Brian Feeny Oct 25 '13 at 20:46
  • 1
    That would be nice. Just stick *kwargs in there. Maybe this should be your first contribution to pandas. – foobarbecue Oct 25 '13 at 20:48
  • I'm sorry, but indirect sort has nothing to do with quicksort or bubblesort. Indirect sort is just sorting of references to the records (or, in case of numpy or pandas, sorting of indexes) - http://en.wikiversity.org/wiki/Sorting_data#Indirect_sort. You can use any algorithm you want for indirect sort, even bubblesort. The problem of OP arises because DataFrame uses quicksort by default, and quicksort is unstable. – Roman Pekar Oct 27 '13 at 07:46
  • @RomanPekar, er, you are correct. I've removed the bit about it being indirect. – foobarbecue Apr 25 '18 at 17:59