4

I have a dataframe A that has three columns: 'id', 'value' and 'date'. I was doing a groupby by id, and noticed something strange: if I do two successive sorts for rows with a given id, first by value and then date, the order of operations affects the order of rows. Notice the order of rows with index 42915 and 42916:

A.sort_values('value').sort_values('date')[A.sort_values('value').sort_values('date')['id'] == '0001249666']

id  value   date
42913   0001249666  113845.0    20130408
42914   0001249666  114597.0    20130430
42916   0001249666  125972.0    20140414
42915   0001249666  125971.0    20140414
42917   0001249666  136154.0    20150410
42918   0001249666  145551.0    20160407
42919   0001249666  152911.0    20170413

A[A['id'] == '0001249666'].sort_values('value').sort_values('date')

id  value   date
42913   0001249666  113845.0    20130408
42914   0001249666  114597.0    20130430
42915   0001249666  125971.0    20140414
42916   0001249666  125972.0    20140414
42917   0001249666  136154.0    20150410
42918   0001249666  145551.0    20160407
42919   0001249666  152911.0    20170413

If I however use .sort_values with two arguments, the order doesn't matter.

A.sort_values(['date','value'])[A.sort_values(['date','value'])['id'] == '0001249666']

id  value   date
42913   0001249666  113845.0    20130408
42914   0001249666  114597.0    20130430
42915   0001249666  125971.0    20140414
42916   0001249666  125972.0    20140414
42917   0001249666  136154.0    20150410
42918   0001249666  145551.0    20160407
42919   0001249666  152911.0    20170413

a[a['id'] == '0001249666'].sort_values(['date','value'])

id  value   date
42913   0001249666  113845.0    20130408
42914   0001249666  114597.0    20130430
42915   0001249666  125971.0    20140414
42916   0001249666  125972.0    20140414
42917   0001249666  136154.0    20150410
42918   0001249666  145551.0    20160407
42919   0001249666  152911.0    20170413

Now I know that what I'm doing isn't the smartest way of doing what I want, but I am really interested in understanding what could explain this behaviour? What am I implicitly assuming when I don't understand this behaviour.

ollipolli
  • 43
  • 1
  • 2

1 Answers1

3

The problem is that the sorting algorithm is not stable, so rows with the same sorted value (as it happens when you have multiple rows with the same date) may be disarranged with respect to their original relative order. This means that if you call sort your data frame twice, the first sorting will not have any (predictable) effect, since the second sorting will only sort with respect its own criterion. sort_values supports several sorting algorithms; as stated by the documentation, the default is quicksort, but you can pass kind='mergesort' to have stable sorting if you need it.

Obviously, when you pass two columns to sort by, the sorting algorithm takes them both into account at the same time, so the sorting will work as expected.

jdehesa
  • 58,456
  • 7
  • 77
  • 121
  • Okay, this was exactly the sort of answer I was looking for. Now I know what documentation to read... Thank you! – ollipolli Mar 05 '18 at 12:22