3

The goal of my question is to understand why this happens and if this is a defined behaviour. I need to know to design my unittests in a predictable way. I do not want or need to change that behaviour or work around it.

Here is the initial data on the left side complete and on the right side just all ID.eq(1) but the order is the same as you can see in the index and the val column.

|    |   ID | val   |      |    |   ID | val   |                               
|---:|-----:|:------|      |---:|-----:|:------|                               
|  0 |    1 | A     |      |  0 |    1 | A     |                               
|  1 |    2 | B     |      |  3 |    1 | x     |                               
|  2 |    9 | C     |      |  4 |    1 | R     |                               
|  3 |    1 | x     |      |  6 |    1 | G     |                               
|  4 |    1 | R     |      |  9 |    1 | a     |                               
|  5 |    4 | F     |      | 12 |    1 | d     |                               
|  6 |    1 | G     |      | 13 |    1 | e     |                               
|  7 |    9 | H     |                                     
|  8 |    4 | I     |                                     
|  9 |    1 | a     |                                     
| 10 |    2 | b     |                                     
| 11 |    9 | c     |                                     
| 12 |    1 | d     |                                     
| 13 |    1 | e     |                                     
| 14 |    4 | f     |                                     
| 15 |    2 | g     |                                     
| 16 |    9 | h     |                                     
| 17 |    9 | i     |                                     
| 18 |    4 | X     |                                     
| 19 |    5 | Y     |                                     

This right table is also the result I would expected when doing the following: When I sort by ID the order of the rows inside the subgroups (e.g. ID.eq(1)) is modified. Why is it so?

This is the unexpected result

|    |   ID | val   |
|---:|-----:|:------|
|  0 |    1 | A     |
| 13 |    1 | e     |
| 12 |    1 | d     |
|  6 |    1 | G     |
|  9 |    1 | a     |
|  3 |    1 | x     |
|  4 |    1 | R     |

This is a full MWE

#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame(
    {
        'ID': [1, 2, 9, 1, 1, 4, 1, 9, 4, 1,
               2, 9, 1, 1, 4, 2, 9, 9, 4, 5],
        'val': list('ABCxRFGHIabcdefghiXY')
    }
)
print(df.to_markdown())

# only the group "1"
print(df.loc[df.ID.eq(1)].to_markdown())

# sort by 'ID'
df = df.sort_values('ID')
# only the group "1" (after sorting)
print(df.loc[df.ID.eq(1)].to_markdown())
buhtz
  • 10,774
  • 18
  • 76
  • 149

1 Answers1

3

As explained in the sort_values documentation, the stability of the sort is not always guaranteed depending on the chosen algorithm:

kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
     Choice of sorting algorithm. See also :func:`numpy.sort` for more
     information. `mergesort` and `stable` are the only stable algorithms. For
     DataFrames, this option is only applied when sorting on a single
     column or label.

If you want to ensure using a stable sort:

df.sort_values('ID', kind='stable')

output:

   ID val
0   1   A
3   1   x
4   1   R
6   1   G
9   1   a
...
mozway
  • 194,879
  • 13
  • 39
  • 75