9

I have a dataframe a pandas dataframe with the following columns:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

Which I am hoping to sort primarily by column 'two', then by column 'one'. For the secondary sort, I would like to use a custom sorting rule that will sort column 'one' by the alphabetic character [A-Z] and then the trailing numeric number [0-100]. So, the outcome of the sort would be:

one two
 A1   1
 B1   1
 A2   1
 A1   2
 B1   2
 A2   2

I have sorted a list of strings similar to column 'one' before using a sorting rule like so:

def custom_sort(value):
    return (value[0], int(value[1:]))

my_list.sort(key=custom_sort)

If I try to apply this rule via a pandas sort, I run into a number of issues including:

  1. The pandas DataFrame.sort_values() function accepts a key for sorting like the sort() function, but the key function should be vectorized (per the pandas documentation). If I try to apply the sorting key to only column 'one', I get the error "TypeError: cannot convert the series to <class 'int'>"
  2. When you use the pandas DataFrame.sort_values() method, it applies the sort key to all columns you pass in. This will not work since I want to sort first by the column 'two' using a native numerical sort.

How would I go about sorting the DataFrame as mentioned above?

user11058068
  • 153
  • 1
  • 6

4 Answers4

3

You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
       .sort_values(['two', 'rhs', 'lhs'])
       .drop(columns=['lhs', 'rhs']))
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
0  A2    2
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • Thanks for your help. This answer worked, you just need to switch the order of the lhs and rhs in the sort_values() call to get the expected sort order (looks like my wording in the question was a bit off though). – user11058068 Oct 14 '20 at 15:10
  • Is there a way to sort the columns using "natural sort" (e.g. via the `natsort` package) and avoid creating temporary columns? – Agostino Oct 19 '21 at 13:15
  • @Agostino to do with natsort, see answer [here](https://stackoverflow.com/a/63890954/8285811) or [there](https://stackoverflow.com/a/62432033/8285811) – Akaisteph7 Jul 08 '22 at 21:40
1

use str.extract to create some temp columns that are based off 1) alphabet (a-zA-Z]+) and 2) Number (\d+) and then drop them:

df = pd.DataFrame([
    ['A2', 2],
    ['B1', 1],
    ['A1', 2],
    ['A2', 1],
    ['B1', 2],
    ['A1', 1]], 
  columns=['one','two'])

df['one-letter'] = df['one'].str.extract('([a-zA-Z]+)')
df['one-number'] = df['one'].str.extract('(\d+)')
df = df.sort_values(['two', 'one-number', 'one-letter']).drop(['one-letter', 'one-number'], axis=1)
df
Out[38]: 
  one  two
5  A1    1
1  B1    1
3  A2    1
2  A1    2
4  B1    2
David Erickson
  • 16,433
  • 2
  • 19
  • 35
1

One of the solutions is to make both columns pd.Categorical and pass the expected order as an argument "categories".

But I have some requirements where I cannot coerce unknown\unexpected values and unfortunately that is what pd.Categorical is doing. Also None is not supported as a category and coerced automatically.

So my solution was to use a key to sort on multiple columns with a custom sorting order:

import pandas as pd


df = pd.DataFrame([
    [A2, 2],
    [B1, 1],
    [A1, 2],
    [A2, 1],
    [B1, 2],
    [A1, 1]], 
  columns=['one','two'])


def custom_sorting(col: pd.Series) -> pd.Series:
    """Series is input and ordered series is expected as output"""
    to_ret = col
    # apply custom sorting only to column one:
    if col.name == "one":
        custom_dict = {}
        # for example ensure that A2 is first, pass items in sorted order here:
        def custom_sort(value):
            return (value[0], int(value[1:]))

        ordered_items = list(col.unique())
        ordered_items.sort(key=custom_sort)
        # apply custom order first:
        for index, item in enumerate(ordered_items):
            custom_dict[item] = index
        to_ret = col.map(custom_dict)
    # default text sorting is about to be applied
    return to_ret


# pass two columns to be sorted
df.sort_values(
    by=["two", "one"],
    ascending=True,
    inplace=True,
    key=custom_sorting,
)

print(df)

Output:

5  A1    1
3  A2    1
1  B1    1
2  A1    2
0  A2    2
4  B1    2

Be aware that this solution can be slow.

Ievgen
  • 4,261
  • 7
  • 75
  • 124
0

With pandas >= 1.1.0 and natsort, you can also do this now:

import natsort

sorted_df = df.sort_values(["one", "two"], key=natsort.natsort_keygen())
Akaisteph7
  • 5,034
  • 2
  • 20
  • 43