So the problem with your code is that when you use multiple columns inside pandas.DataFrame.sort_values
, alongside the key
parameter, pandas takes each series in the order you've defined inside the parameter by=["col1", "col2"]
and calls that function passing the column values to it in the order they appeared before calling sort_values
.
For example, let's define a simple function that only prints the arguments it receives, and use it as our key
parameter:
import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted
# == Sample DataFrame ============================
df = pd.DataFrame(
{
'Col_Arg': ['First', 'Second', 'First', 'Third', 'Fourth', 'Tenth'],
'Col_Step': [20, 10, 10, 30, 20, 5]
}
)
# == Simple Key Function ============================
def print_values(x):
print(x)
return x
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=lambda x: print_values(x)
)
# Returns:
"""
0 1 First
1 2 Second
2 1 First
3 3 Third
4 4 Fourth
5 10 Tenth
Name: Col_Arg, dtype: object
0 20
1 10
2 10
3 30
4 20
5 5
Name: Col_Step, dtype: int64
"""
So basically pandas.DataFrame.sort_values
passes each column as a series, to your function, and it expects your function to do some transformation to make the column "sortable". Here's the parameter description from pandas documentation:
key
: callable
, optional
Description:
Apply the key
function to the values before sorting. This is similar to the key argument in the builtin sorted()
function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
In other words, if you want to sort both columns in the same pandas.DataFrame.sort_values
operation, you need to pass in a function that's able to convert 'Col_Arg'
to a numeric form, while returning 'Col_Step'
unmodified. Additionally, by using inFrame
in key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg']))
instead of passing x
, the key function will sort values based on the inFrame
indexes in the order they existed prior to calling the sort_values
function. Here's an example:
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=lambda x: print_values(np.argsort(index_natsorted(df["Col_Step"])))
)
# Prints:
"""
[3 1 2 5 4 0]
[3 1 2 5 4 0]
"""
So, the first time the key
function gets called it sorts the dataframe indexes using [3 1 2 5 4 0]
, then it applies the same order as it did before, but now all indexes have already been moved, so it ends up ruining the sort operation.
Quick Fix
As stated previously, the key
function takes each column value in the order they exist prior to the sort operation. So we need to create a function that converts 'Col_Arg'
values to numbers, instead of trying to sort inside the key function. There's a package called number-parser that does that for you. To install it, run the following code:
pip install number-parser
Then, you can create a function to use inside key
like so:
import numpy as np
import pandas as pd
from number_parser import parse_ordinal
def custom_sort(col: pd.Series) -> pd.Series:
if col.name == "Col_Arg":
return col.apply(parse_ordinal)
return col
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=custom_sort
)
# Returns:
"""
Col_Arg Col_Step
2 First 10
0 First 20
1 Second 10
3 Third 30
4 Fourth 20
5 Tenth 5
"""
Solution 2: another option would be to do something like this:
import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted
df.sort_values(
by="Col_Arg", key=lambda col: np.argsort(index_natsorted(col))
).groupby("Col_Arg", as_index=False).apply(
lambda grp: grp.sort_values("Col_Step")
).reset_index(
drop=True
)
# Returns:
"""
Col_Arg Col_Step
0 First 10
1 First 20
2 Fourth 20
3 Second 10
4 Tenth 5
5 Third 30
"""