8

I'm looking for a way to sort pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function. I can convert it to list and apply a key to sorted function, but that will be slow. The other way seems something related to categorical index. I don't have a fixed number of rows so I don't know if categorical index will be applicable.

I have given an example case of what kind of data I want to sort:

Input DataFrame:

     clouds  fluff
0    {[}      1
1    >>>      2
2     {1      3
3    123      4
4  AAsda      5
5    aad      6

Output DataFrame:

     clouds  fluff
0    >>>      2
1    {[}      1
2     {1      3
3    123      4
4    aad      6
5  AAsda      5

The rule for sorting (priority):

  • First special characters (sort by ascii among themselves)

  • Next is by numbers

  • next is by lower case alphabets (lexicographical)

  • next is Capital case alphabets (lexicographical)

In plain python I'd do it like

from functools import cmp_to_key

def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1

Case

sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))

Answer:

['DD', 'AA', 'aa', 'dd']

How would you do it with Pandas?

user4157124
  • 2,809
  • 13
  • 27
  • 42
Vasantha Ganesh
  • 4,570
  • 3
  • 25
  • 33

4 Answers4

6

As of pandas 1.1.0, pandas.DataFrame.sort_values accepts an argument key with type callable.

So in this case we would use:

df.sort_values(by='clouds', key=kf)

where kf is the key function that operates on type Series. Accepts and returns Series.

Vasantha Ganesh
  • 4,570
  • 3
  • 25
  • 33
4

As of pandas 1.2.0, I did this

import numpy as np
import pandas as pd

df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])

# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
    }


def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))


new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df)  # DD, AA, aa, dd

Of course, you can do this too, but it may be difficult to understand,smile

new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df)  # DD, AA, aa, dd
wuiover
  • 41
  • 1
2

This might be useful, yet still not sure about special characters! can they actally be sorted!!

import pandas as pd

a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']

df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance,args = [int])

df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']), 
           df[df['lower'] == True].sort_values(by=['a']),
           df[df['upper'] == True].sort_values(by=['a'])])

print(df2)

   a    upper   lower   int
3   1   NaN     NaN     True
0   2   NaN     NaN     True
6   3   NaN     NaN     True
4   a   False   True    False
5   b   False   True    False
2   c   False   True    False
8   A   True    False   False
1   B   True    False   False
7   C   True    False   False

you can also do it in one step with creating new True False columns!

a = [2, 'B', 'c', 1, 'a', 'b',3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance,args = [int])].sort_values(by=['a']), 
           df[df['a'].str.islower() == True].sort_values(by=['a']),
           df[df['a'].str.isupper() == True].sort_values(by=['a'])])

    a
3   1
0   2
6   3
4   a
5   b
2   c
8   A
1   B
7   C
Khalil Al Hooti
  • 4,207
  • 5
  • 23
  • 40
2

This seems to work:

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """ Sort a dataframe from a column using the key """
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)

It passes tests:

def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
           DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
           DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
Peter
  • 12,274
  • 9
  • 71
  • 86