DataFrame sorting based on a function of multiple column values

Question

Using Python and pandas, I want to sort a DataFrame by a value computed from several of its columns.

Given:

from pandas import DataFrame
import pandas as pd

d = {'x':[2,3,1,4,5],
     'y':[5,4,3,2,1],
     'letter':['a','a','b','b','c']}

df = DataFrame(d)

df then looks like this:

df:
      letter    x    y
    0      a    2    5
    1      a    3    4
    2      b    1    3
    3      b    4    2
    4      c    5    1

I would like to have something like:

f = lambda x,y: x**2 + y**2
test = df.sort(f('x', 'y'))

This should order the complete dataframe with respect to the sum of the squared values of column 'x' and 'y' and give me:

test:
      letter    x    y
    2      b    1    3
    3      b    4    2
    1      a    3    4
    4      c    5    1
    0      a    2    5

Ascending or descending order does not matter. Is there a nice and simple way to do that? I could not yet find a solution.

score 33 · Answer 1 · answered Jul 29 '16 at 16:18

You can create a temporary column to use in sort and then drop it:

df.assign(f = df['x']**2 + df['y']**2).sort_values('f').drop('f', axis=1)
Out: 
  letter  x  y
2      b  1  3
3      b  4  2
1      a  3  4
4      c  5  1
0      a  2  5

ayhan

11

this seems to be the best way to go, but it sorta sucks... it would be way more elegant to pass a lambda function into `sort_values`, the same way you'd do that for python's native `sorted()` call – Alex Spangher Jun 29 '18 at 16:42
2

@AlexSpangher, looks like we still don't have this feature supported yet for now, 2020 Feb :-( – avocado Feb 07 '20 at 18:58
The advantage of python is that when it doesn't exist you can just [add the method](https://stackoverflow.com/a/62624996/1720199). – cglacet Jun 28 '20 at 16:18
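
A note for readers on newer versions: `sort_values` gained a `key` parameter in pandas 1.1, but it is applied to each sort column separately, so on its own it cannot combine 'x' and 'y'. The following is a minimal sketch of a workaround, assuming pandas 1.1 or later and the question's column names: the key function ignores the column it receives and returns a precomputed score. It relies on the score having the same length and row order as `df`, so treat it as a trick rather than an intended use of the parameter.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, 1, 4, 5],
                   'y': [5, 4, 3, 2, 1],
                   'letter': ['a', 'a', 'b', 'b', 'c']})

# Score computed from both columns in one vectorized expression.
score = df['x']**2 + df['y']**2

# `key` is called with the 'x' column; we ignore it and return the score.
# Assumption: pandas >= 1.1, where sort_values accepts `key`.
result = df.sort_values('x', key=lambda _col: score)
print(result)
```

No helper column is added to the result, which is what the comments above were asking for.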

andrewkittredge · Accepted Answer · 2021-05-23T16:38:05.507

15

df.loc[(df.x ** 2 + df.y ** 2).sort_values().index]

Adapted from the question "How to sort pandas dataframe by custom order on string index".

edited May 23 '21 at 16:38

answered Apr 15 '20 at 21:46


1

Thank you, this is a really nice solution! The index of the sorted data is used in combination with iloc. This is neat. No further column is needed. – Ohumeronen Apr 20 '20 at 13:48
3

That indeed looks like the correct approach. On the other hand, you should use `.loc` instead of `.iloc`, because this wouldn't work with most indexes (it will only work with indexes like `list(range(n))`). I'll add an alternative just in case. – cglacet Jun 28 '20 at 15:50
[Here](https://stackoverflow.com/a/62624996/1720199) is an alternative using `iloc` with `argsort`, which is very similar to this strategy. – cglacet Jun 28 '20 at 16:04
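
To make the `.loc` versus `.iloc` point concrete, here is a small sketch. The string index (`r0` to `r4`) is an assumption added purely for illustration; the question's frame uses the default integer index, where labels and positions happen to coincide.

```python
import pandas as pd

# Same data as the question, but with a string index (hypothetical labels).
df = pd.DataFrame({'x': [2, 3, 1, 4, 5],
                   'y': [5, 4, 3, 2, 1],
                   'letter': ['a', 'a', 'b', 'b', 'c']},
                  index=['r0', 'r1', 'r2', 'r3', 'r4'])

score = df.x**2 + df.y**2

# Label-based: sort_values().index holds labels, so pair it with .loc.
by_label = df.loc[score.sort_values().index]

# Position-based: argsort on the NumPy values gives integer positions,
# so pair it with .iloc.
by_position = df.iloc[score.to_numpy().argsort()]

print(by_label)
print(by_label.equals(by_position))
```

Both selections give the same row order here; the thing to avoid is passing labels from `sort_values().index` to `.iloc`, which only lines up when the index is `0..n-1`.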

Sandeep · Answer 3 · 2016-08-01T16:35:53.343

3

Have you tried creating a new column and then sorting on it? I cannot comment on the original post, so I am just posting my solution.

df['c'] = df.x**2 + df.y**2
df = df.sort_values('c')

edited Aug 01 '16 at 16:35

answered Jul 29 '16 at 16:14


1

The "problem" with this solution is that it actually creates another column which is not the exact goal here (input and output column should be the same). – cglacet Jun 28 '20 at 16:05

score 1 · Answer 4 · answered Jul 29 '16 at 16:18

import pandas as pd

d = {'one':[2,3,1,4,5],
     'two':[5,4,3,2,1],
     'letter':['a','a','b','b','c']}

df = pd.DataFrame(d)

#f = lambda x,y: x**2 + y**2
# Compute the sum of squares row by row.
# (.ix was removed in pandas 1.0; .loc with column labels replaces it.)
array = []
for i in range(len(df)):
    array.append(df.loc[i,'one']**2 + df.loc[i,'two']**2)
array = pd.DataFrame(array, columns = ['Sum of Squares'])
test = pd.concat([df,array],axis = 1, join = 'inner')
# sort_index(by=...) was removed as well; sort_values is the replacement.
test = test.sort_values(by = "Sum of Squares", ascending = True).drop('Sum of Squares',axis =1)

Just realized that you wanted this:

    letter  one  two
2      b    1    3
3      b    4    2
1      a    3    4
4      c    5    1
0      a    2    5

cglacet · Answer 5 · 2020-06-28T18:30:54.430

Another approach, similar to andrewkittredge's answer, is to use argsort, which returns the index permutation directly:

f = lambda r: r.x**2 + r.y**2
df.iloc[df.apply(f, axis=1).argsort()]

I think using argsort better translates the idea than a regular sort (we don't care about the value of this computation, only about the resulting indexes).

It could also be interesting to patch the DataFrame to add this functionality:

def apply_sort(self, *, key):
    return self.iloc[self.apply(key, axis=1).argsort()]

pd.DataFrame.apply_sort = apply_sort

We can then simply write:

>>> df.apply_sort(key=f)

   x  y letter
2  1  3      b
3  4  2      b
1  3  4      a
4  5  1      c
0  2  5      a

Since you do a row-wise apply here, wouldn't this trade away a fair bit of performance compared with the vectorized operation in andrewkittredge's method? Does sort vs. argsort offset these concerns? – Skyler Oct 08 '20 at 15:53
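
On the performance question, the two ideas can be combined. This is a sketch, not a benchmark: it keeps `argsort` with `iloc` but computes the score as one vectorized expression instead of a row-wise `apply`. The names `row_wise` and `vectorized` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'x': [2, 3, 1, 4, 5],
                   'y': [5, 4, 3, 2, 1],
                   'letter': ['a', 'a', 'b', 'b', 'c']})

# Row-wise version, as in the answer: f is called once per row.
f = lambda r: r.x**2 + r.y**2
row_wise = df.iloc[df.apply(f, axis=1).to_numpy().argsort()]

# Vectorized version: the score is computed over whole columns at once.
vectorized = df.iloc[(df.x**2 + df.y**2).to_numpy().argsort()]

print(row_wise.equals(vectorized))
```

Both produce the same ordering. On large frames the vectorized form would generally be expected to run faster, since it avoids a Python-level call per row, but that is worth measuring on your own data.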
