2

Description

Long story short, I need a way to sort a DataFrame by a specific column, given a specific function which is analagous to usage of "key" parameter in python built-in sorted() function. Yet there's no such "key" parameter in pd.DataFrame.sort_value() function.

The approach used for now

I have to create a new column to store the "scores" of a specific row, and delete it in the end. The problem of this approach is that the necessity to generate a column name which does not exists in the DataFrame, and it could be more troublesome when it comes to sorting by multiple columns.

I wonder if there's a more suitable way for such purpose, in which there's no need to come up with a new column name, just like using a sorted() function and specifying parameter "key" in it.

Update: I changed my implementation by using a new object instead of generating a new string beyond those in the columns to avoid collision, as shown in the code below.

Code

Here goes the example code. In this sample the DataFrame is needed to be sort according to the length of the data in row "snippet". Please don't make additional assumptions on the type of the objects in each rows of the specific column. The only thing given is the column itself and a function object/lambda expression (in this example: len) that takes each object in the column as input and produce a value, which is used for comparison.

def sort_table_by_key(self, ascending=True, key=len):
    """
    Sort the table inplace.
    """
    # column_tmp = "".join(self._table.columns)
    column_tmp = object() # Create a new object to avoid column name collision.
    # Calculate the scores of the objects. 
    self._table[column_tmp] = self._table["snippet"].apply(key)
    self._table.sort_values(by=column_tmp, ascending=ascending, inplace=True)
    del self._table[column_tmp]

1 Answers1

1

Now this is not implemented, check github issue 3942.

I think you need argsort and then select by iloc:

df = pd.DataFrame({
    'A': ['assdsd','sda','affd','asddsd','ffb','sdb','db','cf','d'],
    'B': list(range(9))
})
print (df)
        A  B
0  assdsd  0
1     sda  1
2    affd  2
3  asddsd  3
4     ffb  4
5     sdb  5
6      db  6
7      cf  7
8       d  8

def sort_table_by_length(column, ascending=True):
    if ascending:
        return df.iloc[df[column].str.len().argsort()]
    else:
        return df.iloc[df[column].str.len().argsort()[::-1]]

print (sort_table_by_length('A'))
        A  B
8       d  8
6      db  6
7      cf  7
1     sda  1
4     ffb  4
5     sdb  5
2    affd  2
0  assdsd  0
3  asddsd  3

print (sort_table_by_length('A', False))
        A  B
3  asddsd  3
0  assdsd  0
2    affd  2
5     sdb  5
4     ffb  4
1     sda  1
7      cf  7
6      db  6
8       d  8

How it working:

First get lengths to new Series:

print (df['A'].str.len())
0    6
1    3
2    4
3    6
4    3
5    3
6    2
7    2
8    1
Name: A, dtype: int64

Then get indices by sorted values by argmax, for descending ordering is used this solution:

print (df['A'].str.len().argsort())
0    8
1    6
2    7
3    1
4    4
5    5
6    2
7    0
8    3
Name: A, dtype: int64

Last change ordering by iloc:

print (df.iloc[df['A'].str.len().argsort()])
        A  B
8       d  8
6      db  6
7      cf  7
1     sda  1
4     ffb  4
5     sdb  5
2    affd  2
0  assdsd  0
3  asddsd  3
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I apologize for not stating my problems clearly. The problem is that I need a general way (instead of those suits only strings or 1d np.ndarray(which is my case)) to a sort DataFrame by a specific column, given a pre-defined function that takes each objects (they are not necessarily strings) in that column as input and produce a "score", according to which the DataFrame would be sorted (just like the "key" parameter in built-in sorted() function. I don't wan't to create a new column for the DataFrame to store those "scores" in case the names of columns collides. – Boris Polonsky Feb 23 '18 at 12:29
  • @BorisPolonsky - Unfortunately this is not implemented [yet](https://github.com/pandas-dev/pandas/issues/3942) in pandas. So now does not exist some general solution. Need create sorting numeric Series like `df[column].str.len()` and then use `argsort`. No new column is added. – jezrael Feb 23 '18 at 12:34
  • How about I create a Series that stores the scores outside the data, sort the series itself, and use it for .iloc to avoid column name collision. – Boris Polonsky Feb 23 '18 at 12:49
  • @BorisPolonsky - this do exacly my solution. – jezrael Feb 23 '18 at 12:50