4

I have a Python Pandas Dataframe, in which a column named status contains three kinds of possible values: ok, must read x more books, does not read any books yet, where x is an integer higher than 0.

I want to sort status values according to the order above.

Example:

  name    status
0 Paul    ok
1 Jean    must read 1 more books
2 Robert  must read 2 more books
3 John    does not read any book yet

I've found some interesting hints, using Pandas Categorical and map but I don't know how to deal with variable values modifying strings.

How can I achieve that?

jpp
  • 159,742
  • 34
  • 281
  • 339
Kfcaio
  • 442
  • 1
  • 8
  • 20
  • To be clear, do you need integers within your strings to be ordered, e.g. "must read 20 books" comes *after* "must read 3 books" ? – jpp Aug 01 '18 at 15:12
  • Yes, but not only that. The order have to be 'ok' first, 'must read 1 more books',...,'must read n more books' and, finally, 'does not read any book yet'. – Kfcaio Aug 01 '18 at 15:15

2 Answers2

10

Use:

a = df['status'].str.extract('(\d+)', expand=False).astype(float)

d = {'ok': a.max() + 1, 'does not read any book yet':-1}

df1 = df.iloc[(-df['status'].map(d).fillna(a)).argsort()]
print (df1)
     name                      status
0    Paul                          ok
2  Robert      must read 2 more books
1    Jean      must read 1 more books
3    John  does not read any book yet

Explanation:

  1. First extract integers by regex \d+
  2. Then dynamically create dictionary for map non numeric values
  3. Replace NaNs by fillna for numeric Series
  4. Get positions by argsort
  5. Select by iloc for sorted values
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 3
    your `'ok'` and `'does not read any book yet'` values should be switched, and the order of the rows reversed. – asongtoruin Aug 01 '18 at 15:24
2

You can use sorted with a custom function to calculate the indices which would be sort an array (much like numpy.argsort). Then feed to pd.DataFrame.iloc:

df = pd.DataFrame({'name': ['Paul', 'Jean', 'Robert', 'John'],
                   'status': ['ok', 'must read 20 more books',
                              'must read 3 more books', 'does not read any book yet']})

def sort_key(x):
    if x[1] == 'ok':
        return -1
    elif x[1] == 'does not read any book yet':
        return np.inf
    else:
        return int(x[1].split()[2])

idx = [idx for idx, _ in sorted(enumerate(df['status']), key=sort_key)]

df = df.iloc[idx, :]

print(df)

     name                      status
0    Paul                          ok
2  Robert      must read 3 more books
1    Jean     must read 20 more books
3    John  does not read any book yet
jpp
  • 159,742
  • 34
  • 281
  • 339