Sort Pandas DataFrame by string column that represents (mostly) numbers?

Question

I have data similar to this.

data = [
dict(name = 'test1', index = '1' , status='fail'),
dict(name = 'test3', index = '3', status='pass'),
dict(name = 'test1', index = '11', status='pass'),
dict(name = 'test1', index = '1 2 14 56', status='fail'),
dict(name = 'test3', index = '20', status='fail'),
dict(name = 'test1', index = '2' , status='fail'),
dict(name = 'test3', index = '5:1:50', status='pass'),]

Note, that the type of the 'index' column is str. Since it has some irregular entries, I cannot easily convert this to a numeric type. (If this was possible I would not have this question.)

First I convert it into a DataFrame:

df = pd.DataFrame(data)

This gives me

    name    index     status
0   test1   1         fail
1   test3   3         pass
2   test1   11        pass
3   test1   1 2 14 56 fail
4   test3   20        fail
5   test1   2         fail
6   test3   5:1:50    pass

Next I sort it:

df1 = df.sort_values(by=['name','index'])

Since the 'index' column is 'str', it will be sorted lexically.

    name    index     status
0   test1   1         fail
3   test1   1 2 14 56 fail
2   test1   11        pass
5   test1   2         fail
4   test3   20        fail
1   test3   3         pass
6   test3   5:1:50    pass

What I actually want is this:

    name    index     status
0   test1   1         fail
5   test1   2         fail
2   test1   11        pass
3   test1   1 2 14 56 fail
1   test3   3         pass
4   test3   20        fail
6   test3   5:1:50    pass

The irregular values in row numbers 4 and 7 (DF indices 3 and 6) could also go to the beginning of each test group. The key point is, that the values of the 'index' column, that could be converted to a numerical representation, shall be sorted numerically. And preferably in-place. How?

rhug123 · Answer 1 · 2020-12-04T19:41:41.243

0

One possibility is to make a column that will give you the length of the index.

df['sort'] = df['index'].str.len()
df['sort2'] = df['index'].str[0]
df1 = df.sort_values(by=['name','sort','sort2'])
df1 = df1.drop(columns = ['sort','sort2'])

edited Dec 04 '20 at 19:41

answered Dec 03 '20 at 15:00

rhug123

7,893
1
9
24

I tried it and it works for the data set in the question. Unfortunately it will break if I add some more entries with number strings of the same length, e.g. add `22` and `33` to `test1`. – twil Dec 03 '20 at 15:48
I believe I fixed it. – rhug123 Dec 04 '20 at 19:41

Pierre D · Accepted Answer · 2020-12-03T17:53:56.627

0

This will sort by the name and a temporary column (__ix) that is the first integer found (consecutive digits) in each 'index' string:

Update: You can also use:

df = (
    df
    .assign(
        __ix=df['index'].str.extract(r'([0-9]+)').astype(int)
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)  # optional: remove the tmp column
    .reset_index(drop=True)  # optional: leaves the index scrambled
)

Original:

df = (
    df
    .assign(
        __ix=df['index']
        .apply(lambda s: int(re.match(r'\D*(\d+)', s).group(0)))
    )
    .sort_values(['name', '__ix'])
    .drop('__ix', axis=1)
    .reset_index(drop=True)
)

On your data (thanks for providing an easy reproducible example), first check what that __ix column is:

df['index'].apply(lambda s: int(re.match(r'\D*(\d+)', s).group(0)))
# out:
0     1
1     3
2    11
3     1
4    20
5     2
6     5

After sorting, your df becomes:

    name      index status
0  test1          1   fail
1  test1  1 2 14 56   fail
2  test1          2   fail
3  test1         11   pass
4  test3          3   pass
5  test3     5:1:50   pass
6  test3         20   fail

edited Dec 03 '20 at 17:53

answered Dec 03 '20 at 15:57

Pierre D

24,012
7
60
96

@Pierre_D Thanks for this generic solution. Since I asked for `in-place` I guess I have to accept it is a somewhat scary, convoluted expression :) But you did a great job in formatting it in a way it is easier to digest and breaking it down into its relevant parts for explanation. One question: why is the top level assignment taking the right hand side wrapped in parentheses? – twil Dec 03 '20 at 16:28
oh, generally speaking I prefer this style of [method chaining](https://tomaugspurger.github.io/method-chaining.html) rather than a bunch of assignments and mutations of the original. In the expression above, everything is done without side-effects (no in-place modifications) except the assignment itself back to `df`. You could assign to a different variable (`df2 = (...)`) and leave `df` pristine and unmodified. In terms of the formatting: putting the whole expression in parentheses allows for clean multiline splitting of it. – Pierre D Dec 03 '20 at 17:45
I also updated the expression to use `.str.extract()` instead of an `.apply(lambda ...)`, so that would probably speed up things quite a bit on large df. And note that the last two operations are optional cleanups. And of course, you're free to use anything instead of `__ix`; just a column name that doesn't exist in your df. – Pierre D Dec 03 '20 at 17:57
Sure, there is nothing wrong with method chaining. In shell command line, one-liners with chained pipes are what you use quite often. The thing is that you need to be able to make sense of your code also after some time. I am skipping through languages quite frequently and with changing focus. So for me Python is just another language. Admittedly a quite powerful one. So when it comes to features I use less frequently I try to do my future self a favour and make it understandable. Thats why I also don't like the guru-like one-liner answers without explanation here at SO. – twil Dec 04 '20 at 08:41

Sort Pandas DataFrame by string column that represents (mostly) numbers?

2 Answers2

Linked