Pandas DateFrame.concat() screws up index sorting

Question

This is an addition to Sort Pandas DataFrame by numbers-as-string column but I think it deserves its own scope.

I can apply the solution from the linked topic to a single data frame. But once I combine multiple of these DFs, the previously established order is lost again.

I start off with two data structures that have these odd numbers, represented as strings.

data = [
dict(name = 'test1', index = '1' , status='fail'),
dict(name = 'test3', index = '3', status='pass'),
dict(name = 'test1', index = '11', status='pass'),
dict(name = 'test1', index = '1 2 14 56', status='fail'),
dict(name = 'test1', index = '33' , status='pass'),
dict(name = 'test3', index = '20', status='fail'),
dict(name = 'test1', index = '2' , status='fail'),
dict(name = 'test1', index = '22' , status='fail'),
dict(name = 'test3', index = '5:1:50', status='pass'),]

data1 = [
dict(name = 'test1', index = '1' , status='fail'),
dict(name = 'test3', index = '3', status='fail'),
dict(name = 'test1', index = '11', status='pass'),
dict(name = 'test1', index = '1 2 14 56', status='fail'),
dict(name = 'test1', index = '33' , status='pass'),
dict(name = 'test3', index = '20', status='pass'),
dict(name = 'test1', index = '2' , status='fail'),]

First I convert them into individual data frames.

df = pd.DataFrame(data)
df1 = pd.DataFrame(data1)

Now I have e.g. df as:

    name      index status
0  test1          1   fail
1  test3          3   pass
2  test1         11   pass
3  test1  1 2 14 56   fail
4  test1         33   pass
5  test3         20   fail
6  test1          2   fail
7  test1         22   fail
8  test3     5:1:50   pass

Next I massage the two DFs to create a sorted multi-index that handles the "numeric strings" non-lexically. (For details please refer to the topic linked above)

dfs = dict()
for i,d in enumerate((df, df1)):
    d = (d.assign(
          _tmpIdx=d['index'].str.extract(r'([\d]+)').astype(int))
         .sort_values(['name', '_tmpIdx'])
         .drop('_tmpIdx', axis=1)
         .set_index(['name', 'index'])
        )
     dfs[i] = d

Now e.g. df looks like this (note that the index column is sorted non-lexically, despite the values being strings):

                status
name  index           
test1 1           fail
      1 2 14 56   fail
      2           fail
      11          pass
      22          fail
      33          pass
test3 3           pass
      5:1:50      pass
      20          fail

I now concatenate the two DFs

summary = pd.concat(dfs.values(), axis=1, keys=dfs.keys())

Unfortunately, this resets the previous sorting to lexical:

                     0      1
                status status
name  index                  
test1 1           fail   fail
      1 2 14 56   fail   fail
      11          pass   pass
      2           fail   fail
      22          fail    NaN
      33          pass   pass
test3 20          fail   pass
      3           pass   fail
      5:1:50      pass    NaN

How to keep the sorting for the inner level of the overall index? Is there maybe a better way of accomplishing this?

Do you need `summary = pd.concat(dfs.values(), axis=1, keys=dfs.keys(), sort=False)` ? — jezrael, Dec 04 '20 at 10:54
BTW: this is already the default (`sort: bool, default False`) — twil, Dec 04 '20 at 12:35

score 0 · Accepted Answer · answered Dec 04 '20 at 16:12

I found a solution myself, which I want to share.

Instead of applying the custom sorting to each individual DateFrame prior to concatenating them, I rather do it on the result of the concatenation.

As above, I first set up a multi-index for for each of my raw DFs (same as in question). This is needed to not have these columns repeated in the concatenation result.

dfs = dict()
for i,d in enumerate((df, df1)):
    d = d.sort_values(['name','index']).set_index(['name','index'])
    dfs[i] = d

The result will still be sorted on index lexically this time. df, e.g. looks now like this

                status
name  index           
test1 1           fail
      1 2 14 56   fail
      11          pass
      2           fail
      22          fail
      33          pass
test3 20          fail
      3           pass
      5:1:50      pass

The concatenation remains unchanged:

s = pd.concat(dfs.values(), axis=1, keys=dfs.keys())

Now I first flatten the multi-index ...

s = s.reset_index()

    name      index      0      1
                    status status
0  test1          1   fail   fail
1  test1  1 2 14 56   fail   fail
2  test1         11   pass   pass
3  test1          2   fail   fail
4  test1         22   fail    NaN
5  test1         33   pass   pass
6  test3          2    NaN   fail
7  test3         20   fail   pass
8  test3          3   pass   fail
9  test3     5:1:50   pass    NaN

... and then apply the custom sorting and re-establish the multi-index:

s = (s.assign(_tmpIdx=s['index'].str.extract(r'([\d]+)').astype(int))
          .sort_values(['name', '_tmpIdx'])
          .set_index(['name', 'index'])
          .drop(['_tmpIdx'],axis=1)
      )

This gives me the result I want:

                     0      1
                status status
name  index                  
test1 1           fail   fail
      1 2 14 56   fail   fail
      2           fail   fail
      11          pass   pass
      22          fail    NaN
      33          pass   pass
test3 2            NaN   fail
      3           pass   fail
      5:1:50      pass    NaN
      20          fail   pass

Not sure if this is the most elegant way of doing this kind of thing. But it works! Of course I am open for improvements :)

Pandas DateFrame.concat() screws up index sorting

1 Answers1