This is an addition to Sort Pandas DataFrame by numbers-as-string column but I think it deserves its own scope.
I can apply the solution from the linked topic to a single data frame. But once I combine multiple of these DFs, the previously established order is lost again.
I start off with two data structures that have these odd numbers, represented as strings.
data = [
dict(name = 'test1', index = '1' , status='fail'),
dict(name = 'test3', index = '3', status='pass'),
dict(name = 'test1', index = '11', status='pass'),
dict(name = 'test1', index = '1 2 14 56', status='fail'),
dict(name = 'test1', index = '33' , status='pass'),
dict(name = 'test3', index = '20', status='fail'),
dict(name = 'test1', index = '2' , status='fail'),
dict(name = 'test1', index = '22' , status='fail'),
dict(name = 'test3', index = '5:1:50', status='pass'),]
data1 = [
dict(name = 'test1', index = '1' , status='fail'),
dict(name = 'test3', index = '3', status='fail'),
dict(name = 'test1', index = '11', status='pass'),
dict(name = 'test1', index = '1 2 14 56', status='fail'),
dict(name = 'test1', index = '33' , status='pass'),
dict(name = 'test3', index = '20', status='pass'),
dict(name = 'test1', index = '2' , status='fail'),]
First I convert them into individual data frames.
df = pd.DataFrame(data)
df1 = pd.DataFrame(data1)
Now I have e.g. df
as:
name index status
0 test1 1 fail
1 test3 3 pass
2 test1 11 pass
3 test1 1 2 14 56 fail
4 test1 33 pass
5 test3 20 fail
6 test1 2 fail
7 test1 22 fail
8 test3 5:1:50 pass
Next I massage the two DFs to create a sorted multi-index that handles the "numeric strings" non-lexically. (For details please refer to the topic linked above)
dfs = dict()
for i,d in enumerate((df, df1)):
d = (d.assign(
_tmpIdx=d['index'].str.extract(r'([\d]+)').astype(int))
.sort_values(['name', '_tmpIdx'])
.drop('_tmpIdx', axis=1)
.set_index(['name', 'index'])
)
dfs[i] = d
Now e.g. df
looks like this
(note that the index column is sorted non-lexically, despite the values being strings):
status
name index
test1 1 fail
1 2 14 56 fail
2 fail
11 pass
22 fail
33 pass
test3 3 pass
5:1:50 pass
20 fail
I now concatenate the two DFs
summary = pd.concat(dfs.values(), axis=1, keys=dfs.keys())
Unfortunately, this resets the previous sorting to lexical:
0 1
status status
name index
test1 1 fail fail
1 2 14 56 fail fail
11 pass pass
2 fail fail
22 fail NaN
33 pass pass
test3 20 fail pass
3 pass fail
5:1:50 pass NaN
How to keep the sorting for the inner level of the overall index? Is there maybe a better way of accomplishing this?