0

I have a pandas dataframe that I want to sort by one of the columns. Problem is, the data that needs to be sorted looks like this: "'Number 1' - Specimen 'Number 2'". I want to sort by 'Number 1' first, and then 'Number 2'.

An example:

import pandas as pd

df = pd.DataFrame({'Name': ['12001 - Specimen 10', '12000 - Specimen 1', '12000 - Specimen 10', 
                            '12000 - Specimen 2', '12000 - Specimen 5', '12001 - Specimen 1', 
                            '12001 - Specimen 2'], 
                   'Results': [2, 4, 2, 3, 10, 8, 2]})

df.sort_values('Name')
   Name                   Results
1  12000 - Specimen 1     4
2  12000 - Specimen 10    2
3  12000 - Specimen 2     3
4  12000 - Specimen 5     10
5  12001 - Specimen 1     8
0  12001 - Specimen 10    2
6  12001 - Specimen 2     2

This correctly sorts by the first number, but for the seconds number it puts 10 before 2.

I have seen two similar questions being posted, but in those cases they had the numbers and strings separated by an '_', and the answers suggested splitting them before sorting. I tried doing something similar, but it ended up only sorting by the second number.

sort dataFrame index containing string and number

Sort DataFrame index that has a string and number

If possible, I would prefer something that can be done purely in pandas with no other packages needed.

meakis
  • 3
  • 3
  • "12001 - Specimen 10" is a string, so essentially they are sorted alphabetically (or based on string's ASCII values). I would parse the string via a regular expression first and put the Specimen number (of type int) into another column. In this case, you can sort by Specimen easily. – Shao-Chuan Wang Aug 18 '21 at 06:59
  • Or you can pass the result extracted from regular expression as a `key` when you do `sort_values` so you don't have to add another column. – Henry Yik Aug 18 '21 at 07:28

2 Answers2

3

I convert the 'Name' column to a list of 2 numbers and then sort the column and then return the index.

index = df['Name'].apply(lambda x: list(map(int,x.split('- Specimen')))).argsort().to_list()
df.iloc[index]
0

What you could look at doing is splitting the Name column, indexing out the two columns we want, casting them as integers and then performing a sort. You can then take the index of this subframe and use that to order your original dataframe.

sorted_index = df.Name.str.split(' ',expand = True).loc[:,[0,3]].astype(int).sort_values([0,3]).index

sorted_df = df.loc[sorted_index,:]
Sparrow0hawk
  • 517
  • 4
  • 13