Having trouble sorting the Cabin values of the Titanic Dataset properly

Question

So I have a series of Cabin values; the left is the index and the right column holds the Cabin values. After using the sort_values method, I was only able to partially sort the values.

x =  Cabin_Fare=Cabin_Fare.sort_values(['Cabin' ]) 

210      A31
186      A32
446      A34
1185     A34
1266     A34
807      A36
97       A 
24       A6 
175      A7 
1058     B10
738     B101
816     B102
1107     B11
330      B18
524      B18
171      B19
691      B20
660      D48
682      D49
626      D50
22       D56
783      D6 
276      D7 
628      D9 
430      E10
718      E101
304      E101
124      E101
461      E12
752      E121
1234     NaN
1252     NaN
1257     NaN
73       NaN
121      NaN

The problem I'm having　is despite being able to sort the cabin letters, I'm having trouble sorting by the numbers attached to the Cabin letters. So my desired output is

97       A 
24       A6 
175      A7 
210      A31
186      A32
446      A34
1185     A34
1266     A34
807      A36
1058     B10
1107     B11
330      B18
524      B18
171      B19
691      B20
738     B101
816     B102
........

1234     NaN
1252     NaN
1257     NaN
73       NaN
121      NaN

I'm not particular about the NaN values, but I would like them at the end of the series. The lone Cabin values such as the lone "A" could have a "0" added to it if necessary, but I want the letters without numbers attached to them, to be first on the list.

I was given a few ideas but It turns out this code(below), messes with the order of the letters. I want to retain the alphabetical order.

 x.reindex(x[x.notnull()].str[1:].replace('', 0).astype(int).sort_values().index)

Thanks.

possible dupe: http://stackoverflow.com/questions/29580978/naturally-sorting-pandas-dataframe — EdChum, Oct 25 '16 at 20:57

piRSquared · Accepted Answer · 2016-10-25T21:16:38.670

3

# setup regex for str.extract
# ?P<letter> tells pandas to make that a column with name 'letter'
regex = '(?P<letter>\D+)(?P<digit>\d*)'
# easy access to column names I'm making in extract step
cols = ['letter', 'digit']

# run extract.  will pull out letter and digit
split_df = df.Cabin.str.extract(regex, expand=True)
# make sure digit column is numeric and fill with 0
split_df['digit'] = pd.to_numeric(split_df['digit'], 'coerce').fillna(0)
# sort by cols gets us the right sort
split_df.sort_values(cols, inplace=True)
# use sorted split_df.index for a slice
df = df.ix[split_df.index]
df.head(20)

edited Oct 25 '16 at 21:16

answered Oct 25 '16 at 21:05

piRSquared

285,575
57
475
624

Wow this is very clever. Just a question on this line of code: `df = df.ix[split_df.index]` Is this an implicit way of sorting the original df based on the index of the sorted split_df? That's very clever. I assumed you were going to somehow concatenate the values of the two columns of split_df and then replace the original "Cabin" column with the concatenated column. Thank you. – Moondra Oct 26 '16 at 19:06
@moondra yes! It's a way to sort. – piRSquared Oct 26 '16 at 19:10
It seems to be working perfectly. I thoroughly checked the dataframe for possible sorting errors, and I didn't see any. Thank you. – Moondra Oct 26 '16 at 22:01

score 2 · Answer 2 · answered Oct 25 '16 at 22:02

2

You can easily split it into letter and numbers:

letter, numbers = cabin[0], cabin[1:]

answered Oct 25 '16 at 22:02

simon

2,561
16
26

might not be the case in OP's problem, but this is limited to the letter component being one character. but plus one, because it will certainly be quicker. You should show the entire answer and not just the clever solution to one part. – piRSquared Oct 25 '16 at 22:51

Having trouble sorting the Cabin values of the Titanic Dataset properly

2 Answers2