13

Given this DataFrame:

df = pd.DataFrame([['August', 2], ['July', 3], ['Sept', 6]], columns=['A', 'B'])

I would like to sort column A in this order: July, August, Sept. Is there some way to use a sort function like "sort_values" but pre-define the sort order by values?

rafaelc
  • 57,686
  • 15
  • 58
  • 82
sparrow
  • 10,794
  • 12
  • 54
  • 74

7 Answers7

16

Using Categorical

df.A=pd.Categorical(df.A,categories=['July', 'August', 'Sept'])
df=df.sort_values('A')
df
Out[310]: 
        A  B
1    July  3
0  August  2
2    Sept  6
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thanks, can you define the column categorically after it exists? – sparrow Oct 12 '18 at 17:50
  • @sparrow sorry what you mean after it exists – BENY Oct 12 '18 at 17:52
  • I mean, can you create the dataframe first and then define the categories for a column in the df? – sparrow Oct 12 '18 at 17:53
  • @sparrow yes , you can – BENY Oct 12 '18 at 17:53
  • `df.A=df.A.astype('category')` @sparrow ` pd.DataFrame(data=df.A, dtype='category')` – BENY Oct 12 '18 at 17:55
  • Thanks, how do you define what the categories are after changing the dtype? – sparrow Oct 12 '18 at 17:57
  • @sparrow unfortunately, you need to define it one by one in a list and passed to `categories` – BENY Oct 12 '18 at 18:00
  • Though it's important to note that if you want to customize sort order, you will have to build up a custom iterable either way. This is probably the most convenient method. Not to mention the fastest. – PMende Oct 12 '18 at 18:14
  • 1
    It worked when I passed them in this way: df['A'] = df['A'].astype(pd.api.types.CategoricalDtype(categories=['July','August','Sept'])), https://stackoverflow.com/questions/47537823/futurewarning-specifying-categories-or-ordered-in-astype-is-deprecated – sparrow Oct 12 '18 at 18:51
8

Define the order in a dictionary and sort according to it

sort_dict = {'July':0,'August':1,'Sept':2}
df.iloc[df['A'].map(sort_dict).sort_values().index]

Output

       A    B
1   July    3
0   August  2
2   Sept    6
Yuca
  • 6,010
  • 3
  • 22
  • 42
4

Are you opposed to using either complete month names or consistent abbreviations?

df = pd.DataFrame([['August', 2], ['July', 3], ['Sept', 6]], columns=['A', 'B'])

df

import calendar

df = df.replace({'Sept':'September'})

calendar.month_name[1:]

Output:

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

df['A'] = pd.Categorical(df.A, categories=calendar.month_name[1:], ordered=True)

df.sort_values('A')

Output:

           A  B
1       July  3
0     August  2
2  September  6

Or use calendar.month_abbr

calendar.month_abbr[1:]

Output:

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
3

since pandas version 1.1.0, sort_values support sort by key.

df = df.sort_values('A', key=lambda s: s.apply(['July', 'August', 'Sept'].index), ignore_index=True)
  • This is the canonical and most generalizable way to sort in a specific order. The requirement that the key function be vectorized is a bit ridiculous, and forces you to use this non-trivial `s.apply` construction, and there's no example provided in the docs either! – Praveen Jul 25 '23 at 19:56
1

You can assign your own values for sorting the column by, sort by those, then drop them:

df = pd.DataFrame([['August', 2], ['July', 3], ['Sept', 6]], columns=['A', 'B'])
value_map = {'August': 1, 'July': 0, 'Sept': 2}
def sort_by_key(df, col, value_map):
    df = df.assign(sort = lambda df: df[col].map(value_map))
    return df.sort_values('sort') \
             .drop('sort', axis='columns')

sort_by_key(df, 'A', value_map)

Results in:

        A  B
1    July  3
0  August  2
2    Sept  6
PMende
  • 5,171
  • 2
  • 19
  • 26
1

Temporarily convert the str month to datetime and sort

df = pd.DataFrame([['August', 2], ['July', 3], ['Sept', 6]], columns=['A', 'B'])
df['tmp'] = pd.to_datetime(df['A'].str[:3], format='%b').dt.month
df.sort_values(by = ['tmp']).drop('tmp', 1)


    A       B
1   July    3
0   August  2
2   Sept    6
Vaishali
  • 37,545
  • 5
  • 58
  • 86
1

I changed your 'Sept' to 'September' to keep it consistent with the other months' naming convention.

Then I made an ordered list of month names with pd.date_range.

Subdivided the list by the values you had (keeps the correct month-order).

Made a categorical using that sublist, and then sorted on those values

import pandas as pd


df = pd.DataFrame([['August', 2], ['July', 3], ['September', 6]], columns=['A', 'B'])

full_month_list = pd.date_range('2018-01-01','2019-01-01', freq='MS').strftime("%B").tolist()
partial_month_list = [x for x in month_list if x in df['A'].values]
df['A'] = pd.Categorical(df['A'], partial_month_list)

df.sort_values('A')

Results in:

    A           B
1   July        3
0   August      2
2   September   6
Declan
  • 574
  • 4
  • 15