Pandas Python groupby multiple columns - sort rows by values in column 2 based on column one having a specific value

Question

I have a data set where I have a few thousand records. The columns are acct number, code, data.

I need to sort the entire rows based on the code(column 2) if they have the same acct number(column1). so something like this.

 df = DataFrame({'Acct Number':[1011,1011,1011,1011,1012,1012,1012,1012,1012],
                 'Code':['GHI','JKL','ABC','DEF','DEF','MNO','JKL','GHI','ABC'],
                 'Data':['text3','text4','text1','text2','text2','text5','text4','text3','text1'])

I need to know how to make a dataframe that looks like this:

df = DataFrame({'Acct Number':[1011,1011,1011,1011,1012,1012,1012,1012,1012],
                'Code':['ABC','DEF','GHI','JKL','ABC','DEF','GHI','JKL','MNO'],
                'Data':['text1','text2','text3','text4','text1','text2','text3','text4','text5'])

THE CODE(column2) IS NOT IN ALPHABETICAL ORDER - They are random letters so I cannot use a basic sort to get those rows in order.

I am trying to look into groupby() with a set_index() Like in this question:

Question #66650446 - Pandas Groupby based on multiple columns

I am looking to try something like this.

import pandas as pd

THEFILE = 'original csv file'
THENEWFILE = 'the new csv file'

df = pd.read_csv('THEFILE')
df = df.set_index(['acct_number',df.groupby('code':'CFV','VFC','GTF','EDD','TGY'])
df = df.to_csv(THENEWFILE)

I am not sure the syntax in a groupby() to identify the specific order I want if it is set within an index that already sorts by a different column first.

All help is greatly appreciated!!

UPDATE: as of 3:28PM pacific time on 07/02/21 I found this question:

Question #23279238 - Custom Dictionary

I was trying to convey that the values in column2 are not in alphabetical order but I need to put them in a specific order. I think the custom dictionary will do it now that I am reading more @DavidErickson — LemNick, Jul 02 '21 at 22:37

score 2 · Accepted Answer · answered Jul 02 '21 at 22:35

The last question you linked made your question clearer. Create an Order column, sort by it, and drop it. You can determine the order in the dictionary.

df = pd.DataFrame({'Acct Number':[1011,1011,1011,1011,1012,1012,1012,1012,1012],
                 'Code':['GHI','JKL','ABC','DEF','DEF','MNO','JKL','GHI','ABC'],
                 'Data':['text3','text4','text1','text2','text2','text5','text4','text3','text1']})
dct = {'ABC' : 0, 'DEF' : 1, 'GHI' : 2, 'JKL' : 3, 'MNO' : 4}  
df['Order'] = df['Code'].map(dct)
df = df.sort_values(['Acct Number', 'Order']).drop('Order', axis=1)
df
Out[1]: 
   Acct Number Code   Data
2         1011  ABC  text1
3         1011  DEF  text2
0         1011  GHI  text3
1         1011  JKL  text4
8         1012  ABC  text1
4         1012  DEF  text2
7         1012  GHI  text3
6         1012  JKL  text4
5         1012  MNO  text5

score 0 · Answer 2 · answered Jul 02 '21 at 23:11

Assuming the Codes have an established order, and their ordered nature is going to need to be used or reused in future operations, creating a CategoricalDtype may be beneficial:

# Changed the order slightly so it's not alphabetical
cat_type = pd.CategoricalDtype(['DEF', 'ABC', 'GHI', 'MNO', 'JKL'],
                               ordered=True)
# Convert Code Column to new CategoricalDtype
df['Code'] = df['Code'].astype(cat_type)

The Categorical will now behave as expected with regards to any number of operations, not only will sort_values behave as expected without the need for an additional column:

df = df.sort_values(['Acct Number', 'Code'])

df:

   Acct Number Code   Data
3         1011  DEF  text2
2         1011  ABC  text1
0         1011  GHI  text3
1         1011  JKL  text4
4         1012  DEF  text2
8         1012  ABC  text1
7         1012  GHI  text3
5         1012  MNO  text5
6         1012  JKL  text4

So too will operations like min and max which will use the Categorical ordering to determine output:

df.groupby('Acct Number').agg({'Code': ['min', 'max']})

            Code     
             min  max
Acct Number          
1011         DEF  JKL
1012         DEF  JKL

Pandas Python groupby multiple columns - sort rows by values in column 2 based on column one having a specific value

2 Answers2