Sorting a pandas DataFrame by the order of a list

Question

I have a pandas DataFrame, df, whose columns represent taxonomic classification (e.g. Kingdom, Phylum, Class). I also have a list of taxonomic labels in the order I would like the DataFrame rows to follow.

The list looks something like this:

class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']

This list corresponds to the DataFrame column df['Class']. I would like to sort all the rows of the DataFrame to follow the order of the list, since df['Class'] is currently in a different order. What would be the best way to do this?

Alex Riley · Accepted Answer · 2014-10-05T15:13:04.150

22

You could make the Class column the index of the DataFrame:

df = df.set_index('Class')

and then use df.loc to reindex the DataFrame with class_list:

df.loc[class_list]

Minimal example:

>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6

>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
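If Class should remain an ordinary column afterwards, as the question's "whole dataframe" wording suggests, the steps above can be chained and the index reset at the end. This is only a sketch on the same toy data; the name `ordered` is made up for illustration, and it assumes every label in class_list is present in df['Class'].

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Gammaproteobacteria", "Bacteroidetes", "Negativicutes"],
    "Number": [3, 5, 6],
})
class_list = ["Bacteroidetes", "Negativicutes", "Gammaproteobacteria"]

# Index by Class, select rows in list order, then turn Class back into a column.
# Assumes every label in class_list exists in df["Class"].
ordered = df.set_index("Class").loc[class_list].reset_index()
print(ordered)
```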

edited Oct 05 '14 at 15:13

answered Oct 05 '14 at 14:06


4

For better generality, use `df = df.reindex(some_list)`, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html), for the reindexing step. While `DataFrame.loc[]` is primarily label based, it may also be used with a boolean array, as detailed [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). Therefore, if the indexes end up being boolean, and you try using `df = df.loc[[True,False]]` to re-index, you'll end up throwing out the second row. See https://stackoverflow.com/a/30010004/8508004. – Wayne Jul 10 '19 at 21:25
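To illustrate the reindex suggestion in the comment above, here is a small sketch of how `DataFrame.reindex` behaves when the order list contains a label the data lacks. The data and the names `reindexed`, `present` and `filtered` are invented for the example. Note that reindex needs unique index labels, so this assumes each class appears at most once.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Gammaproteobacteria", "Bacteroidetes"],
    "Number": [3, 5],
}).set_index("Class")

# "Negativicutes" is in the desired order but absent from the data.
class_list = ["Bacteroidetes", "Negativicutes", "Gammaproteobacteria"]

# reindex keeps the missing label as a row filled with NaN.
reindexed = df.reindex(class_list)
print(reindexed)

# Keeping only labels that exist in the index avoids the NaN row.
present = [c for c in class_list if c in df.index]
filtered = df.reindex(present)
print(filtered)
```

Whether the NaN row is wanted depends on the use case: it keeps a placeholder for every expected class, while the filtered version returns only what is actually in the data.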

score 8 · Answer 2 · edited Sep 26 '21 at 14:07

Alex's solution doesn't work if your original DataFrame does not contain all of the elements in the ordered list, i.e. if your input data at some point does not contain "Negativicutes", the df.loc lookup will fail (recent pandas versions raise a KeyError for missing labels). One way to get past this is to select the rows for each class in turn, append the resulting DataFrames to a list, and concatenate them at the end. For example:

ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']

df_list = []

for cls in ordered_classes:
    df_list.append(df[df['Class'] == cls])

ordered_df = pd.concat(df_list)
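A different technique that also tolerates missing classes, and repeated ones, is to sort on an ordered categorical built from the list. This is not what the answer above does; it is an alternative sketch, with invented data and a hypothetical helper column named `_order`. One caveat: any value of Class that is not in ordered_classes becomes NaN in the helper column and is sorted to the end.

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Gammaproteobacteria", "Bacteroidetes", "Gammaproteobacteria"],
    "Number": [3, 5, 7],
})
# "Negativicutes" does not occur in df; that is fine for this approach.
ordered_classes = ["Bacteroidetes", "Negativicutes", "Gammaproteobacteria"]

# Ordered categorical: sort order follows ordered_classes, not the alphabet.
order_key = pd.Categorical(df["Class"], categories=ordered_classes, ordered=True)

ordered_df = (
    df.assign(_order=order_key)                 # temporary helper column
      .sort_values("_order", kind="mergesort")  # stable sort keeps ties in original order
      .drop(columns="_order")
)
print(ordered_df)
```

This avoids building one sub-DataFrame per class, which may matter when the list of classes is long.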
