9

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.

The list looks something like this:

class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']

This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?

matsjoyce
  • 5,744
  • 6
  • 31
  • 38
Wes Field
  • 3,291
  • 6
  • 23
  • 26

2 Answers2

22

You could make the Class column your index column

df = df.set_index('Class')

and then use df.loc to reindex the DataFrame with class_list:

df.loc[class_list]

Minimal example:

>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6

>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
Alex Riley
  • 169,130
  • 45
  • 262
  • 238
  • 4
    For better generality, use `df = df.reindex(some_list)`, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html), for the reindexing step. While `DataFrame.loc[]` is primarily label based, it may also be used with a boolean array, as detailed [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). Therefore, if the indexes end up being boolean, and you try using `df = df.loc[[True,False]]` to re-index, you'll end up throwing out the second row. See https://stackoverflow.com/a/30010004/8508004. – Wayne Jul 10 '19 at 21:25
8

Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list i.e.: if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to append your df's in a list and concatenate them at the end. For example:

ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']

df_list = []

for i in ordered_classes:
   df_list.append(df[df['Class']==i])

ordered_df = pd.concat(df_list)
Prakash Dahal
  • 4,388
  • 2
  • 11
  • 25
jarvis
  • 983
  • 8
  • 5