3

I have following dataframe:

(Index)    sample    reads yeasts    
9          CO ref    10
10         CO raai   20
11         CO tus    30

I want to change the order of the columns based on sample, expected output:

(Index)    sample    reads yeasts    
9          CO ref    10
11         CO tus    30
10         CO raai   10

I'm not interested in the Index of the rows.

I've tried following code based on other stackoverflow/google posts:

df=df.reindex(["CO ref","CO tus","CO raai"])

This correctly changes the index, but all the other columns get value nan

I've also tried:

df.index=["CO ref","CO tus","CO raai"]  

This changes the index correctly but the other columns do not switch so it messes up the dataframe.

Also:

df["sample"].index=["CO ref","CO tus","CO raai"]   

But this does nothing.

How can I get this to work?

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
Robvh
  • 1,191
  • 1
  • 11
  • 22

2 Answers2

4

For reindex is necessary create index from sample column:

df=df.set_index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()

Or use ordered categorical:

cats = ["CO ref","CO tus","CO raai"]
df['sample'] = pd.CategoricalIndex(df['sample'], ordered=True, categories=cats)
df = df.sort_values('sample')
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • when executing ```df=fd.set_index('sample').reindex(["CO ref","CO tus","CO raai"]).reset_index()``` I get following error: ```TypeError: 'Int64Index' object is not callable```. Do you know how to fix this? – Robvh Jul 23 '19 at 09:56
  • @Robvh - Can you test with `[]` like `df=fd.set_index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()` ? – jezrael Jul 23 '19 at 09:57
  • I get the same error. I assume here that your ```fd.set_index``` is a spelling mistake and should be ```df.set_index```. – Robvh Jul 23 '19 at 09:59
  • @Robvh - I think not - [`DataFrame.set_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html) – jezrael Jul 23 '19 at 10:00
  • it is about the "fd" and "df". Do you see the mistake? Your second code does the trick. Thanks! However it is rather cumbersome to write this in 3 seperate lines as I would rather use a single line. I still get the error when executing your first code. Any idea how to fix it? – Robvh Jul 23 '19 at 10:04
  • 1
    @Robvh - Hard question. If use `df=df.set_index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()` still problem? – jezrael Jul 23 '19 at 10:07
  • 1
    I'm sorry, I made a mistake myself. I wrote ```df=df.index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()``` instead of ```df=df.set_index(['sample']).reindex(["CO ref","CO tus","CO raai"]).reset_index()```. Both codes work now! Thank you! – Robvh Jul 23 '19 at 10:23
0

The solution from jezrael is of course correct, and most likely the fastest. But since this is really just a question of restructuring your dataframe I'd like to show you how you can easily do that and at the same time let your procedure select which subset of your sorting column to use.

The following very simple function will let you specify both the subset and order of your dataframe:

# function to subset and order a pandas
# dataframe of a long format
def order_df(df_input, order_by, order):
    df_output=pd.DataFrame()
    for var in order:    
        df_append=df_input[df_input[order_by]==var].copy()
        df_output = pd.concat([df_output, df_append])
    return(df_output)

Here's an example using the iris dataset from plotly express. df['species'].unique() will show you the order of that column:

Output:

array(['setosa', 'versicolor', 'virginica'], dtype=object)

Now, running the following complete snippet with the function above will give you a new specified order. No need for categorical variables or tampering of the index.

Complete code with datasample:

# imports
import pandas as pd
import plotly.express as px

# data
df = px.data.iris()

# function to subset and order a pandas
# dataframe fo a long format
def order_df(df_input, order_by, order):
    df_output=pd.DataFrame()
    for var in order:    
        df_append=df_input[df_input[order_by]==var].copy()
        df_output = pd.concat([df_output, df_append])
    return(df_output)

# data subsets
df_new = order_df(df_input = df, order_by='species', order=['virginica', 'setosa', 'versicolor'])
df_new['species'].unique()

Output:

array(['virginica', 'setosa', 'versicolor'], dtype=object)
vestland
  • 55,229
  • 37
  • 187
  • 305