How to sort and order a dataframe by the similarity of its rows

Question

df
         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway3     A         G           NA         NA            F
Pathway6     A         G           NA         NA            E
Pathway1     A         B           C          D             F
Pathway2     A         B           H          NA            F
Pathway4     A         B           C          D             E
Pathway5     A         B           H          NA            F

I would like to re-order the above dataframe (df) so that the pathways that share the greatest similarity in their protein pathways (i.e., the greatest similarity in columns 2:4) are sorted next to each other.

To be more clear, I would like the output to look like this:

newdf
         Beginning1 Protein2    Protein3    Protein4    Biomarker1
Pathway6     A         G           NA         NA            E
Pathway3     A         G           NA         NA            F
Pathway5     A         B           H          NA            E
Pathway2     A         B           H          NA            F
Pathway4     A         B           C          D             E
Pathway1     A         B           C          D             F

How would one go about doing that? I've tried variations including unique(df), but none have worked so far.

Also, while just ordering by the number of non-NA values would work for this dataset, the actual dataset I will be analyzing will have hundreds of pathways with the same number of steps.

Don't post pictures of data. Keep your data in a [reproducible format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — MrFlick, Jun 28 '17 at 14:41
Thank you! I am new to Stack Overflow and wasn't sure how to input my dataframe in the question. — Taylor Maurer, Jun 28 '17 at 14:43
Though it won't work in all cases, you could use the base R `order` function to sort the data: `df[with(df, order(Beginning1, Protein2, Protein3, Protein4)),]` for example. — lmo, Jun 28 '17 at 15:19
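
For reference, the question's data frame can be rebuilt in a copy-pasteable form and sorted with base R `order`. This is only a sketch: it assumes every column is a character vector (not a factor), and the name `sorted` is introduced here for illustration.

```r
# Rebuild the question's data frame in a reproducible form.
# Assumption: all columns are character vectors, with NA for missing steps.
df <- data.frame(
  Beginning1 = rep("A", 6),
  Protein2   = c("G", "G", "B", "B", "B", "B"),
  Protein3   = c(NA, NA, "C", "H", "C", "H"),
  Protein4   = c(NA, NA, "D", NA, "D", NA),
  Biomarker1 = c("F", "E", "F", "F", "E", "F"),
  row.names  = paste0("Pathway", c(3, 6, 1, 2, 4, 5)),
  stringsAsFactors = FALSE
)

# order() sorts ascending, places NA last by default, and keeps ties
# in their original order, so identical pathways end up adjacent.
sorted <- df[with(df, order(Beginning1, Protein2, Protein3, Protein4)), ]
rownames(sorted)
# "Pathway1" "Pathway4" "Pathway2" "Pathway5" "Pathway3" "Pathway6"
```

Because the sort is ascending, the B pathways come before the G pathways here, which is the reverse of the order shown in the desired output.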

Artem Sokolov · Answer 1 · 2017-06-28T15:13:08.270

Use arrange from the dplyr package. It will sort the data frame based on one or more columns. You can use desc to sort in descending order, as requested in your post:

> library(dplyr)  # makes desc() and the %>% pipe available
> dplyr::arrange( df, desc(Protein2), desc(Protein3), desc(Protein4) )

   Beginning1 Protein2 Protein3 Protein4 Biomarker1
 1          A        G     <NA>     <NA>          F
 2          A        G     <NA>     <NA>          E
 3          A        B        H     <NA>          F
 4          A        B        H     <NA>          F
 5          A        B        C        D          F
 6          A        B        C        D          E

Note that dplyr operations do not preserve rownames, as they follow Hadley Wickham's tidy data definition (in brief, rownames are undesirable because R expects them to be unique). You can use rownames_to_column from the tibble package to keep track of your pathway identifiers:

> tibble::rownames_to_column( df, "Pathway" ) %>% 
       dplyr::arrange( desc(Protein2), desc(Protein3), desc(Protein4 ) )

    Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
 1 Pathway3          A        G     <NA>     <NA>          F
 2 Pathway6          A        G     <NA>     <NA>          E
 3 Pathway2          A        B        H     <NA>          F
 4 Pathway5          A        B        H     <NA>          F
 5 Pathway1          A        B        C        D          F
 6 Pathway4          A        B        C        D          E

There's an equivalent tibble::column_to_rownames if you need to put the rownames back, but it is generally advisable not to.
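
If the rownames do need to come back, the round trip could look roughly like the sketch below. It assumes the dplyr and tibble packages are installed and that the columns are character vectors; the data frame is rebuilt from the question, and the name `sorted` is only illustrative.

```r
library(dplyr)   # arrange(), desc(), %>%
library(tibble)  # rownames_to_column(), column_to_rownames()

# The question's data frame, rebuilt with character columns (an assumption).
df <- data.frame(
  Beginning1 = rep("A", 6),
  Protein2   = c("G", "G", "B", "B", "B", "B"),
  Protein3   = c(NA, NA, "C", "H", "C", "H"),
  Protein4   = c(NA, NA, "D", NA, "D", NA),
  Biomarker1 = c("F", "E", "F", "F", "E", "F"),
  row.names  = paste0("Pathway", c(3, 6, 1, 2, 4, 5)),
  stringsAsFactors = FALSE
)

sorted <- df %>%
  rownames_to_column("Pathway") %>%                          # move rownames into a column
  arrange(desc(Protein2), desc(Protein3), desc(Protein4)) %>%  # sort; NA values go last
  column_to_rownames("Pathway")                              # restore the rownames

rownames(sorted)
```

The row order should match the second output shown above, with the pathway identifiers back in the rownames rather than in a `Pathway` column.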

Thank you! That works with my smaller dataset. I will try with my larger, more complicated dataset and get back to you! — Taylor Maurer, Jun 28 '17 at 15:21

BENY · Answer 2 · 2017-06-28T15:07:34.267

Try this. (By the way, in column Biomarker1 your input and output are mismatched; I corrected the input df based on my understanding to get your desired output.)

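df[is.na(df)] <- ''                   # treat missing steps as empty strings (assumes character columns)
df$ALL <- do.call(paste0, df[,2:4])   # build one key per pathway from columns 2:4
# rev() only reverses element positions, so sort ALL descending and Biomarker1 ascending explicitly
df <- df[order(df$ALL, df$Biomarker1, decreasing = c(TRUE, FALSE), method = "radix"),]
df[df==''] <- NA                      # restore the NAs
df$ALL <- NULL                        # drop the helper column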
         Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway6          A        G     <NA>     <NA>          E
Pathway3          A        G     <NA>     <NA>          F
Pathway2          A        B        H     <NA>          E
Pathway5          A        B        H     <NA>          F
Pathway4          A        B        C        D          E
Pathway1          A        B        C        D          F

Input

df

#             Beginning1 Protein2 Protein3 Protein4 Biomarker1
#    Pathway3          A        G     <NA>     <NA>          F
#    Pathway6          A        G     <NA>     <NA>          E
#    Pathway1          A        B        C        D          F
#    Pathway2          A        B        H     <NA>          E
#    Pathway4          A        B        C        D          E
#    Pathway5          A        B        H     <NA>          F

Both of your solutions work. They give slightly different outputs with my larger dataframe, but both outputs improve the dataframe's organization, which was what I was going for. — Taylor Maurer, Jun 28 '17 at 16:07
