Sort pairs of columns in a dataframe, based on another customly ordered column

Question

I have searched through web but I could not find a solution to my problem. I have a data like below

df <- structure(list(V1 = c("ATP1A2", "CAPRIN1", "ATP1A1", "CBX3", 
"AUP1", "LARS2", "MTHFD1", "VDAC2", "PRKCSH", "ATP1B1", "B3GNT3", 
"", ""), V2 = c("ATP1A1", "ATP1A2", "ATP1B1", "AUP1", "B3GNT3", 
"CAPRIN1", "CAPRIN1", "CBX3", "", "", "", "", ""), V3 = c("220948_s_at", 
"203296_s_at", "201243_s_at", "220525_s_at", "204856_at", "200722_s_at", 
"200723_s_at", "200037_s_at", "", "", "", "", ""), V4 = c("LARS2", 
"MTHFD1", "PRKCSH", "PRKCSH", "VDAC2", "", "", "", "", "", "", 
"", ""), V5 = c("204016_at", "202309_at", "200707_at", "214080_x_at", 
"211662_s_at", "", "", "", "", "", "", "", "")), .Names = c("V1", 
"V2", "V3", "V4", "V5"), row.names = c(NA, -13L), class = "data.frame")

what I am trying to do is to sort the column 2 and 4 based on first column. however if I sort the second column, the third column will change according to the V2 and fifth column according to the V4.

Expected output is like below

       V1      V2          V3          V4          V5
1   ATP1A2    ATP1A2   203296_s_at      -           -
2  CAPRIN1    CAPRIN1  200722_s_at      -           - 
3      -      CAPRIN1  200723_s_at      -           - 
4   ATP1A1    ATP1A1   220948_s_at      -           -
5    CBX3      CBX3    200037_s_at      -           -
6     AUP1     AUP1    220525_s_at      -           - 
7    LARS2      -          -          LARS2    204016_at
8   MTHFD1      -           -         MTHFD1   202309_at
9    VDAC2      -           -         VDAC2    211662_s_at 
10   PRKCSH      -           -         PRKCSH   200707_at   
11      -         -           -         PRKCSH   214080_x_at                
12  ATP1B1     ATP1B1  201243_s_at        -        -                             
13  B3GNT3    B3GNT3    204856_at         -        -

I want to sort the V2 and V4 based on V1. Note that V3 corresponds to V2 and V5 corresponds to V4. for example if V2 changes V3 changes and so on.

I did try the following which did not help me

df2<- df[with(df, order(V1)), ]

I tried the following too but did not work

require(data.table)
df2 <- data.table(df, key="V1")

The following as well did not work

df2<- df[order(df$V1),]

Would seem easier working with separate vectors and then binding them in a dataframe. — Dominic Comtois, Mar 11 '15 at 23:36
@Dominic Comtois I will be very happy to see your solution :-) — , Mar 11 '15 at 23:40
Sorry if I'm being dense, but I'm having trouble resolving your example with your question. None of the columns are sorted in their own order. To start with, in your example, how does column V1 get sorted into that order? — Rick, Mar 12 '15 at 00:13
@Rick the V1 does not need to be sorted but other columns will be sorted based on V1. For example, Lets assume first element of V1 is "a" then we check the second column to find "a" and its corresponding ID in the V3 and put it in front of the "a" in the V1. the same for the other column etc — , Mar 12 '15 at 00:15
Ok but sorted based on V1 as in the "expected output" table, or on V1 as in the dput on top of the question? — Dominic Comtois, Mar 12 '15 at 00:19
@Dominic Comtois the raw data is the dput one. As I wrote I expect to have an output like the table one — , Mar 12 '15 at 00:20
Something doesn't add up -- in V1 as defined at the top, the empty string comes last. So empty strings in V2 would come last as well if using V1 as a basis for sorting... — Dominic Comtois, Mar 12 '15 at 00:27
@Dominic Comtois not exactly because sometimes some of the elements are repeated several times for example in V1 we only have one "a" but in V2 we have two "a" with different V3 elements and in V4 we have 4 "a" with different V5 elements. that is why we have some empty string in the V1 as well — , Mar 12 '15 at 00:35
Oh I think I see... you consider A2 and A4 as if they were 1 — Dominic Comtois, Mar 12 '15 at 00:35
Can someone please rewrite the title for clarity, my head is exploding. Does it mean **"independently sort the two columns"** or **"sort them based on some joint ordering"** or what? — smci, Mar 12 '15 at 02:44
@smci : hopefully it's better now. Otherwise there's always Tylenol. :^p — Dominic Comtois, Mar 12 '15 at 04:55

zx8754 · Answer 1 · 2015-03-12T08:55:24.663

This doesn't give expected output, but it makes more sense in my opinion:

#get order
df$rn <- 1:nrow(df)

#merge twice
x <- merge(subset(df,V1!="",select=1),
           subset(df,V2!="",select=c(2:3)),
           by.x="V1",by.y="V2",all=TRUE)
res <- merge(x,
             subset(df,V4!="",select=c(4:5)),
             by.x="V1",by.y="V4",all=TRUE)

#merge rownumber to order as `df`
res <- merge(subset(df,select=c(1,6)),res,by="V1")
res <- res[ order(res$rn),]

#add pretty colnames
colnames(res) <- c("Gene","RowNum","Probe1","Probe2")

#output:
res
#       Gene RowNum      Probe1      Probe2
# 2   ATP1A2      1 203296_s_at        <NA>
# 6  CAPRIN1      2 200723_s_at        <NA>
# 7  CAPRIN1      2 200722_s_at        <NA>
# 1   ATP1A1      3 220948_s_at        <NA>
# 8     CBX3      4 200037_s_at        <NA>
# 4     AUP1      5 220525_s_at        <NA>
# 9    LARS2      6        <NA>   204016_at
# 10  MTHFD1      7        <NA>   202309_at
# 13   VDAC2      8        <NA> 211662_s_at
# 11  PRKCSH      9        <NA>   200707_at
# 12  PRKCSH      9        <NA> 214080_x_at
# 3   ATP1B1     10 201243_s_at        <NA>
# 5   B3GNT3     11   204856_at        <NA>

In expected output, V1, V2, V4 are Gene names, which is in my output merged into one column V1=Gene column.

you removed the second and fourth columns

To get 2nd or 4th column:

require(dplyr)
res %>% filter(!is.na(res$Probe1)) %>% select(Gene) %>% distinct
res %>% filter(!is.na(res$Probe2)) %>% select(Gene) %>% distinct

thanks, I liked your solution. I will wait to see what others suggest — , Mar 12 '15 at 00:16
@Nemo how did you get to your initial dataset, and what do you intend to do with "your expected output" next? This might help us understand the problem better. — zx8754, Mar 12 '15 at 06:39
those genes are up regulated , what you did you sort the first one two and you removed the second and fourth columns. so I cannot use it now to check for GSEA, what are the second and the fourth ? they are simply affymetrix probes ID which are obtained by different method. I am checking whether I get the probes with different methods or not. if no, which genes give me different probes or which genes does not have probes in a specific database . I hope I was clear enough since I know you understand bioinformatics — , Mar 12 '15 at 08:16
thanks for your message, I get an error like > res <- res[order(res$rn),] Error in .Method(..., na.last = na.last, decreasing = decreasing) : argument 1 is not a vector — , Mar 12 '15 at 09:26

Dominic Comtois · Answer 2 · 2015-03-12T02:21:09.827

Ok, there's probably a more elegant solution to this, but there you go (note that I removed empty strings from V1. I left them in for V2 to V5 but they could also have been left out.):

V1 <- c("ATP1A2", "CAPRIN1", "ATP1A1", "CBX3", "AUP1", "LARS2", "MTHFD1", "VDAC2", "PRKCSH", "ATP1B1", "B3GNT3")
V2 <-  c("ATP1A1", "ATP1A2", "ATP1B1", "AUP1", "B3GNT3", "CAPRIN1", "CAPRIN1", "CBX3", "", "", "", "", "")
V3 <-  c("220948_s_at", "203296_s_at", "201243_s_at", "220525_s_at", "204856_at", "200722_s_at", "200723_s_at", "200037_s_at", "", "", "", "", "")
V4 <- c("LARS2", "MTHFD1", "PRKCSH", "PRKCSH", "VDAC2", "", "", "", "", "", "", "", "")
V5 <- c("204016_at", "202309_at", "200707_at", "214080_x_at", "211662_s_at", "", "", "", "", "", "", "", "")


V2.final <- c()
V3.final <- c()
V4.final <- c()
V5.final <- c()

for(i in seq_along(unique(V1))) {
  if(V1[i] %in% V2) {
    V2.final <- append(V2.final, V2[which(V2==V1[i])])
    V3.final <- append(V3.final, V3[which(V2==V1[i])])
    V4.final <- append(V4.final, rep("", length(which(V2==V1[i]))))
    V5.final <- append(V5.final, rep("", length(which(V2==V1[i]))))
  } else if(V1[i] %in% V4) {
    V2.final <- append(V2.final, rep("", length(which(V4==V1[i]))))
    V3.final <- append(V3.final, rep("", length(which(V4==V1[i]))))
    V4.final <- append(V4.final, V4[which(V4==V1[i])])
    V5.final <- append(V5.final, V5[which(V4==V1[i])])
  }
}

cbind(V2.final, V3.final, V4.final, V5.final)

      V2.final  V3.final      V4.final V5.final     
 [1,] "ATP1A2"  "203296_s_at" ""       ""           
 [2,] "CAPRIN1" "200722_s_at" ""       ""           
 [3,] "CAPRIN1" "200723_s_at" ""       ""           
 [4,] "ATP1A1"  "220948_s_at" ""       ""           
 [5,] "CBX3"    "200037_s_at" ""       ""           
 [6,] "AUP1"    "220525_s_at" ""       ""           
 [7,] ""        ""            "LARS2"  "204016_at"  
 [8,] ""        ""            "MTHFD1" "202309_at"  
 [9,] ""        ""            "VDAC2"  "211662_s_at"
[10,] ""        ""            "PRKCSH" "200707_at"  
[11,] ""        ""            "PRKCSH" "214080_x_at"
[12,] "ATP1B1"  "201243_s_at" ""       ""           
[13,] "B3GNT3"  "204856_at"   ""       ""

Replace "else if" by "if" if the same string can be in both V2 and V4 — Dominic Comtois, Mar 12 '15 at 01:04
thanks for your solution. However it is not a handy one but I liked it since you put some effort :-) Thanks again — , Mar 12 '15 at 09:27

cmbarbu · Accepted Answer · 2015-03-12T15:46:18.573

I agree with zx8754 what you want to do is a merge (join in sql) on V1 of (V2,V3) and (V4,V5), which can be done more simply if your data were initially in separated database in the format:

    df$RowNum <- 1:nrow(df) # row numbers to sort at the end

1) using data.table, potentially useful if you have huge tables

    library("data.table")
    dt <- as.data.table(df)

    # your data are essentially three different tables
    # so let's split it up, removing useless empty cells
    ref <- dt[which(dt$V1!=""),c("V1","RowNum"),with=FALSE]
          # with=FALSE necessary for the second argument to 
          #    be understood as column names in a data table
    setkey(ref,"V1") # the column used for the merges

    tab1 <- dt[which(dt$V2!=""),c("V2","V3"),with=FALSE]
    setkey(tab1,"V2")

    tab2 <- dt[which(dt$V4!=""),c("V4","V5"),with=FALSE]
    setkey(tab2,"V4")

    # merge tab1 to ref and tab2 to the product
    # using data.table formalism
    df3 <- tab2[tab1[ref,allow.cartesian=T],allow.cartesian=T]
             # allow.cartesio=T important to keep all in ref

    # and to get exactly the same output
    setkey(df3,"RowNum") # order df3 by RowNum
    df3 <- within(df3,{
                  V2 <- V1 <- V4 # make the columns V1 and V2
                  V4[is.na(V5)]<-NA # put back NA
                  V2[is.na(V3)]<-NA
    })
    setcolorder(df3,sort(names(df3))) # sort V1 to V5

2) using base R # same as above, we split df in three tables ref <- df[which(df$V1!=""),c("V1","RowNum")] tab1 <- df[which(df$V2!=""),c("V2","V3")] tab2 <- df[which(df$V4!=""),c("V4","V5")]

Then you merge:

    df2 <- merge(ref,tab1,by.x="V1",by.y="V2",all=TRUE)
    df2 <- merge(df2,tab2,by.x="V1",by.y="V4",all=TRUE)

After if you want exactly the same output:

    df2 <- df2[order(df2$RowNum),] # order by RowNum
    # make the V2 column
    df2$V2 <- df2$V1
    df2$V2[which(is.na(df2$V3))] <- NA

    # make the V4 column
    df2$V4 <- df2$V1
    df2$V4[which(is.na(df2$V5))] <- NA

    # order the columns as wanted
    df2 <- df2[,c("V1","V2","V3","V4","V5")]

thank you I liked your solution already. I will wait to see if there is anybody else making new strategy or not, if not then I accept your solution. Thanks again. If it is possible please comment on each line , this helps me to learn and understand what you did (of course if you can) — , Mar 12 '15 at 14:46
Added comments, let me know if anything is unclear. A good ressource on merges is this ressource: http://stackoverflow.com/a/9652931/1174052 — cmbarbu, Mar 12 '15 at 15:50
actually i added two more columns in the data then I made the "tab3" and then df4 <- tab3[tab2[tab1[ref,allow.cartesian=T],allow.cartesian=T],allow.cartesian=T] then setkey(df4,"RowNum") — , Mar 12 '15 at 18:16
then i did df4 <- within(df4,{ V2 <- V1 <- V4 <-V6 V4[is.na(V5)]<-NA V2[is.na(V3)]<-NA V6[is.na(V7)]<-NA }) setcolorder(df4,sort(names(df4))) do you think it is correct ? — , Mar 12 '15 at 18:17
It should be fine. Try it on a small sample and check carefully what you get on the final dataset. — cmbarbu, Mar 12 '15 at 20:32
I checked it and it was fine. Only problem was the last line "setcolorder(df4,sort(names(df4)))" in fact, the name of columns are not V1, V2,V3 etc , so when i do that, it messes up . I tried to fix it with d4<- df4[,c(7,6,5,4,3,2,1)] but it gave me an empty char — , Mar 12 '15 at 22:19
@Nemo this construct cannot be used with data table. You need to use setcolorder(df3,vectorOfColumnNamesInRightOrder) — cmbarbu, Mar 12 '15 at 22:22

Sort pairs of columns in a dataframe, based on another customly ordered column

3 Answers3