0

I have two data frames,

df1

 Identifier    GSE1028888  GSE1034555
     100002           0.1         0.2
     100003           0.3         0.4
     ......         .....       .....
     100007           0.9         1.1

df2

         V3            V2
     100002         XLX12         
     100003         ABorF        
     ......         .....       
     110000         GEF22

Now I want insert the V2 information into df1, such as

df3

 Identifier        New_V2    GSE1028888  GSE1034555
     100002         XLX12           0.1         0.2
     100003         ABorF           0.3         0.4
     100004            NA           0.6         0.7
     ......         .....       .....
     100007         ccL34           0.9         1.1           

The V3 of df2 and Identifier of df1 have different length. I try dplyr left_join, but the column is attached at the end.

This is the code to create similar data frame

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))

This time when I try dplyr left_join,

left_join(df1, df2, by =c("Identifier"="V3"))

an error message was shown

Error: cannot join on columns 'V3' x 'Identifier': index out of bounds

Anyone has any idea?

pill45
  • 599
  • 3
  • 8
  • 23
  • Your question does not contain a [reproducible example](http://stackoverflow.com/q/5963269/4303162). It is therefore hard to understand your problem and give you an appropriate answer. Please make your data available (e.g. by using `dput()`) or use one of the example data sets in R. Also, add the minimal code required to reproduce your problem to your post. – Stibu Aug 19 '16 at 19:14

2 Answers2

4

One solution is to merge and reorder them (unfortunately not a one-function answer).

Setup

df1 = data.frame(i=c(1,2,3), GSE111=c(4,5,6), GSE222=c(7,8,9))
df2 = data.frame(i=c(1,3,4), v2=c(10,11,12))

This produces:

> df1
  i GSE111 GSE222
1 1      4      7
2 2      5      8
3 3      6      9
4 4      7     10

> df2
  i v2
1 1 10
2 3 11
3 4 12

Merge:

I've used built-in merge function instead of plyr.

df3 = merge(x=df1, y=df2, by="i", all.x=TRUE)

this will yield:

  i GSE111 GSE222 v2
1 1      4      7 10
2 2      5      8 NA
3 3      6      9 11

Reorder:

Now you can reorder by literally typing column names:

df4 = df3[,c("i", "v2", "GSE111", "GSE222")]

which yields

  i v2 GSE111 GSE222
1 1 10      4      7
2 2 NA      5      8
3 3 11      6      9

Which is essentially the product you wanted.



Shortcut:

Or, if you are lazy like me, I can generate unique column names between df1 and df2 like this:

unique(c(colnames(df2), colnames(df1)))

This yields:

[1] "i"      "v2"     "GSE111" "GSE222"

So you can type something like:

df3[,unique(c(colnames(df2), colnames(df1)))]
Shin
  • 123
  • 7
  • Thanks Shin. One problem, will unique() change the order of the column after i. Let say I want the order by GSE111, GSE333, GSE222, will it become GSE111, GSE222, GSE333 after I use unique()? For reorder, if df1 have many column, more than 100, is there an easier way to solve this problem? – pill45 Aug 19 '16 at 20:00
  • It's not documented, but unique() doesn't seem to reorder the elements in my experience. It just filters out subsequent duplicate entries. E.g., `unique(c('a','c','a','b'))` will output `"a" "c" "b"`. While I don't think it'll mess up your gene expression profile order, I'd always check my work. – Shin Aug 19 '16 at 20:08
  • @shin here's why your "shortcut" is unnecessary. `unique` has nothing to do with ordering variables. You get your desired result because the first data frame is df2 in that expression see. ``df3[,unique(c(colnames(df2), colnames(df1)))]`` But you don't need to even do that as you can see in my post if you switch the df1 and df2 in the original merger then all the other `reorder` , `unique` and indexing commands are completely unnecessary and may even lead to mistakes when dealing with complex data frames. – Cyrus Mohammadian Aug 20 '16 at 09:21
  • 1
    @CyrusMohammadian So the newer answer stripped the unnecessary part of the original answer and posted as your own. The important thing is that both answers are **correct**. The solution is trivial so it may not matter, but please refrain from doing similar thing in the future. – Shin Aug 21 '16 at 02:10
  • @Shin yes both answers are correct, if you notice people post different ways to get to the same answer. It's up to the OP to decide. Also note, the OP had additional questions (which they should have added to the original post) and I helped out on that. I assume thats why my answer was accepted. The point of this forum is to help folk out not to compare rep. I've up voted your post as well bc it provides additional useful info. – Cyrus Mohammadian Aug 21 '16 at 04:53
1

Using the data you provided:

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)
       V1            V2          y          z
1  100001     Wisconsin -1.9468637  0.4509951
2  100002      Nebraska  2.5155819  0.4509951
3  100003          Ohio         NA         NA
4  100004 Massachusetts         NA         NA
5  100005       Montana -1.9468637 -2.1825878
6  100006      Illinois -0.1591367  0.3445637
7  100007    New Mexico -0.5696300  0.4509951
8  100008    New Jersey         NA         NA
9  100009     Tennessee         NA         NA
10 100010    Washington -1.9468637 -0.5402241

And you can toggle the order which ever way you like but the above output is what you requested -no need to reorder but if wanted to:

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)[c(2,3,4,1)]
              V2          y          z     V1
1      Wisconsin -1.9468637  0.4509951 100001
2       Nebraska  2.5155819  0.4509951 100002
3           Ohio         NA         NA 100003
4  Massachusetts         NA         NA 100004
5        Montana -1.9468637 -2.1825878 100005
6       Illinois -0.1591367  0.3445637 100006
7     New Mexico -0.5696300  0.4509951 100007
8     New Jersey         NA         NA 100008
9      Tennessee         NA         NA 100009
10    Washington -1.9468637 -0.5402241 100010

As per the OP's posted comments below, here is an example where an identifier exists in df1 that does not exist in df2

df1 <- data.frame("Identifier" = sample(100001:100012, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))

df1
  Identifier           y            z
1     100011 -1.60532712  1.365836073
2     100007 -1.28821500  0.005925986
3     100004 -0.03444609  0.780708952
4     100006  0.32190045  0.780708952
5     100009 -1.60532712 -1.471916384
6     100005 -0.76985033  0.191956916

df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))
df2
       V1            V2
1  100001  Pennsylvania
2  100002 West Virginia
3  100003          Utah
4  100004        Alaska
5  100005          Ohio
6  100006   Mississippi
7  100007 New Hampshire
8  100008    New Jersey
9  100009          Ohio
10 100010       Georgia

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE, all.y=TRUE)
       V1            V2           y            z
1  100001  Pennsylvania          NA           NA
2  100002 West Virginia          NA           NA
3  100003          Utah          NA           NA
4  100004        Alaska -0.03444609  0.780708952
5  100005          Ohio -0.76985033  0.191956916
6  100006   Mississippi  0.32190045  0.780708952
7  100007 New Hampshire -1.28821500  0.005925986
8  100008    New Jersey          NA           NA
9  100009          Ohio -1.60532712 -1.471916384
10 100010       Georgia          NA           NA
11 100011          <NA> -1.60532712  1.365836073
Cyrus Mohammadian
  • 4,982
  • 6
  • 33
  • 62
  • Thanks CM,but Identifier in df1 may have some variable that is not V1, although this condition is not seen in the dataset I created. So, the way you match may some how eliminate those variable that is not found in V1, and I don't want this happen. But thanks for your idea to toggle the order of the data. – pill45 Aug 19 '16 at 20:22
  • Great! If I want to remove NA in column y or column z but not V2, can I make it with merge() or I have to do it separately? – pill45 Aug 19 '16 at 21:03
  • @pill you can do it separately as `dplyr::filter(data, y!="NA")` . If you like this answer consider accepting it so that your post is marked solved. – Cyrus Mohammadian Aug 20 '16 at 00:11