I am manipulating genomic data in R and have run into a problem. I could probably solve it on my own, but I suspect there is a more efficient way.

I have three two-column matrices: the first column holds a gene name and the second holds cancer information. I want to combine them into one data frame.

Here are my matrices:

result0
tp53   c1
apc    c2

result1
tp53   d1
col2a1 d2

result2
tp53   e1
wt1    e2

I want to combine the three matrices into one, adding two columns, as shown below.

combined result
tp53   c1 d1 e1
apc    c2
col2a1 d2
wt1    e2

Rows that share a gene name should be collapsed into a single row, with the two additional columns holding the values from the other data sets, so that the new data frame contains all the results. How can I do this in R? The solution needs to work on matrices with a large number of rows.

josliber
이승철
  • read _in detail_ the help page `?merge.data.frame` – RockScience Apr 23 '15 at 10:07
  • [How to join data frames in R (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/) – zx8754 Apr 23 '15 at 10:10

1 Answer

merge() joins only two objects at a time. Since you have three matrices, you can use Reduce() to merge them cumulatively:

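# Each matrix: column 1 is the gene name (the key), column 2 is the value.
m1 <- matrix(c('tp53','apc','c1','c2'),2)
m2 <- matrix(c('tp53','col2a1','d1','d2'),2)
m3 <- matrix(c('tp53','wt1','e1','e2'),2)
# Full outer join on column 1, applied pairwise from left to right.
m <- Reduce(function(x,y) merge(x,y,by=1,all=TRUE),list(m1,m2,m3))
m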
##       V1 V2.x V2.y   V2
## 1    apc   c2 <NA> <NA>
## 2   tp53   c1   d1   e1
## 3 col2a1 <NA>   d2 <NA>
## 4    wt1 <NA> <NA>   e2
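
To make the cumulative merge concrete, here is a minimal sketch showing that Reduce() simply nests the pairwise calls from left to right. The helper name outer_join is a hypothetical label introduced only for this illustration; the inputs are the matrices from the answer.

```r
# Inputs copied from the answer.
m1 <- matrix(c('tp53', 'apc', 'c1', 'c2'), 2)
m2 <- matrix(c('tp53', 'col2a1', 'd1', 'd2'), 2)
m3 <- matrix(c('tp53', 'wt1', 'e1', 'e2'), 2)

# Hypothetical helper: full outer join of two tables on their first column.
outer_join <- function(x, y) merge(x, y, by = 1, all = TRUE)

# Reduce() folds the list left to right ...
folded <- Reduce(outer_join, list(m1, m2, m3))
# ... which is the same as writing the nested calls by hand.
nested <- outer_join(outer_join(m1, m2), m3)

identical(folded, nested)  # TRUE
dim(folded)                # 4 rows (genes), 4 columns (key + 3 values)
```

The same pattern should extend to any number of inputs: add them to the list and the fold takes care of the rest.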

merge() is not designed to combine non-key columns, so, as you can see, the c1/c2/d1/d2/e1/e2 values still sit in separate columns of the merged object, with NA gaps between them. You can shift the non-NA values to the left with one more line of code (or combine the two lines into one, since m is used only once on the right-hand side of the second line):

# For each row: drop the NAs, then pad on the right back to the original length.
m <- as.data.frame(t(apply(m,1,function(x) na.omit(x)[seq_along(x)])))
m
##       V1 V2   V3   V4
## 1    apc c2 <NA> <NA>
## 2   tp53 c1   d1   e1
## 3 col2a1 d2 <NA> <NA>
## 4    wt1 e2 <NA> <NA>
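
The left-packing trick is easier to see on a single row. The sketch below builds, by hand, one row as apply() would pass it (the values are taken from the col2a1 row of the merged output; the variable names are only illustrative) and applies the same expression.

```r
# One row as apply() would pass it: a named character vector with NA gaps.
row <- c(V1 = "col2a1", V2.x = NA, V2.y = "d2", V2 = NA)

# na.omit() drops the NAs; indexing past the end of the shortened vector
# yields NA, which pads the result back to the original length.
packed <- na.omit(row)[seq_along(row)]

unname(packed)  # "col2a1" "d2" NA NA
```

Because indexing beyond the end of a vector returns NA in R, no explicit padding step is needed.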

You may notice that the row order of m does not follow the order in which the key values occurred in the input matrices. I'm not sure exactly why the rows come out in this particular order, with an unmatched row (apc) ahead of a matched one (tp53). In any case, merge() does not promise to preserve input order: by default it sorts on the key columns (sort=TRUE), and with sort=FALSE the documented row order is unspecified. You can restore the order of first appearance with the following (row names can be fixed up afterward as well, if necessary, via row.names()/rownames()/dimnames()):

# Look up each key, in order of first appearance, among the rows of m.
m[match(unique(c(m1[,1],m2[,1],m3[,1])),m[,1]),]
##       V1 V2   V3   V4
## 2   tp53 c1   d1   e1
## 1    apc c2 <NA> <NA>
## 3 col2a1 d2 <NA> <NA>
## 4    wt1 e2 <NA> <NA>
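
The argument order in match() matters for this reordering: the desired keys go first and the existing keys second, so that the result lists, for each wanted position, the row to take. A minimal sketch on hypothetical toy data (the names merged and key_order are assumptions made for this example), chosen so that the required permutation is not a simple swap:

```r
# Hypothetical merged result whose rows are out of order.
merged <- data.frame(gene = c("b", "c", "a"), value = c(2, 3, 1))

# Desired order, e.g. order of first appearance in the inputs.
key_order <- c("a", "b", "c")

# For each wanted key, find the row of `merged` that holds it.
idx <- match(key_order, merged$gene)   # 3 1 2
reordered <- merged[idx, ]

as.character(reordered$gene)           # "a" "b" "c"
```

Swapping the two arguments would give the inverse permutation, which happens to coincide with the correct one only when the rows differ by a simple swap.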

Notes:

  • I haven't bothered with column names anywhere, since you haven't specified any in your question. If necessary, you can set column names after the fact with names()/setNames()/colnames()/dimnames().
  • Funnily enough, although merge() accepts matrix inputs, it always spits out a data.frame, and although apply() accepts data.frame inputs, it always spits out a matrix. I've added a final call to as.data.frame() in the second line of code because you've specified you want a data.frame as the output, but you can remove that call to get a matrix as the final result.
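
Both notes can be checked with a short sketch. The column names gene, result0 and result1 are hypothetical labels picked for this illustration, not names given in the question, and only two of the matrices are used to keep it brief.

```r
m1 <- matrix(c('tp53', 'apc', 'c1', 'c2'), 2)
m2 <- matrix(c('tp53', 'col2a1', 'd1', 'd2'), 2)

# merge() accepts matrices but returns a data.frame.
joined <- merge(m1, m2, by = 1, all = TRUE)
is.data.frame(joined)   # TRUE

# apply() accepts a data.frame but returns a matrix.
packed <- t(apply(joined, 1, function(x) na.omit(x)[seq_along(x)]))
is.matrix(packed)       # TRUE

# Convert back and assign (hypothetical) column names in one step.
result <- setNames(as.data.frame(packed), c("gene", "result0", "result1"))
names(result)           # "gene" "result0" "result1"
```

Dropping the as.data.frame() call and using colnames(packed) <- ... instead would keep the result as a matrix.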
bgoldst