Merging data frames with missing values in R

Question

Code to get the data frames:

rat_all = structure(list(frequency = c(37L, 31L, 14L, 11L, 2L, 3L), isoforms = 8:13,      
    type = structure(c("rat_all", "rat_all", "rat_all", "rat_all",              
    "rat_all", "rat_all"), .Dim = c(6L, 1L))), .Names = c("frequency",          
"isoforms", "type"), row.names = 8:13, class = "data.frame")

rat_ensembl = structure(list(frequency = c(17L, 8L, 20L), isoforms = 8:10,                    
    type = structure(c("rat_ensembl", "rat_ensembl", "rat_ensembl"              
    ), .Dim = c(3L, 1L))), .Names = c("frequency", "isoforms",                  
"type"), row.names = 8:10, class = "data.frame")

I have two data frames:

  frequency isoforms        type                                               
8         17        8 rat_ensembl                                               
9          8        9 rat_ensembl                                               
10        20       10 rat_ensembl

and

   frequency isoforms    type                                                   
8         37        8 rat_all                                                   
9         31        9 rat_all                                                   
10        14       10 rat_all                                                   
11        11       11 rat_all                                                   
12         2       12 rat_all                                                   
13         3       13 rat_all

I'd like to combine these into one data frame, but also to include the missing isoforms entries that appear in the rat_all data frame but not the rat_ensembl data frame. So I'd like the output to be a combined data frame as if I rbinded the two data frames, but augmented with:

11         0       11 rat_ensembl
12         0       12 rat_ensembl
13         0       13 rat_ensembl

I thought I could do it with merge, but I end up with a huge mess that I have to unwind. I can eventually massage it into the right format, but that is not a good solution if I want to do this for four or five different 'types' at once. What am I missing? Thanks!

To be clear I'm looking to get a final data frame that looks like:

   frequency isoforms        type
1         17        8 rat_ensembl
2          8        9 rat_ensembl
3         20       10 rat_ensembl
4         37        8 rat_all
5         31        9 rat_all
6         14       10 rat_all
7         11       11 rat_all
8          2       12 rat_all
9          3       13 rat_all
10         0       11 rat_ensembl
11         0       12 rat_ensembl
12         0       13 rat_ensembl

I can kind of get it to do what I want if I use:

z = merge(rat_ensembl, rat_all, by.x="isoforms", by.y="isoforms", all.y=TRUE)
   isoforms frequency.x      type.x frequency.y  type.y                         
7         7          44 rat_ensembl          69 rat_all                         
8         8          17 rat_ensembl          37 rat_all                         
9         9           8 rat_ensembl          31 rat_all                         
10       10          20 rat_ensembl          14 rat_all                         
11       11          NA        <NA>          11 rat_all                         
12       12          NA        <NA>           2 rat_all                         
13       13          NA        <NA>           3 rat_all                         
14       14          NA        <NA>           1 rat_all

Then, theoretically, I could select out the isoforms, frequency.x, and type.x columns, fix them so they are correct for each of rat_ensembl and rat_all, and then rbind those data frames together. But it seems like there should be something that handles this directly.
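For reference, the merge-then-fix route described above could be sketched roughly as follows. This is only an illustration under some assumptions: the two data frames are rebuilt here with plain character `type` columns (rather than the one-column matrices in the dput output), a missing frequency is taken to mean a count of 0, and the names `ensembl_full` and `combined` are made up for the example.

```r
# Simplified copies of the question's data (assumption: plain character `type`).
rat_all <- data.frame(frequency = c(37L, 31L, 14L, 11L, 2L, 3L),
                      isoforms = 8:13,
                      type = "rat_all",
                      stringsAsFactors = FALSE)
rat_ensembl <- data.frame(frequency = c(17L, 8L, 20L),
                          isoforms = 8:10,
                          type = "rat_ensembl",
                          stringsAsFactors = FALSE)

# Right join on isoforms: every isoform present in rat_all is kept.
z <- merge(rat_ensembl, rat_all, by = "isoforms", all.y = TRUE)

# Rebuild the rat_ensembl rows, turning missing frequencies into 0.
ensembl_full <- data.frame(
  frequency = ifelse(is.na(z$frequency.x), 0L, z$frequency.x),
  isoforms = z$isoforms,
  type = "rat_ensembl",
  stringsAsFactors = FALSE
)

# Stack the completed rat_ensembl rows on top of the untouched rat_all rows.
combined <- rbind(ensembl_full, rat_all)
```

The row order differs from the desired output shown earlier (all rat_ensembl rows come first, sorted by isoforms), but the set of rows is the same. The drawback remains that each extra type needs its own merge and fix-up step.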

What have you tried with merge? What are the common column(s) you want to merge on: frequency, isoforms, type, or all of the above? Once you identify the common columns, it is a matter of choosing an inner, left, right, or outer join through the "all" arguments. Also, can you update your question with a code snippet that people can paste into their R sessions? Use `dput()` and paste the output into your question. – Chase, May 03 '11 at 15:37
Thank you for the dput suggestion, it is very helpful. I added the extra information to the post. – rory, May 03 '11 at 16:11

Luciano Selzer · Accepted Answer · 2011-05-03T17:09:07.810


Maybe you want something like this:

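# Merging on all shared columns with all = TRUE keeps every row from both data frames
z <- merge(rat_ensembl, rat_all, all = TRUE)

# isoforms present in rat_all but absent from rat_ensembl
iso_diff <- setdiff(rat_all$isoforms, rat_ensembl$isoforms)

# Zero-frequency rat_ensembl rows for those missing isoforms
augmented <- data.frame(frequency = 0, isoforms = iso_diff, type = "rat_ensembl", stringsAsFactors = FALSE)

# Append the filler rows to the merged data
df_all <- rbind(z, augmented)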

Hope that helps.

edited May 03 '11 at 17:09

answered May 03 '11 at 15:36

Luciano Selzer


Hi Iselzer, that seems to just give me the same result as rbinding the two data frames together. I'm looking for some way to augment that with some data that is missing from one of the data frames. I updated my post to be a little more clear. Thanks! – rory May 03 '11 at 16:31
@rory, I just updated my answer to include what you really wanted. I don't know if there's an easier way to do this. If someone knows one, please post it. – Luciano Selzer May 03 '11 at 17:10
Thanks Iselzer, the setdiff function is a new one to me. I think this solution will work for me, but it only works if I know for sure that all the differences are due to one condition. In my case, though, that is the case. Thank you! – rory May 03 '11 at 17:16
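
For the case raised in the question, where four or five types need to be handled at once, one possible generalization (not part of the accepted answer) is to build every isoforms/type combination with `expand.grid()`, left-join the observed counts onto it, and fill the gaps with 0. The sketch below assumes plain character `type` columns, at most one row per isoforms/type pair, and that an absent pair means a count of 0; the names `stacked`, `grid`, and `completed` are made up for the example.

```r
# Simplified copies of the question's data (assumption: plain character `type`).
rat_all <- data.frame(frequency = c(37L, 31L, 14L, 11L, 2L, 3L),
                      isoforms = 8:13,
                      type = "rat_all",
                      stringsAsFactors = FALSE)
rat_ensembl <- data.frame(frequency = c(17L, 8L, 20L),
                          isoforms = 8:10,
                          type = "rat_ensembl",
                          stringsAsFactors = FALSE)

# Stack all observed rows; more types can simply be added to this rbind.
stacked <- rbind(rat_ensembl, rat_all)

# Every isoforms/type combination that should exist in the result.
grid <- expand.grid(isoforms = sort(unique(stacked$isoforms)),
                    type = unique(stacked$type),
                    stringsAsFactors = FALSE)

# Left join the observed counts onto the full grid; absent pairs become NA.
completed <- merge(grid, stacked, by = c("isoforms", "type"), all.x = TRUE)

# Treat absent pairs as a frequency of 0.
completed$frequency[is.na(completed$frequency)] <- 0L

# Restore the original column order and sort by type, then isoforms.
completed <- completed[order(completed$type, completed$isoforms),
                       c("frequency", "isoforms", "type")]
```

Because the grid is derived from the stacked data, this does not depend on knowing in advance which data frame is missing which isoforms.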
