1

Update: My apologies... I should have specified that there are missing values for some tests (come conducted n times, others conducted m times), and I'm thinking that's going to affect the answer based on the merge documentation:

all.x logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

I don't want either of these options: namely, I don't want only matching row/column values kept, nor do I want non-matching row/column intersection values filled in with NAs.


I have data from an experiment and stored my chemical compositions in one file and testing in another. I'd like to merge these so I can plot boxplots faceted by composition factor level, but am having a hard time figuring out how to do it. Here's some example data (adjusted per the above):

set.seed(1)
df.ids <- data.frame(lot=c("A","B","C"),
                     var1=c(1,2,3),
                     var2=c(4,5,6))

df.testing <- data.frame(lot=c(rep("A",4), rep("B",4),rep("C",4)),
                         test1=rep(c(runif(3,min=1,max=5),NA),3),
                         test2=rep(c(runif(2,min=1,max=5),NA,NA),3),
                         test3=rep(runif(4,min=1,max=5),3))

Like my real data, the ids data frame has one row per factor level, but the test data has multiple for each test replicate. How might I merge these so that I fill in the compositions for each of the lots? In other words, my resultant data frame would look like this:

| lot |    test1 | test2    |    test3 | var1 | var2 |
|-----+----------+----------+----------+------+------|
| A   | 2.062035 | 4.632831 | 4.593559 | 1    | 4    |
| A   | 2.488496 | 1.806728 | 4.778701 | 1    | 4    |
| A   | 3.291413 | NA       | 3.643191 | 1    | 4    |
| A   |       NA | NA       | 3.516456 | 1    | 4    |
| B   | 2.062035 | 4.632831 | 4.593559 | 2    | 5    |
| B   | 2.488496 | 1.806728 | 4.778701 | 2    | 5    |
| B   | 3.291413 | NA       | 3.643191 | 2    | 5    |
| B   |       NA | NA       | 3.516456 | 2    | 5    |
| C   | 2.062035 | 4.632831 | 4.593559 | 3    | 6    |
| C   | 2.488496 | 1.806728 | 4.778701 | 3    | 6    |
| C   | 3.291413 | NA       | 3.643191 | 3    | 6    |
| C   |       NA | NA       | 3.516456 | 3    | 6    |

In the end, I suppose all I'm trying to do is match unique lot IDs and then just replicate the values from df.ids for each of it's columns into df.testing. I just don't know how (well, I thought of doing it with a for() loop, but I'm sure there's a better way)!

Hendy
  • 10,182
  • 15
  • 65
  • 71
  • @ttmaccer I just edited my answer... I have NAs which might affect the answer. I know that the default merge removes nearly all of my rows (~700 down to 40), so there's something going on. – Hendy Aug 30 '12 at 23:10
  • @ttmaccer Not fully understanding `merge()` and being new to R, I don't know entirely what was going on, but I just had to use `merge(df.1, df.2, by="common.column.name")` and it worked. – Hendy Aug 30 '12 at 23:18

1 Answers1

2

You can just rearrange the order of the columns:

> merge(df.ids, df.testing)
  lot var1 var2    test1    test2
1   A    1    4 3.389578 11.55608
2   A    1    4 9.502077 15.47929
3   A    1    4 9.173870 17.18321
4   B    2    5 4.349115 12.85377
5   B    2    5 2.815137 14.45693
6   B    2    5 6.947180 17.45857
7   C    3    6 9.085507 17.92857
8   C    3    6 6.155680 12.58901
9   C    3    6 6.662026 19.92715

> merge(df.ids, df.testing)[c(1,4:5,2:3)]
  lot    test1    test2 var1 var2
1   A 3.389578 11.55608    1    4
2   A 9.502077 15.47929    1    4
3   A 9.173870 17.18321    1    4
4   B 4.349115 12.85377    2    5
5   B 2.815137 14.45693    2    5
6   B 6.947180 17.45857    2    5
7   C 9.085507 17.92857    3    6
8   C 6.155680 12.58901    3    6  
9   C 6.662026 19.92715    3    6

This turns out to be an inner-join because there are no missing items in the merge-by columns, but the by.x and by.y parameters can be chosen so you get a left-outer or a right-outer join if needed. If for some reason you wanted the order of the items in df.testing to be maintained, you would need to create a column that got carried along, which would then be order()ed-by.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Sorry for possible confusion. I do have missing values in some columns (see the update). I tried using `by.x=` and `by.y` values, but am not sure which to select. I tried df.ids$lot and df.testing$lot (well, the equivalent in my actual data set) and got `Error in fix.by(by.x, x) : 'by' must specify uniquely valid column(s)` – Hendy Aug 30 '12 at 23:07
  • Sorry again... merge still works just fine on the test data, but most definitely not on my real data. I don't know what's going on. – Hendy Aug 30 '12 at 23:13
  • Last time... I just had to specify `by=common.column.name`. Sigh. That was so simple. – Hendy Aug 30 '12 at 23:17
  • Last question. I have some replicate column names in both df's. They are being both included in the form of `colname.x` and `colname.y`; is there a way to prevent these duplicates? I thought specifying `all=F` might do it, but I get the same result. – Hendy Aug 30 '12 at 23:31
  • Solved it. I thought I had to tell it *not* to include duplicates via `all[.x,y]=F`; in actuality, I found [another question](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right?lq=1) that turned on the light bulb that setting `all.x=T` is telling merge to *keep* the data from x when there are duplicate column names. I'm officially a happy camper! – Hendy Aug 30 '12 at 23:47