How do I merge 2 data frames on R based on 2 columns?

Question

I am looking to merge 2 data frames based on 2 columns in R. The two data frames are called popr and dropped column, and they share the same 2 variables: USUBJID and TRTAG2N, which are the variables that I want to combine the 2 data frames by.

The merge function works when I am only trying to do it based off of one column:

merged <- merge(popr,droppedcol,by="USUBJID")

When I attempt to merge by using 2 columns and view the data frame "Duration", the table is empty and there are no values, only column headers. It says "no data available in table".

I am tasked with replicating the SAS code for this in R:

data duration;
  set pop combined1 ;
  by usubjid trtag2n;
run;

On R, I have tried the following

duration<- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")

duration <- merge(popr,droppedcol,by.x="USUBJID","TRTAG2N",by.y="USUBJID","TRTAG2N")

duration <- full_join(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

duration <- merge(popr,droppedcol,by = c("USUBJID","TRTAG2N"))

I would like to see a data frame with the columns USUBJID, TRTAG2N, TRTAG2, and FUDURAG2, sorted by first FUDURAG2 and then USUBJID.

statsprogramming, welcome to SO! Please make this question *reproducible*. This includes sample *unambiguous* data (e.g., `dput(head(popr))` or `data.frame(x=...,y=...)`) and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. — r2evans, Jul 09 '19 at 21:04
It's hard to tell without a more complete example, but the last two things you tried are correct (for an full join or inner join, respectively); if the results aren't what you expect that typically means that the columns you're joining on do not, in fact, match in circumstances where you expect them to. For example, leading or trailing white space, etc. — joran, Jul 09 '19 at 21:05
The SAS code you show is NOT doing a merge. Instead it is just essentially appending one to the end of the other. The use of the BY statement is just causing the observations to be interleaved so that the resulting dataset is still sorted by them. In the language of SQL you are doing a UNION instead of a JOIN. — Tom, Jul 09 '19 at 21:32

score 1 · Accepted Answer · answered Jul 09 '19 at 21:46

Per the SAS documentation, Combining SAS Data Sets, and confirmed by the SAS guru, @Tom, in comments above, the set with by simply means you are interleaving the datasets. No merge (which by the way is also a SAS method which you do not use) is taking place:

Interleaving uses a SET statement and a BY statement to combine multiple data sets into one new data set. The number of observations in the new data set is the sum of the number of observations from the original data sets. However, the observations in the new data set are arranged by the values of the BY variable or variables and, within each BY group, by the order of the data sets in which they occur. You can interleave data sets either by using a BY variable or by using an index.

Therefore, the best translation of set without by in R is rbind(), and set with by is rbind + order (on the rows):

duration <- rbind(pop, combined1)                                # STACK DFs
duration <- with(duration, duration[order(usubjid, trtag2n),])   # ORDER ROWS

However, do note: rbind does not allow unmatched columns between the concatenated data sets. However, third-party packages allow for unmatched columns including: plyr::rbind.fill, dplyr::bind_rows, data.table::rbindlist.

How do I merge 2 data frames on R based on 2 columns?

1 Answers1