0

I am attempting to merge two data frames using right join with dplyr but wondering if the cross join is what I should be using. I have read through multiple other related threads including this one which was helpful but to no avail.

In short, the right join works for me, but it reproduces the common variables between the two data frames as separate .x and .y variables. I just want to add a specific variable (column) from the left (x) data frame to the right (y) data frame (which has more rows/observations) matching on the "SUBJECT" key variable.

I am not sure how to modify the below code to achieve the desired result such that observations on the specific variable merged from the x data frame which does not have a value for this variable in the y data frame is "N/A". Of note, virtually all of the variables present in x are present in y (X is a subset of Y), but I only want to match on the SUBJECT variable. Below is my current code.

DF_merged <- right_join(DF.X, DF.Y, by = 'SUBJECT')
Z.Lin
  • 28,055
  • 6
  • 54
  • 94
Jhaltiga68
  • 125
  • 9

1 Answers1

0

You may want to select a subset of the dataframe that you want to merge. One option is to simply use base R for merging, in which by.y indicates the right-join.

I tried to create a reproducible example here:

DF.X <- data.frame(SUBJECT = 1:5,
                   COLUMN.A = 5:9,
                   COLUMN.B = 11:15)

DF.Y <- data.frame(SUBJECT = 1:15,
                   COLUMN.B = 11:25,
                   COLUMN.C = 15:29)

DF_merged <- merge(x = DF.X[ ,c("SUBJECT", "COLUMN.A")], y = DF.Y, by = "SUBJECT", all.y = TRUE)
  • Using that code resulted in the error message below. I am wondering if this is b/c "SUBJECT" is only available for a subset of the Y cases. To clarify: the X data frame has only a subset of observations as Y, but identical columns and two unique columns I want to merge with the Y data frame. Observations in Y that do not have values for these unique X columns would return (ideally) N/As Error: `by` can't contain join column `SUBJECT` which is missing from LHS Call `rlang::last_error()` to see a backtrace – Jhaltiga68 Oct 22 '19 at 20:52
  • @G.Grothendieck saw this contribution from you and was hoping you could articulate how it may apply in the case of using a 'key' variable (SUBJECT) https://stackoverflow.com/a/14103393/5825064 – Jhaltiga68 Oct 22 '19 at 23:40
  • I edited the code, as I mixed up base R with dplyr (right-join). In the example above, I suggested the base-R option. – LukeXywalker Oct 23 '19 at 14:21
  • The merge route did indeed work and was what I went with. It is somewhat awkward/non optimal however in that the resulting data frame has both .x and .y variables that are retained. It's not a huge problem as the variables are clearly distinguishable as a function of row n's, but wondering how to only retain the "full" variables. To be more specific: I am only adding variables two a data frame, not cases, as the base data frame has all cases. – Jhaltiga68 Oct 23 '19 at 16:08
  • Could you recreate the example you are describing? If you wish to only add columns, make sure that in the subset only 'unique' variables are allowed (except for ID, obviously). Otherwise, consider cbind if you simply want to add columns to an existing dataframe – LukeXywalker Oct 23 '19 at 16:53
  • I wrangled with cbind originally but it requires if not mistaken columns of same length which is not the case in my scenario. – Jhaltiga68 Oct 23 '19 at 17:36
  • That's correct. If both df.X and df.Y have columns A, B, and you only want COLUMN.A from df.Y then add this to the subset for merging: DF.Y[ ,c["COLUMN.A", "COLUMN.B" and make sure it is not automatically selected for DF.X (without specification, all columns are added). Hope this helps! – LukeXywalker Oct 24 '19 at 16:47
  • Would this have to be done for each instance of the duplicate columns? The base data file has >50 variables to which I will add vectors from other data frames with those same variables, as well as 1 or 2 unique ones that I need to match by "SUBJECT". As it is now, with both instantiations of .x and .y (for the duplicate variables from both data frames) it is quite messy and non optimal. – Jhaltiga68 Oct 24 '19 at 18:14
  • Perhaps indexing can do the job? So instead of SUBJECT, COLUMN.A you choose 1:2, so in your example you could select columns one to fifty using 1:50. – LukeXywalker Nov 01 '19 at 15:49
  • A work-around that I settled on was to re-create the 'added' data frames with only the key variable and the unique variables that I wanted to add to the .x 'master' data frame. I removed and then recreated those data frames, but one could just as simply (if few variables) have set them to null (i.e., deleted those columns) – Jhaltiga68 Nov 06 '19 at 02:25