Merge only one column from two different datasets with uneven rows based on unique row ID

Question

I am working with two dataframes that have some shared columns but also different columns. The first one has date and ID. The second one has ID, sex, release type (rt), and release year (ry). Here are two dummy dataframes.

library(tidyverse)

#First dataframe

date = c("2015-05-01","2015-05-04","2015-05-05","2015-07-01","2015-07-02","2015-07-05","2015-07-06",
         "2015-05-01","2015-05-04","2015-05-05","2015-05-27","2015-05-28","2015-06-05","2015-06-06",
         "2015-05-01","2015-05-02","2015-05-03","2015-05-04","2015-05-05","2015-05-06","2015-05-07")

ID = c("0-10","0-10","0-10","0-10","0-10","0-10","0-10",
       "0-2","0-2","0-2","0-2","0-2","0-2","0-2",
       "0-8","0-8","0-8","0-8","0-8","0-8","0-8")

df1 = data.frame(date,ID)
df1$date = as.Date(df1$date)

Second dataframe

ID = c("0-10", "0-2","0-8","0-4","5-0")
sex = c("M","F","Unknown","F","F")
rt = c("S", "S", "H", "H", "S")
ry = c("2015", "2016","2015", "2015", "2016")
df2 <- data.frame(ID, sex, rt, ry)

df2$sex = as.factor(df2$sex)

What I need is, for EACH entry in dataframe one (df1), info on sex to be automatically filled in from dataframe two (df2) without merging other columns in the second dataframe. I.e., I only need a sex column in the df1.

I have tried the codes for the dplyr package in this thread How to join (merge) data frames (inner, outer, left, right) but when I tried it, it merged all the columns not just the one I needed.

Any help is appreciated. (I also prefer using the dplyr package). Thank you.

this is a merge/join operation, see https://stackoverflow.com/q/1299871/3358272, https://stackoverflow.com/q/5706437/3358272. Namely, `left_join(df1, df2, by = "ID")` — r2evans, May 12 '23 at 17:01
I hope the dupe links and the code in my previous comment were sufficient to get what you need. (FYI, the code assumes that `df1$sex` has _not_ been previously created. If it is already there, you will find both `sex.x` and `sex.y` in the results, since `left_join` did not want to delete a pre-existing column of the same name.) If this doesn't do what you expect, please [edit] your question and include your expected output, then @-ping me, and we'll sort it out, reopening if needed. Good luck Rnoobie! — r2evans, May 12 '23 at 17:04
df1 has a sex column too. I will try deleting the column, and try this again. — Rnoobie, May 12 '23 at 17:05
It does not work. After deleting the sex column from df1 I get the following message. "Error: Join columns must be present in data. x Problem with `sex`." — Rnoobie, May 12 '23 at 17:06
My point was that in your sample data, you pre-filled `df1$sex` as a placeholder for what would come from `df2`. That step is not necessary, and in fact gums up _a little_ by renaming the pre-existing `sex` to `sex.x` and naming the new `sex` (from `df2`) as `sex.y`. If you _don't_ assign `df1$sex` as you did using `add_column`, and therefore `df1` does not have a `sex` column, then it should work. Removal of `sex` from `df2` makes no sense, that's the reference column, it must be there. — r2evans, May 12 '23 at 17:10
I did not remove sex column from df2. Of course, it doesn't make sense to delete it from df2. I only deleted it from df1 by deleting df1 = df1 %>% add_column(sex =NA). I found the link you shared before as well, but it didn't work. — Rnoobie, May 12 '23 at 17:22
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253636/discussion-between-rnoobie-and-r2evans). — Rnoobie, May 12 '23 at 18:34
After discussion, we resolved on `left_join(df1, select(df2, ID, sex), by = "ID")` as the way to join in only one of many other columns from `df2`. — r2evans, May 12 '23 at 18:49

Merge only one column from two different datasets with uneven rows based on unique row ID

0 Answers0