Merging data frames without duplicating data when several matches are possible

Question

I need merge two data frames, without having duplicate rows/duplicated data if there is more than one match. Basically, if the matching between my dataframes is ambiguous, I would like the ambiguous rows to NOT be matched, and each row to remain separate instead. I've been using the merge function, but it states that If there is more than one match, all possible matches contribute one row each, and I got to the same result using join/inner_join functions. Is there a way to fix this ?

Example:

df1

ID	Question 1	Question 2
A1	1	2
B1	3	4
C1	5	6
C1	7	8

df2

ID	Question 3	Question 4
A1	a	b
B1	c	d
C1	e	f
C1	g	h

What I get using merge by ID

ID	Question 1	Question 2	Question 3	Question 4
A1	1	2	a	b
B1	3	4	c	d
C1	5	6	e	f
C1	7	8	g	h
C1	5	6	g	h
C1	7	8	e	f

What I want

ID	Question 1	Question 2	Question 3	Question 4
A1	1	2	a	b
B1	3	4	c	d
C1	5	6	NA	NA
C1	7	8	NA	NA
C1	NA	NA	e	f
C1	NA	NA	g	h

Thank you for your help !

score 2 · Accepted Answer · answered May 22 '23 at 14:33

You can try to identify the duplicated id's, exclude them when using merge and rbind them afterwards.

X <- unique(c(df1$ID[duplicated(df1$ID)], df2$ID[duplicated(df2$ID)]))
plyr::rbind.fill(merge(df1[!df1$ID %in% X,], df2[!df2$ID %in% X,]),
                 df1[df1$ID %in% X,],
                 df2[df2$ID %in% X,])
#  ID Question1 Question2 Question3 Question4
#1 A1         1         2         a         b
#2 B1         3         4         c         d
#3 C1         5         6      <NA>      <NA>
#4 C1         7         8      <NA>      <NA>
#5 C1        NA        NA         e         f
#6 C1        NA        NA         g         h

Data

df2 <- read.table(header=TRUE, text="ID     Question3   Question4
A1  a   b
B1  c   d
C1  e   f
C1  g   h")

df1 <- read.table(header=TRUE, text="
ID  Question1   Question2
A1  1   2
B1  3   4
C1  5   6
C1  7   8")

Wow, this is such a quick and efficient way, thank you so much ! Is it possible to use *unique* for two colums at the same time ? (my post was oversimplified, in real life, I have IDs + dates for the matching, and my problem is duplicates on ID+date combinations) — Myscellia, May 22 '23 at 15:24
Yes both `unique` and `duplicated` work also for multiple columns, but `%in%` will not. There you can use `interaction`, `paste` or `asplit`. See: [match two data.frames based on multiple columns](https://stackoverflow.com/questions/26596305). — GKi, May 23 '23 at 07:13

DashdotdotDashdotdot · Answer 2 · 2023-05-22T15:06:21.647

I always seek to avoid many to many merges by creating a case variable for each duplicate id and then merge by ID case.

Like below:

df1 <- data.frame(ID = c("A1", "B1", "C1", "C1"), Q1 = c(1, 3, 4, 7), Q2 = c(2, 4, 6, 8))
df2 <- data.frame(ID = c("A1", "B1", "C1", "C1"), Q3 = c("a", "b", "c", "d"), Q4 = c("b", "d", "f", "h"))
library(dplyr)

df3 <- full_join(df1, df2, by = "ID")

df3

df11 <- df1 %>%
  group_by(ID) %>%
  mutate(case = row_number())

df22 <- df2 %>%
  group_by(ID) %>%
  mutate(case = row_number())

tmp <- df22 %>% summarise(count = n())

df22 <- merge(df22, tmp, by = c("ID"))

df22 <- df22 %>% mutate (case = ifelse(count==2,case+2,case))
df22

df33 <- full_join(df11, df22, by = c("ID", "case"))
df33

which returns:

Merging data frames without duplicating data when several matches are possible

2 Answers2