I have two large datasets, d1
and d2
, which I would like to merge based on matches of EITHER variables idA
and idB
.
Both are present in both datasets, but both can contain errors and missing values (NAs), or have idA
and idB
point to different observations in the other dataset. Se example bellow.
The desired outcome is listed bellow. Basically, matches are on either idA OR idB, In case of duplicated matches, preference should be given to idA
matches over idB
matches.
Actual datasets are quite large (~10 milion).
Is there and efficient way to implement this in R?
Also, is there a technical term for this sort of join in SQL?
library(tidyverse)
library(data.table)
d1 <- read.table(text=
"idA idB value1
A 10 500
B 1 111
C 4 234
D NA 400
E 7 500
NA 3 700
Z 5 543
Q 9 567
U 23 723
",
header=T) %>% data.table
d2 <- read.table(text=
"idA idB value2
A 10 11
B 1 12
L 21 15
D 15 12
E 8 17
M 3 18
N 5 13
Z 25 17
Q 23 12
",
header=T) %>% data.table
Desired outcome is:
Out <- read.table(text=
"d1.idA d2.idA d1.idB d2.idB d1.v1 d2.v2
A A 10 10 500 11 # matched on idA and idB
B B 1 1 111 12 # matched on idA and idB
D D NA 15 400 12 # matched on idA. d2.idB had NAs
E E 7 8 500 17 # matched on idA. idB had divergent values
NA M 3 3 700 18 # matched on idB. d1.idA had NAs
Z Z 5 25 543 13 # d1[7,] matched to d2[8,] on idA and d2[9,] on idB. Priority given to idA match.
Q Q 9 23 657 17 # d2[9,] matched to d1[8,] on idA and d1[9,] on idB. Priority given to idA match.
",
header=T) %>% data.table
#Non matched rows
# d1[3,]
# d2[3,]
EDIT1:
- added desired outcome
- kept data.table(read.table) due to ease of parsing by readers, actual data comes from fread(file)
EDIT2: remove non matched rows from desired outcome