
I have two large datasets, d1 and d2, which I would like to merge based on a match on EITHER of the variables idA or idB.

Both variables are present in both datasets, but either can contain errors or missing values (NAs), or idA and idB may point to different observations in the other dataset. See the example below.

The desired outcome is listed below. Basically, matches are on either idA OR idB. In case of duplicated matches, preference should be given to idA matches over idB matches.

The actual datasets are quite large (~10 million rows).

Is there an efficient way to implement this in R?

Also, is there a technical term for this sort of join in SQL?

library(tidyverse)
library(data.table)

d1 <- read.table(text=
"idA idB value1
A 10 500
B 1  111
C 4  234
D NA 400
E 7  500
NA 3 700
Z 5  543
Q 9  567
U 23 723
", 
header=T) %>% data.table


d2 <- read.table(text=
"idA idB value2
A 10 11
B 1  12
L 21 15
D 15 12
E 8  17
M 3  18
N 5  13
Z 25 17
Q 23 12
",
header=T) %>% data.table

Desired outcome is:

Out <- read.table(text=
"d1.idA d2.idA d1.idB d2.idB d1.v1 d2.v2
A  A  10 10 500 11 # matched on idA and idB
B  B  1  1  111 12 # matched on idA and idB
D  D  NA 15 400 12 # matched on idA. d1.idB was NA
E  E  7  8  500 17 # matched on idA. idB had divergent values
NA M  3  3  700 18 # matched on idB. d1.idA had NAs
Z  Z  5  25 543 17 # d1[7,] matched to d2[8,] on idA and to d2[7,] on idB. Priority given to idA match.
Q  Q  9  23 567 12 # d2[9,] matched to d1[8,] on idA and d1[9,] on idB. Priority given to idA match.
",
header=T) %>% data.table

#Non matched rows
# d1[3,]
# d2[3,]

EDIT1:

  • added desired outcome
  • kept data.table(read.table) because it is easy for readers to parse; the actual data comes from fread(file)

EDIT2: remove non matched rows from desired outcome

LucasMation
  • @akrun, I don't see how this is a duplicate of that. I am asking how to join by two id variables, requiring that only one of them match (that question refers to inner, left, and right joins on the same variable). – LucasMation Feb 10 '17 at 10:00
  • Please complete your example with the desired output before lobbying to reopen. "preference should be given to idA matches over idB" is not really clear on its own. Also, fyi, if the datasets are large, you'll want to use `fread` instead of `data.table(read.table(.))`, which is doubly slow -- slower to read in and taking an unnecessary copy when converting to DT. – Frank Feb 10 '17 at 15:16
  • added the requested features. Still, the original question is completely unrelated to the question indicated as duplicate – LucasMation Feb 10 '17 at 18:22

2 Answers


I'm not aware of an elegant way to achieve the desired result (nor am I aware of a technical term for this type of operation in SQL).

Therefore, I suggest accomplishing this in four steps:

  1. Inner join the two data.tables on idA and idB, determine the remaining rows in each of the data.tables.
  2. Inner join the remaining rows of the two data.tables on idA, determine remaining rows, again.
  3. Inner join the remaining rows of the two data.tables on idB.
  4. Combine the results of the previous steps.

Code for all 4 steps:

library(data.table)

# create index column in both data.tables
d1[, idx := .I]
d2[, idx := .I]

# inner join on idA and idB
j1 <- d1[d2, .(idx, i.idx), on = c("idA", "idB"), nomatch = 0L]
m1 <- unique(j1$idx)
m2 <- unique(j1$i.idx)

# inner join on idA
j2 <- d1[!(idx %in% m1)][d2[!(idx %in% m2)], .(idx, i.idx), on = c("idA"), nomatch = 0L]
m1 <- append(m1, unique(j2$idx))
m2 <- append(m2, unique(j2$i.idx))

# inner join on idB
j3 <- d1[!(idx %in% m1)][d2[!(idx %in% m2)], .(idx, i.idx), on = c("idB"), nomatch = 0L]
m1 <- append(m1, unique(j3$idx))
m2 <- append(m2, unique(j3$i.idx))

# combine results
rbindlist(
  list(
    AB = cbind(
      d1[idx %in% j1[, idx]],
      d2[idx %in% j1[, i.idx]]),
    A. = cbind(
      d1[idx %in% j2[, idx]],
      d2[idx %in% j2[, i.idx]]),
    .B = cbind(
      d1[idx %in% j3[, idx]],
      d2[idx %in% j3[, i.idx]])),
  fill = TRUE, 
  idcol = "match_on")

which yields

#   match_on idA idB value1 idx idA idB value2 idx
#1:       AB   A  10    500   1   A  10     11   1
#2:       AB   B   1    111   2   B   1     12   2
#3:       A.   D  NA    400   4   D  15     12   4
#4:       A.   E   7    500   5   E   8     17   5
#5:       A.   Z   5    543   7   Z  25     17   8
#6:       A.   Q   9    567   8   Q  23     12   9
#7:       .B  NA   3    700   6   M   3     18   6

m1 and m2 are used to remember the row ids of the rows in d1 and d2, respectively, which have already been used in one of the previous join operations.

Finally, the remaining rows in d1 and d2 which haven't found a match can be printed:

d1[!(idx %in% m1)]
#   idA idB value1 idx
#1:   C   4    234   3
#2:   U  23    723   9
d2[!(idx %in% m2)]
#   idA idB value2 idx
#1:   L  21     15   3
#2:   N   5     13   7   

Note that in each join operation only the row indices are retained instead of all columns. The results of the different join operations differ in the names and positions of their columns.

In the final combination step, the rows of the original data.tables d1 and d2 are selected using these indices to produce a uniformly shaped result table.
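As a side note, the `!(idx %in% m1)` filters can also be written as data.table anti-joins; a minimal sketch with toy stand-ins (not the OP's tables):

```r
library(data.table)

# toy stand-ins for d1 and the set of already matched row ids
d1      <- data.table(idA = c("A", "C", "U"), idx = 1:3)
matched <- data.table(idx = c(1L, 3L))

r1 <- d1[!(idx %in% matched$idx)]  # value-matching filter, as used above
r2 <- d1[!matched, on = "idx"]     # anti-join, returns the same rows
r1
r2
```

For large tables the anti-join form may be preferable, since it goes through data.table's join machinery instead of materialising a full logical vector.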

Uwe

Over a year later, I believe I have found a more straightforward approach than in my previous answer. It is quite different and scales to more than just two id columns, so I feel it is better to post it as a separate answer for clarity.

The OP has requested to find matches on either idA or idB. In case of duplicated matches, preference should be given to idA matches over idB matches.

This approach reshapes the id columns in both data.tables from wide to long format so that the name of the id columns become data items which can be used in subsequent join operations. This reshaping is the reason why this approach can be scaled to include an arbitrary number of id columns, e.g., to find matches in idA or idB or idC.

library(data.table)
library(magrittr)

# create index column in both data.tables
d1[, idx := .I]
d2[, idx := .I]

# reshape only id columns from wide to long format
# rename columns just for clarity
l1 <- d1[, .(idx, idA, idB)] %>% 
  melt(id.vars = "idx", variable.name = "id.col", value.name = "key")
l2 <- d2[, .(idx, idA, idB)] %>% 
  melt(id.vars = "idx", variable.name = "id.col", value.name = "key")

# inner join to find all matching combinations of name of id col & key value
m <- l1[l2, on = .(id.col, key), 
        .(match_on = id.col, i1 = idx, i2 = i.idx), nomatch = 0L]

# remove duplicate entries with precedence of "idA"
m %<>% 
  setorder(match_on) %>% 
  unique(by = c("i1")) %>% 
  unique(by = c("i2")) %>% 
  setorder(i1, i2)

# double join of index table with original tables
d1[d2[m, on = .(idx = i2)], on = .(idx = i1)]
#     idA idB value1 idx i.idA i.idB value2 i.idx match_on
# 1:    A  10    500   1     A    10     11     1      idA
# 2:    B   1    111   2     B     1     12     2      idA
# 3:    D  NA    400   4     D    15     12     4      idA
# 4:    E   7    500   5     E     8     17     5      idA
# 5: <NA>   3    700   6     M     3     18     6      idB
# 6:    Z   5    543   7     Z    25     17     8      idA
# 7:    Q   9    567   8     Q    23     12     9      idA
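To illustrate the scaling claim, here is a minimal sketch with a hypothetical third id column idC (toy values, not the OP's data); only the melt() call grows, while the subsequent join on `.(id.col, key)` stays unchanged:

```r
library(data.table)

# hypothetical d1 carrying a third id column idC
d1 <- data.table(idA = c("A", "B"), idB = c("10", "1"), idC = c("x1", "x2"))
d1[, idx := .I]

# the same wide-to-long reshaping step, now over three id columns
l1 <- melt(d1[, .(idx, idA, idB, idC)], id.vars = "idx",
           variable.name = "id.col", value.name = "key")
l1  # 6 rows: one per combination of row and id column
```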

Some additional remarks:

  1. Row indices are introduced to save memory for storing intermediate results instead of keeping all data columns.
  2. The values of idA and idB are coerced to type character when they are combined in the key column by reshaping.
  3. setorder(match_on) is the crucial part which determines precedence of idA over idB. Just by chance, idA is sorted lexicographically before idB. Other precedences can be forced by rearranging the factor levels, e.g., using the forcats package.
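For example, reversing the precedence so that idB wins only requires resetting the factor levels before sorting; a sketch on a toy match table of the shape produced above (not the full pipeline):

```r
library(data.table)

# toy match table: d1 row 1 matched d2 row 1 on idA and d2 row 2 on idB
m <- data.table(match_on = factor(c("idA", "idB")),
                i1 = c(1L, 1L), i2 = c(1L, 2L))

# put idB first in the level order; setorder() sorts factors by their levels
m[, match_on := factor(match_on, levels = c("idB", "idA"))]
setorder(m, match_on)
unique(m, by = "i1")  # the idB match now wins the duplicate on i1
```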
Uwe