
I frequently need to match two datasets on several matching columns at once, for two reasons. First, each of these characteristics is ‘dirty’: a single column does not match consistently even when it should (i.e., for a truly matching row). Second, the characteristics are not unique (e.g., male and female). This kind of matching is useful for linking records across time (pre-test with post-test scores), across data modalities (observed characteristics and lab values), or across multiple datasets of research participants.

I need a heuristic that selects the best match; then I can analyse the two datasets together, as described in this question. Note that there are many matching columns and many IDs, so both must be specified as a list or vector. As an example, I have created two datasets below to match. In the example, DT1 row 1 (ID 1) is the best match for DT2 row 1 (ID 55) even though only the ‘match4’ column matches, because DT2 rows 2 and 3 are better matches for DT1 rows 2 and 3. Bonus: DT1 row 7 matches DT2 rows 7 and 8 equally well, but has a partial match to DT2 row 7, so ideally that one would be selected.

Question: For DT1, select a “best guess” for the matching row from DT2, and use each row from DT2 only once. What is the best way to do this (in an efficient and “best practices” idiomatic way) in R?

My preliminary approach: create a third data.table, DTmatch, whose first column holds the IDs from DT1; all subsequent columns are named after the IDs from DT2. In each of those columns (e.g., the second column, named after the first DT2 ID), each value is the count of matching columns between that DT1 row and that DT2 ID (in this example, 0 to 4). Next, find the highest match counts in the table that are unique to each row and column, so that every DT2 ID is used at most once. Lastly, create a final column giving the DT2 ID matched to each DT1 ID (column 1 of DTmatch).

library(data.table)
# In this example, the datasets are matched by row number, but the real data is not.
DT1 = data.table(
  ID = 1:7,
  match1 = c("b","b","b","a","a","c",NA),
  match2 = c(7, 8, 9, NA, NA, NA, NA),
  match3 = c(0, 0, 0, "j", 13:15),
  match4 = c(rep("m", 4), rep("f", 3)),
  value1 = 45:51,
  value2 = 100:106
)

DT2 = data.table(
  ID = 55:62,
  match1 = c("b","b",4,"a","a","c","j","j"),
  match2 = c(77, 8:14),
  match3 = c(9:14, 155, 16),
  match4 = c(rep("m", 4), NA, rep("f", 3)),
  value1 = 145:152,
  value2 = 101:108
)

# Convert numeric IDs to syntactically valid names (e.g. 1 -> "X1") so they can be used as column names
DT1[, ID := make.names(ID)]
DT2[, ID := make.names(ID)]

# Make new matching table: the first column holds the DT1 IDs
DTmatch <- DT1[, .(DT1ID = ID)]

# Start with one ID and one matching column
DT2ID <- DT2$ID[1]
DTmatch[, (DT2ID) := 0]
matchingCols <- c("match1")

# Code for first ID and match1, to be adapted for all IDs and all columns
DTmatch[, (DT2ID) := get(DT2ID) + as.numeric(DT1[[matchingCols]] == DT2[ID==DT2ID][[matchingCols]])]

# First attempt at looping over all IDs and columns; it does not work because NA
# comparisons propagate NA into the counts, and the commented-out is.na() guard
# fails since its condition is a whole column rather than a single TRUE/FALSE.
for (thisID in DT2$ID) {
  DTmatch[, (thisID) := 0]
  for (matchingCol in matchingCols) {
#    if (!is.na(DT1[, matchingCol, with=F]) & !is.na(DT2[ID==thisID, matchingCol, with=F])) {
      DTmatch[, (thisID) := get(thisID) + as.numeric(DT1[[matchingCol]] == DT2[ID==thisID][[matchingCol]])]
#    }
  }
}
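
For reference, a minimal NA-safe variant of the loop above (a sketch only, assuming all four match columns are used) could treat any comparison involving NA as a non-match instead of letting NA propagate into the counts; the unique-assignment step is still left open:

# Sketch: count matches over all four columns, treating NA as "no match"
matchingCols <- paste0("match", 1:4)
for (thisID in DT2$ID) {
  DTmatch[, (thisID) := 0]
  for (matchingCol in matchingCols) {
    isMatch <- DT1[[matchingCol]] == DT2[ID == thisID][[matchingCol]]
    isMatch[is.na(isMatch)] <- FALSE   # NA on either side counts as a non-match
    DTmatch[, (thisID) := get(thisID) + as.numeric(isMatch)]
  }
}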
Kayle Sawyer

1 Answer


Perhaps this is an option to start with:

First, create a new column by pasting together all values from the match columns:

#create new column based on matching cols
DT1[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]
DT2[, col_join := do.call( paste, c(.SD, sep="")), .SDcols= match1:match4][]

Then, using the fuzzyjoin package, you can perform a join based on string distance. Below, the maximum distance is set to 2, so if no matching string is found within a distance of 2, the result of the join will be <NA>.
You can (and should) experiment with the different stringdist methods and the maximum distance...

library(fuzzyjoin)
result <- stringdist_join( DT2, DT1, 
                           by = "col_join", 
                           max_dist = 2, 
                           mode = "left", 
                           distance_col = "string_distance" )

result[,c(1,8,9,16,17)][]
# ID.x col_join.x ID.y col_join.y string_distance
# 1:   55      b779m    1       b70m               2
# 2:   56      b810m    1       b70m               2
# 3:   56      b810m    2       b80m               1
# 4:   56      b810m    3       b90m               2
# 5:   57      4911m   NA       <NA>              NA
# 6:   58     a1012m   NA       <NA>              NA
# 7:   59    a1113NA   NA       <NA>              NA
# 8:   60     c1214f    6     cNA14f               2
# 9:   61    j13155f   NA       <NA>              NA
# 10:   62     j1416f   NA       <NA>              NA

As you can see, you will still have to figure out some things, such as what to do with NA values.
With fuzzy joining there is always (in my opinion) a lot of trial-and-error involved, and often you will have to accept that 'the perfect answer' is just not out there...
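
One possible way to deal with the NA values, assuming an empty string is an acceptable stand-in, is to clean the match columns before pasting and to use a separator so that single-character values cannot run together (e.g. "b.7.0.m" instead of "b70m"); a minimal sketch:

# Sketch: replace NA with "" in the match columns, then paste with a separator
matchCols <- paste0("match", 1:4)
for (dt in list(DT1, DT2)) {
  dt[, (matchCols) := lapply(.SD, function(x) {
    x <- as.character(x)
    x[is.na(x)] <- ""
    x
  }), .SDcols = matchCols]
  dt[, col_join := do.call(paste, c(.SD, sep = ".")), .SDcols = matchCols]
}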

Wimpel
  • Thanks for the fuzzyjoin suggestion. In this example, all rows can be matched by replacing NA with `""`, using `sep="."` for the pasting, and setting `method = "jw", p = 0.1` for the stringdist. I figured out how to match by hand, but I need to figure out a way to code it: A) sort by lowest string_distance; B) for each value of string_distance, identify any unique matchings of ID.y with ID.x and then remove all other ID.x from the list; C) repeat step B. – Kayle Sawyer Feb 19 '19 at 05:15
  • After using sep="." with Jaro-Winkler, the following code appears to work: `result[result[, .I[string_distance == min(string_distance)], by=ID.x]$V1][result[, .I[1], by = ID.y]$V1][order(ID.y)]`. – Kayle Sawyer Feb 19 '19 at 06:29
  • Alas, my code that took the minimum string_distance for ID.x, then ID.y, worked on the example datasets but not on my dataset. It's a challenging puzzle to pick out the best matching using string_distance, because stringdist_join allows each row in DT1 to match multiple rows from DT2. It needs to be a unique (1-to-1) matching, taking the lowest unambiguous string_distance values first (see the sketch after these comments). – Kayle Sawyer Feb 20 '19 at 03:33
  • @KayleSawyer after the join, you can select the match with the minimum stringdist. – Wimpel Feb 20 '19 at 07:15
  • The minimum stringdist is not sufficient because I need a unique ID.x for each ID.y, and each ID.y and ID.x must be used only once (i.e. they must be matched up). I posted a new question here: https://stackoverflow.com/questions/54781936/select-unique-x-and-y-pairs-to-minimize-a-value – Kayle Sawyer Feb 20 '19 at 08:36
  • This answer completes the fuzzy join by selecting the matches with the best string_distance: https://stackoverflow.com/a/54782546/ – Kayle Sawyer Mar 04 '19 at 01:31
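
As an illustration of the one-to-one assignment described in these comments (not the code from the linked answer), a simple greedy sketch could repeatedly take the smallest remaining string_distance and retire both IDs involved; it assumes the result table produced by the stringdist_join above:

# Greedy 1-to-1 assignment sketch: smallest string_distance first, each ID used once
setDT(result)                         # make sure the join result is a data.table
pairs <- result[!is.na(string_distance), .(ID.x, ID.y, string_distance)]
setorder(pairs, string_distance)      # best (smallest) distances first

assigned <- pairs[0]                  # empty table with the same columns
while (nrow(pairs) > 0) {
  best     <- pairs[1]                # lowest remaining distance
  assigned <- rbind(assigned, best)
  pairs    <- pairs[ID.x != best$ID.x & ID.y != best$ID.y]   # retire both IDs
}
assigned[order(ID.x)]

Ties are broken arbitrarily here; the step B/C procedure in the comments, which only commits to unambiguous pairs at each distance value, would need extra bookkeeping.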