I need to pair each ID.y with a unique ID.x so that the distance is minimized, assigning pairs starting from the lowest distance values. It is a bit like a sudoku puzzle: each x and each y can be used only once, so every pair that gets fixed narrows the options for the remaining pairs.
In the example below, ID.x 55 is a better match for ID.y 1 than ID.x 56 is, because ID.x 56 is a better match for ID.y 2. Similarly, ID.x 58 can be matched to ID.y 4, because any other available option would be a greater distance, and ID.y 5 can then take ID.x 59 at distance 4. However, ID.y 7 cannot be matched because ID.x 61 and ID.x 62 are equally close.
Example:
library(data.table)
DT = data.table(
ID.x = c("55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62"),
ID.y = c("1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7"),
distance = c(2, 3, 3, 4, 6, 6, 7, 2, 1, 2, 5, 5, 5, 6, 4, 4, 3, 5, 5, 5, 6, 5, 5, 5, 4, 4, 5, 6, 7, 7, 7, 6, 4, 6, 7, 6, 6, 6, 6, 4, 2, 5, 7, 7, 7, 7, 5, 5, 5, 6, 6, 6, 6, 4, 4, 5)
)
Goal:
   ID.x ID.y distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   NA    7       NA
This first attempt, inspired by this question, does not work:
DT[DT[, .I[distance == min(distance)], by=ID.x]$V1][DT[, .I[1], by = ID.y]$V1]
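To show where this attempt goes wrong, here is a minimal sketch that splits the one-liner into its two steps. It assumes the first example's values, rebuilt with rep() and with distance stored as numeric. The names step1, idx and attempt are introduced only for illustration. The second step computes row numbers on the full DT and then applies them to the already filtered table, so it does not select one row per ID.y.

```r
library(data.table)

# Same values as the first example, rebuilt compactly; distance is numeric here.
DT <- data.table(
  ID.x = rep(as.character(55:62), each = 7),
  ID.y = rep(as.character(1:7), times = 8),
  distance = c(2, 3, 3, 4, 6, 6, 7,
               2, 1, 2, 5, 5, 5, 6,
               4, 4, 3, 5, 5, 5, 6,
               5, 5, 5, 4, 4, 5, 6,
               7, 7, 7, 6, 4, 6, 7,
               6, 6, 6, 6, 4, 2, 5,
               7, 7, 7, 7, 5, 5, 5,
               6, 6, 6, 6, 4, 4, 5)
)

# Step 1: rows that attain the minimum distance within each ID.x (ties are kept).
step1 <- DT[DT[, .I[distance == min(distance)], by = ID.x]$V1]

# Step 2: first row number of each ID.y. These numbers index DT, not step1.
idx <- DT[, .I[1], by = ID.y]$V1

# Applying DT's row numbers to step1 just takes rows of step1 by position.
attempt <- step1[idx]
print(attempt)

# TRUE for both columns: the selected pairs are not unique.
print(anyDuplicated(attempt$ID.x) > 0)
print(anyDuplicated(attempt$ID.y) > 0)
```

With this data the result repeats an ID.x and an ID.y, and ID.y 7 does not appear at all, so the output is neither unique pairs nor one row per ID.y.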
UPDATE: In response to the answers by @chinsoon12 and @paweł-chabros, here is an updated data.table that fixes a few things. It swaps x and y (my original question matched x's with y's, but the more natural interpretation is y with x). It removes the ambiguous matching for ID.x 7 (ID.y 7 in the original example): its lowest remaining distance now matches ID.y 63. Separately, I added a new ID.x 8 to clarify what should happen when no unambiguous match is possible (it matches ID.y 64 and 65 equally well). The answer should not select a match arbitrarily.
DT = data.table(
ID.y = c("55", "55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62", "62", "63", "63", "63", "63", "63", "63", "63", "63", "64", "64", "64", "64", "64", "64", "64", "64", "65", "65", "65", "65", "65", "65", "65", "65"),
ID.x = c("1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8"),
distance = c(2, 3, 3, 4, 6, 6, 7, 15, 2, 1, 2, 5, 5, 5, 6, 15, 4, 4, 3, 5, 5, 5, 6, 15, 5, 5, 5, 4, 4, 5, 6, 15, 7, 7, 7, 6, 4, 6, 7, 15, 6, 6, 6, 6, 4, 2, 5, 15, 7, 7, 7, 7, 5, 5, 6, 15, 6, 6, 6, 6, 4, 4, 10, 15, 11, 11, 11, 11, 11, 11, 5, 12, 11, 11, 11, 11, 11, 11, 11, 1, 11, 11, 11, 11, 11, 11, 11, 1)
)
Expected Result:
   ID.y ID.x distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   63    7        5
8:   NA    8       NA
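The tie behind the NA row can be inspected directly. The following is a minimal sketch, assuming the updated data above (rebuilt with rep()). It is a diagnostic only, not a solution to the matching problem, and the names best, ties, n_best and candidates are introduced for illustration. It lists, for every ID.x, all ID.y values that attain the minimum distance.

```r
library(data.table)

# Same values as the updated example, rebuilt compactly.
DT <- data.table(
  ID.y = rep(as.character(55:65), each = 8),
  ID.x = rep(as.character(1:8), times = 11),
  distance = c( 2,  3,  3,  4,  6,  6,  7, 15,
                2,  1,  2,  5,  5,  5,  6, 15,
                4,  4,  3,  5,  5,  5,  6, 15,
                5,  5,  5,  4,  4,  5,  6, 15,
                7,  7,  7,  6,  4,  6,  7, 15,
                6,  6,  6,  6,  4,  2,  5, 15,
                7,  7,  7,  7,  5,  5,  6, 15,
                6,  6,  6,  6,  4,  4, 10, 15,
               11, 11, 11, 11, 11, 11,  5, 12,
               11, 11, 11, 11, 11, 11, 11,  1,
               11, 11, 11, 11, 11, 11, 11,  1)
)

# For each ID.x, keep every row that attains the minimum distance (ties kept).
best <- DT[, .SD[distance == min(distance)], by = ID.x]

# Count the equally close candidates and list them per ID.x.
ties <- best[, .(n_best = .N, candidates = paste(ID.y, collapse = ",")), by = ID.x]
print(ties)
```

A raw tie count is not enough on its own to decide a match. ID.x 7 is equally close to ID.y 60 and 63, but 60 is much closer to ID.x 6 (distance 2), which leaves 63. ID.x 8 is equally close to ID.y 64 and 65, and nothing else in the table separates them, which is why that row should stay NA.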
I'm using this code to complete a fuzzy join using stringdist_join, as described in this question. I have two datasets that need matching (hence the ID.x and ID.y). In my case, I have pre-test and post-test scores that need to be matched by multiple unreliable characteristics.
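For illustration only, here is a minimal sketch of how a long table of pairwise distances like DT could arise from two datasets. It is not the stringdist_join workflow itself: it uses base R's adist() as a stand-in, and the data frames pre and post, the name column and all values are hypothetical.

```r
# Hypothetical toy data: column names and values are made up for illustration.
pre  <- data.frame(ID.x = c("1", "2"), name = c("jon smith", "ann lee"))
post <- data.frame(ID.y = c("55", "56"), name = c("john smith", "anne lee"))

# adist() (base R, package utils) returns a matrix of edit distances:
# rows follow pre$name, columns follow post$name.
d <- adist(pre$name, post$name)

# Flatten the matrix column by column into one row per (ID.x, ID.y) pair.
pairs <- data.frame(
  ID.x = rep(pre$ID.x, times = nrow(post)),
  ID.y = rep(post$ID.y, each = nrow(pre)),
  distance = as.vector(d)
)
print(pairs)
```

With several unreliable characteristics, one such distance per characteristic could be combined (for example, summed) into the single distance column used above. How to weight them is a separate choice that this sketch does not address.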