1

I need to select a unique ID.x for each ID.y (forming unique pairs) that minimizes a distance value, starting from the lowest distance values. I feel like it's a bit like a sudoku puzzle because each x and y can only be used once, so information from each pair allows for matching other pairs.

In the example below, ID.x 55 is a better match for ID.y 1 than ID.x 56 is, because ID.x 56 is a better match for ID.y 2. Similarly, ID.x 58 can be matched to ID.y 4, because any other available option would be a greater distance, and ID.y 5 can then take ID.x 59 at distance 4. However, ID.y 7 cannot be matched because ID.x 61 and ID.x 62 are equally close.

Example:

DT = data.table(
  ID.x = c("55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62"),
  ID.y = c("1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7", "1", "2", "3", "4", "5", "6", "7"),
  distance = c("2", "3", "3", "4", "6", "6", "7", "2", "1", "2", "5", "5", "5", "6", "4", "4", "3", "5", "5", "5", "6", "5", "5", "5", "4", "4", "5", "6", "7", "7", "7", "6", "4", "6", "7", "6", "6", "6", "6", "4", "2", "5", "7", "7", "7", "7", "5", "5", "5", "6", "6", "6", "6", "4", "4", "5")
  )

Goal:

   ID.x ID.y distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   NA    7        NA

This first attempt, inspired by this question, does not work:

DT[DT[, .I[distance == min(distance)], by=ID.x]$V1][DT[, .I[1], by = ID.y]$V1]

UPDATE: In response to the answers by @chinsoon12 and @paweł-chabros, here is an updated data.table that fixes a few things. It swaps x and y (my original question was matching x's with y's, but the more natural interpretation is y with x). This example removes the ambiguous matching for ID.y 7. In this example, the lowest distance matches ID.x 63. Separately, I also added a new ID.y 8, to clarify when no unambiguous match is possible (it matches ID.x 64 and 65 equally well). The answer should not select a match arbitrarily.

DT = data.table(
  ID.y = c("55", "55", "55", "55", "55", "55", "55", "55", "56", "56", "56", "56", "56", "56", "56", "56", "57", "57", "57", "57", "57", "57", "57", "57", "58", "58", "58", "58", "58", "58", "58", "58", "59", "59", "59", "59", "59", "59", "59", "59", "60", "60", "60", "60", "60", "60", "60", "60", "61", "61", "61", "61", "61", "61", "61", "61", "62", "62", "62", "62", "62", "62", "62", "62", "63", "63", "63", "63", "63", "63", "63", "63", "64", "64", "64", "64", "64", "64", "64", "64", "65", "65", "65", "65", "65", "65", "65", "65"),
  ID.x = c("1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8", "1", "2", "3", "4", "5", "6", "7", "8"),
  distance = c(2, 3, 3, 4, 6, 6, 7, 15, 2, 1, 2, 5, 5, 5, 6, 15, 4, 4, 3, 5, 5, 5, 6, 15, 5, 5, 5, 4, 4, 5, 6, 15, 7, 7, 7, 6, 4, 6, 7, 15, 6, 6, 6, 6, 4, 2, 5, 15, 7, 7, 7, 7, 5, 5, 6, 15, 6, 6, 6, 6, 4, 4, 10, 15, 11, 11, 11, 11, 11, 11, 5, 12, 11, 11, 11, 11, 11, 11, 11, 1, 11, 11, 11, 11, 11, 11, 11, 1)
  )

Expected Result:

   ID.y ID.x distance
1:   55    1        2
2:   56    2        1
3:   57    3        3
4:   58    4        4
5:   59    5        4
6:   60    6        2
7:   63    7        5
8:   NA    8        NA

I'm using this code is to complete a fuzzy join using stringdist_join, as described in this question. I have two datasets that need matching (hence the ID.x and ID.y). In my case, I have pre-test and post-test scores that need to be matched by multiple unreliable characteristics.

Kayle Sawyer
  • 549
  • 7
  • 22
  • I think that if the next result depends on the previous one then loop is needed. – Paweł Chabros Feb 20 '19 at 08:45
  • can you explain why ID.x 62 and ID.y 7 distance 5 is not feasible? – chinsoon12 Feb 20 '19 at 08:54
  • The issue with ID.y 7 is that it matches both ID.x 61 and ID.x 62 equally well (both are dist 5). In this example, I don't see any way to choose one over the other, except arbitrarily, so I think it would be best to keep ID.y 7 as NA. If we were to do it the other way around, selecting a unique ID.y for each ID.x, why wouldn't 61 be matched with 7 while 62 gets the NA? Again (in this reverse-direction scenario), it seems the best solution would be for both 61 and 62 to get no match from ID.y. – Kayle Sawyer Feb 21 '19 at 00:54

3 Answers3

1

Not clear to me why why ID.x 62 and ID.y 7 distance 5 is not feasible.

Assuming that ID.x 62, ID.y 7 and distance 5 is acceptable, a possible approach using data.table:

setorder(DT, distance)
choseny <- c()
ans <- DT[,
    {
        y <- setdiff(ID.y, choseny)[1L]
        choseny <- c(choseny, y)  
        .(ID.y=y, dist=.SD[ID.y==y, distance[1L]])
    },
    by=.(ID.x)]
setorder(ans, ID.x)[]

output:

   ID.x ID.y dist
1:   55    1    2
2:   56    2    1
3:   57    3    3
4:   58    4    4
5:   59    5    4
6:   60    6    2
7:   61 <NA> <NA>
8:   62    7    5
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
0

I am not sure if that's really the desired solution, but it should be helpful. Not super elegant, but it pretty much looks like the desired output.

 DT[, .(ID.y
     , distance
     , Row.Num = rank(distance)
     , Row.Num.ID = rank(ID.y)), by = list(ID.x)][, .SD[Row.Num == min(Row.Num) ], by = ID.x][, .SD[Row.Num.ID == min(Row.Num.ID) ], by = ID.x] 
 >  ID.x ID.y distance Row.Num Row.Num.ID
1:   55    1        2     1.0          1
2:   56    2        1     1.0          2
3:   57    3        3     1.0          3
4:   58    4        4     1.5          4
5:   59    5        4     1.0          5
6:   60    6        2     1.0          6
7:   61    5        5     2.0          5
8:   62    5        4     1.5          5
hannes101
  • 2,410
  • 1
  • 17
  • 40
  • The important part is the UNIQUE aspect. Each ID.x and ID.y must only appear once in the output, such that distance is the lowest possible for each pair. It has to be a one-to-one matching. – Kayle Sawyer Feb 21 '19 at 00:40
0

I don't know data.table well so I can give you only tidyverse solution. But maybe it will help you :)

library(tidyverse)

ID_y <- unique(DT$ID.y)

DT %>%
  as_tibble() %>%
  group_by(ID.x) %>%
  mutate(min_dist = min(distance)) %>%
  arrange(min_dist) %>%
  nest() %>%
  mutate(data = data %>% map(~ {
    min_row <- .x %>%
      filter(ID.y %in% ID_y) %>%
      filter(distance == min(distance)) %>%
      slice(1)
    ID_y <<- ID_y[ID_y != min_row$ID.y]
    min_row
  })) %>%
  unnest() %>%
  select(-min_dist) %>%
  arrange(ID.x)

I am saving all unique values of ID.y. Then I calculate minimum distance for all combinations and arrange by this minimum distance to tackle those ones at first in map loop. After filtering the minimum distance I remove ID.y from the vector, so other ID.x are searching only in ID.y's that left.

Paweł Chabros
  • 2,349
  • 1
  • 9
  • 12
  • This code works if you remove the slice(1), and then add the following three lines: `DT <- as.data.table(DT)` `DT[, num_matches := .N, by = .(ID.x)]` `DTunique <- unique(DT[num_matches > 1, ID.y := NA])` – Kayle Sawyer Feb 26 '19 at 04:42
  • However, I get the following error in my dataset: `Warning messages: 1: In ID_y != min_row$ID.y : longer object length is not a multiple of shorter object length` – Kayle Sawyer Feb 26 '19 at 04:57