2

I am trying to merge two data.tables that result in multiple matches, to resolve this I want to select at random one of the multiple matches.

Say i have dt1 as such. Note, dt1 has 100+ columns but for demonstration purposes I've only written out three.

dt1 = data.table(x1 = c(100, 100, 200, 200, 200), 
                 x2 = c(101, 102, 201, 202, 203), 
                 v1 = c(1,2,3,4,5),
                 ...
                 vn = ...)

    x1  x2 v1
1: 100 101  1
2: 100 102  2
3: 200 201  3
4: 200 202  4
5: 200 203  5

Similarly dt2 looks like such. It also has more columns than what is shown below:

dt2 = data.table(x1 = c(100, 100, 100, 200, 200),
                 y1 = c(1, 2, 3, 4, 5),
                 ...)

    x1 y1
1: 100  1
2: 100  2
3: 100  3
4: 200  4
5: 200  5

What I want to do is merge on dt1 onto dt2 using column x1 to result in something like this:

    x1  x2 v1 y1
1: 100 101  1  1
2: 100 101  1  2
3: 100 102  2  3
4: 200 201  3  4
5: 200 203  5  5

This solution is very close to what I want Join data frames and select random row when there are multiple matches however, since both dt1 and dt2 has so many columns I cannot list out every single column (regardless of which way I want to merge dt1 to dt2 or dt2 to dt1).

starja
  • 9,887
  • 1
  • 13
  • 28
mattidore
  • 23
  • 4
  • Is this link helful to you: https://stackoverflow.com/questions/24191497/left-join-only-selected-columns-in-r-with-the-merge-function – Ravi Saroch Oct 30 '20 at 04:40
  • 2
    What about if you `sample` the rows in `dt2` first and then use `mult="first"` to take the first row after scrambling? Does that meet your requirement for randomness? `dt1[dt2[sample(.N)], on = .(x1), mult="first"]` – thelatemail Oct 30 '20 at 04:49
  • @thelatemail thanks for your answer. It's close but not quite. What I need is for each `x1` match to have a chance at merging on either `x2 = 101` or `x2 = 102`. Similarly for each `x2` match to have a chance at merging one of 201, 202, or 203. The end result would be that every duplicate match does not have all the same rows. – mattidore Nov 01 '20 at 03:11
  • I see now in the link to the other answer that I posted there is some development around enhanced functionality of the `mult` argument in order to accept `mult = 'random'` which I think is exactly what I need in this case. – mattidore Nov 01 '20 at 03:20
  • Hmmm, what about `dt1[dt2[, sample(.N, replace=TRUE) , by=x1], on = .(x1), mult="first"]` where you sample with replacement within each `x1` group and then merge? That should make it a totally random choice from each set of possible values, with reuse of the values allowed too. – thelatemail Nov 02 '20 at 01:29

0 Answers0