6

Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? Please see my example below and the way I usually do it. However, this seems rather inefficient.

Example

library(data.table)

# initial data
DT = data.table(x = c(1,1,1,2,2,1,1,2,2), 
                y = c(1,3,6))

# data to merge
DTx <- data.table(x = 1:3,
                  y = 1,
                  k = "X")

# regular update join
copy(DT)[DTx,
         on = .(x, y),
         k := i.k][]
#>    x y    k
#> 1: 1 1    X
#> 2: 1 3 <NA>
#> 3: 1 6 <NA>
#> 4: 2 1    X
#> 5: 2 3 <NA>
#> 6: 1 6 <NA>
#> 7: 1 1    X
#> 8: 2 3 <NA>
#> 9: 2 6 <NA>

# DTx remains the same
DTx
#>    x y k
#> 1: 1 1 X
#> 2: 2 1 X
#> 3: 3 1 X

What I usually do:

# set an Id variable
DTx[, Id := .I]

# assign the Id in merge
DT[DTx,
   on = .(x, y),
   `:=`(k = i.k,
        matched_id = i.Id)][]
#>    x y    k matched_id
#> 1: 1 1    X          1
#> 2: 1 3 <NA>         NA
#> 3: 1 6 <NA>         NA
#> 4: 2 1    X          2
#> 5: 2 3 <NA>         NA
#> 6: 1 6 <NA>         NA
#> 7: 1 1    X          1
#> 8: 2 3 <NA>         NA
#> 9: 2 6 <NA>         NA

# use matched_id to find merged rows
DTx[, matched := fifelse(Id %in% DT$matched_id, TRUE, FALSE)]
DTx
#>    x y k Id matched
#> 1: 1 1 X  1    TRUE
#> 2: 2 1 X  2    TRUE
#> 3: 3 1 X  3   FALSE
mnist
  • 6,571
  • 1
  • 18
  • 41
  • I've not seen any merge/join mechanism that supports modifying *both* frames in-place, even with `data.table`'s referential semantics. I think your use of `matched := ...` is the best I can think of at the moment. – r2evans Dec 02 '21 at 16:37
  • 2
    (Though frankly there's no need here for `fifelse` ... just `matched := Id %in% DT$matched_id` will suffice.) – r2evans Dec 02 '21 at 17:04
  • good point! and thank you for your thoughts – mnist Dec 02 '21 at 17:22
  • thanks for hinting on which. However, it only returns the rows in `DT`, hence I still would need to join twice which is often less efficient than my regular alternative – mnist Dec 02 '21 at 19:09
  • 2
    To have matches on both sides you need to access internal function. As of now I am not sure if there is one that you can obtain that nicely. Look into merglist PR https://github.com/Rdatatable/data.table/pull/4370 see dtmerge function. – jangorecki Dec 02 '21 at 22:52
  • This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables. – jangorecki Dec 02 '21 at 22:55
  • thank you, a very interesting PR! How would your solution (dummy code) look like? And just out of curiosity: How difficult would it be to implement the feature I was looking for? My guess would have been that this information would be available somewhere during joining but is just not used? Or, based on your profound knowledge, this would require a completely different join-logic. Thinking about a PR by myself... – mnist Dec 02 '21 at 23:08
  • Other than this instance by mnist, I would be curious to know if and/or how frequently others have had the same need to update both frames symmetrically/simultaneously. – r2evans Dec 03 '21 at 00:35
  • @r2evans I've flagged rows matched in a join anyway (the specific update-two-tables case here) and here's another Q with the same goal https://stackoverflow.com/questions/43660562/find-matches-to-several-tables-conditional-full-join-using-data-table I usually count matches instead of using a true/false flag as part of debugging (eg, there's a problem when I expect a 1:1 match and find some zeros or 2s). – Frank Dec 03 '21 at 06:44
  • @mnist I voted to leave this question open: [link to review](https://stackoverflow.com/review/close/30715570). It needs one more vote to leave open from a reviewer to bump it out of the close vote review queue. – karel Jan 02 '22 at 09:45

1 Answers1

1

Following Jan's comment:

This will provide you indices of matching rows but you will have to call merge again to perform actual merging, unless you manually use provided indices to match/update those tables.

You can pull the indices:

merge_metaDT = DT[DTx, on=.(x, y), .(irow = .GRP, xrow = .I), by=.EACHI]

   x y irow xrow
1: 1 1    1    1
2: 1 1    1    7
3: 2 1    2    4
4: 3 1    3    0

Then apply edits to each table using indices rather than merging or matching a second time:

rowDT = merge_metaDT[xrow != 0L]
DT[rowDT$xrow, k := DTx[rowDT$irow, k]]
DTx[, matched := FALSE][rowDT$irow, matched := TRUE]

How it works:

  • When joining, x[i], the symbol .I indexes rows of x
  • When grouping in a join with by=.EACHI, .GRP indexes each group, which means each row of i here
  • We drop the non-matching values of .I which are coded as zeros

On this last point, we might expect NAs instead of zeros, as returned by DT[DTx, on=.(x, y), which=TRUE]. I'm not sure why these differ.


Suppose I would like to track which rows from one data.table were merged to another data.table. is there a way to do this at once/while merging? [...] seems rather inefficient.

I expect this is more efficient than multiple merges or %in% when the merge is costly enough.

It still requires multiple steps. I doubt there's any way around that, since it would be hard to come up with logic and syntax for the update that is easy to follow.

Update logic is already complex in base R, with multiple edits on a single index allowed:

> x = c(1, 2, 3)
> x[c(1, 1)] = c(4, 5)
> x
[1] 5 2 3

And there is the question of how to match and edit multiple indices at once:

> x = c(1, 1, 3)
> x[match(c(1, 3), x)] = c(4, 5)
> x
[1] 4 1 5

In data.table updates, the latter issue is handled with mult=. In the update-two-tables use case, these questions would get much more complicated.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Thank you for our answer and sorry for just upvoting, not commenting. This approach is very helpful! It is just not the kind solution I was looking for but rather a (more efficient) workaround. Will not accept it therefore since there might be a "real" solution anytime in the future. – mnist Dec 08 '21 at 11:45