4

I am struggling again to understand how the mult argument is working when performing an update-on-join. What I am trying to do is to implement a left-join as defined in lj.

For performance reasons I'd like to update the left table

The "un-trivial" part is that when the left table and the right table have a column in common, (not considering the join columns), I'd like to use the first value in the right table to override the value of the left table. I thought mult would help me dealing with this multiple match issue but I cannot get it right

library(data.table)
X <- data.table(x = c("a", "a", "b", "c", "d"), y = c(0, 1, 1, 2, 2), t = 0:4)
X                 
#        x     y     t
#   <char> <num> <int>
#1:      a     0     0
#2:      a     1     1
#3:      b     1     2
#4:      c     2     3
#5:      d     2     4

Y <- data.table(xx = c("f", "b", "c", "c", "e", "a"), y = c(2, NA, 3, 4, 5, 6), u = 2:7)
Y                 
#       xx     y     u
#   <char> <num> <int>
#1:      f     2     2
#2:      b    NA     3
#3:      c     3     4
#4:      c     4     5
#5:      e     5     6
#6:      a     6     7

# Expected result
#        x     y     t                                                 
#   <char> <num> <int>                                                 
#1:      a     6     0    <= single match on xx == "a" so Y[xx == "a", y] is used                                                
#2:      a     6     1    <= single match on xx == "a" so Y[xx == "a", y] is used                                             
#3:      b    NA     2    <= single match on xx == "b" so Y[xx == "b", y] is used                                             
#4:      c     3     3    <= mult match on xx == "c" so Y[xx == "c", y[1L]] is used                                             
#5:      d     2     4    <= no xx == "d" in Y so nothing changes


copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "first"][]
#        x     y     t                                                   
#   <char> <num> <int>                                                   
#1:      a     6     0                                                   
#2:      a     1     1   <= a should always have the same value ie 6                                                
#3:      b    NA     2                                                   
#4:      c     4     3   <= y == 4 is not the first value of y in the Y table                                                
#5:      d     2     4                                                   
    
# Using mult = "all" is the closest I get from the right result
copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "all"][]
#        x     y     t                                                 
#   <char> <num> <int>                                                 
#1:      a     6     0                                                 
#2:      a     6     1                                                 
#3:      b    NA     2                                                 
#4:      c     4     3    <= y == 4 is not the first value of y in the Y table                                             
#5:      d     2     4  

Can someone explain to me what's wrong in the above ?

I guess I could use Y[X, ...] to get to what I want, the issue is that X is very large and the performance I get is much worse using Y[X, ...]

statquant
  • 13,672
  • 21
  • 91
  • 162

2 Answers2

4

I'd like to use the first value in the right table to override the value of the left table

Select the first values and update with them alone:

X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]

   x  y t
1: a  6 0
2: a  6 1
3: b NA 2
4: c  3 3
5: d  2 4

fromLast= can select the first or last row when dropping dupes.


How multiple matches are handled:

In x[i, mult=], if a row of i has multiple matches, mult determines which matching row(s) of x are selected. This explains the results shown in the OP.

In x[i, v := i.v], if multiple rows of i match to the same row in x, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):

options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
   a b
1: 1 4
Frank
  • 66,179
  • 8
  • 96
  • 180
  • 1
    Thanks, I saw this on your https://franknarf1.github.io/r-tutorial/_book/tables.html but why the result changed by changing mult was disconcerting – statquant Jul 07 '21 at 12:23
2

mult is always equal to "last" in case of update on join with :=

I recall it was described somewhere in documentation.

jangorecki
  • 16,384
  • 4
  • 79
  • 160