
Can someone explain in layman's terms what the difference is between these two approaches, besides column order?

library(data.table)

A <- data.table(id = letters[1:10], amount = 1:10)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# Approach 1: left-join B onto A (B[A]) and assign the result back to A
A <- B[A, on = .(id), mult = "first"]
format(object.size(A), units = "b")
A

A <- data.table(id = letters[1:10], amount = 1:10)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# Approach 2: add the matching comments to A by reference with :=
A[, comment := B[A, on = .(id), x.comment, mult = "first"]]
A
format(object.size(A), units = "b")

I use the set* functions quite often in data.table to update and modify data, but I can't work out what the real advantage of doing it in a join is. What happens internally when I join and assign to the same object? Is it the same as modifying the original data.table in place, or is it making a copy?
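For what it's worth, here is how I tried to check whether a copy happens. I'm not sure this is the definitive test; data.table::address() just reports where the object lives in memory, so a changed address should mean a new object was created:

library(data.table)

A <- data.table(id = letters[1:10], amount = 1:10)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

address(A)                     # address of A before

# first approach: the join builds a new table and A is re-pointed at it
A <- B[A, on = .(id), mult = "first"]
address(A)                     # different address: a new object

A <- data.table(id = letters[1:10], amount = 1:10)
address(A)                     # address of the rebuilt A

# second approach: := adds the column to the existing table
A[, comment := B[A, on = .(id), x.comment, mult = "first"]]
address(A)                     # same address: modified in place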

I have already read the topic "update by reference" vs shallow copy and the data.table vignettes, but I'm still not understanding it.

Edit: I don't know if this is the right way to time it, but it looks like the second approach is a lot faster than the first one with 10^6 replications of table A.
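I don't have the exact benchmark script anymore, but it was roughly the following (make_A is just a helper of mine, assuming "replications" means stacking the 10-row A a million times; the setup argument rebuilds A before each run so one iteration's result doesn't leak into the next):

library(data.table)
library(microbenchmark)

# helper: A replicated 10^6 times (10^7 rows in total)
make_A <- function() data.table(id = rep(letters[1:10], 1e6), amount = seq_len(1e7))
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# first approach: join and re-assign
microbenchmark(
  A <- B[A, on = .(id), mult = "first"],
  times = 100, setup = { A <- make_A() }
)

# second approach: add the column by reference
microbenchmark(
  A[, comment := B[A, on = .(id), x.comment, mult = "first"]],
  times = 100, setup = { A <- make_A() }
)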

First approach

Unit: milliseconds
                                  expr      min       lq     mean   median       uq      max neval
 A <- B[A, on = .(id), mult = "first"] 856.9123 5120.108 13495.41 9702.625 18861.52 70319.84   100

Second approach

Unit: milliseconds
                                                            expr      min       lq     mean   median       uq      max neval
 A[, `:=`(comment, B[A, on = .(id), x.comment, mult = "first"])] 471.6508 612.1226 627.4387 625.0439 641.7865 753.1218   100

If the above is right, there is a huge advantage to using the second method. Is it just because the first method makes a copy after the join? How does R manage these copies, given that I'm assigning to the same object?

R. Cowboy
  • The first one creates a new data.table and then assigns it back to one of the data.tables. The second one joins, extracts a column, and then adds it to A, which has its other columns *pre-allocated*. Sometimes, for the second one, you can also update some of the rows selectively in i, changing that A to .SD. Besides storage space, you might also want to look at speed and peak memory usage when the number of rows gets large. It really depends on your use case (a sketch of this variant appears after these comments). – chinsoon12 Jul 29 '21 at 01:48
  • @chinsoon12 How can I check memory usage? – R. Cowboy Jul 29 '21 at 02:43
  • Tbh, I have no idea how to measure it accurately. There are some packages that can do that, but I've read in some comments that they are not accurate. – chinsoon12 Jul 29 '21 at 03:53
  • Some more thoughts on pros/cons here: https://stackoverflow.com/a/54313203 Re: how R manages the copies, I think it creates the new table somewhere in memory and then sets the pointer for `A` towards that location. – Frank Jul 29 '21 at 04:56
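For concreteness, the selective update chinsoon12 describes looks roughly like this (a sketch based on my reading of the comment, not code from the thread):

library(data.table)

A <- data.table(id = letters[1:10], amount = 1:10)
B <- data.table(id = c("c", "d", "e"), comment = c("big", "slow", "nice"))

# update join: assign by reference, touching only the rows of A that match B;
# i.comment refers to B's comment column inside the join
A[B, on = .(id), comment := i.comment]

# selective variant: restrict the update to a subset of A's rows via i,
# looking up matches for just that subset with .SD
A2 <- data.table(id = letters[1:10], amount = 1:10)
A2[amount > 3, comment := B[.SD, on = .(id), x.comment, mult = "first"]]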

0 Answers