I would like to modify a data.table by doing a join inside a function. I understand that data.tables work by reference, so I assumed that assigning the joined result back to the same name would change the original data.table. What simple thing have I misunderstood?

Thanks!

library('data.table')

# function to restrict DT to subset, by join
join_test <- function(DT) {
    test_dt     = data.table(a = c('a', 'b'), c = c('x', 'y'))
    setkey(test_dt, 'a')
    setkey(DT, 'a')

    DT  <- DT[test_dt]
}

DT  = data.table(a = c("a","b","c"), b = 1:3)
print(DT)
#    a b
# 1: a 1
# 2: b 2
# 3: c 3
haskey(DT)
# [1] FALSE

join_test(DT)
print(DT)
#    a b
# 1: a 1
# 2: b 2
# 3: c 3
haskey(DT)
# [1] TRUE

(The haskey calls are included only to double-check that some by-reference changes, here the setkey call, do reach the original table.)

  • I think `DT <- DT[test_dt]` does not change the original DT; it assigns a new object to the name DT. That new DT is created inside the function, so it has nothing to do with the DT outside the function's scope. – mt1022 Oct 07 '16 at 13:40
  • I think you're right: `test_2 <- function(DT) { test_dt = data.table(a = c('a', 'b'), c = c('x', 'y')); setkey(test_dt, 'a'); setkey(DT, 'a'); print(address(DT)); DT <- DT[test_dt]; print(address(DT)) }; DT = data.table(a = c("a","b","c"), b = 1:3); address(DT); test_2(DT)` – user3423201 Oct 07 '16 at 13:49
  • @dww Nah, it can assign by reference. I think the issue here is that the assignment happens inside the function call's scope (instead of in the parent environment). – Frank Oct 07 '16 at 13:50
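
The point made in these comments can be checked with `address()`. The following is a minimal sketch, not the asker's code: `rebind_test` is a hypothetical helper, and the subset `DT[a %in% c("a", "b")]` stands in for the join. It suggests that `setkey` acts on the caller's table, while `<-` only binds a new object to the local name.

```r
library(data.table)

# Hypothetical helper: records the address of DT before and after `<-`
rebind_test <- function(DT) {
    setkey(DT, a)                    # by reference: the caller's table gets keyed
    before <- address(DT)
    DT <- DT[a %in% c("a", "b")]     # new object bound to the *local* name DT
    after <- address(DT)
    c(before = before, after = after)
}

DT <- data.table(a = c("a", "b", "c"), b = 1:3)
outer_addr <- address(DT)
addrs <- rebind_test(DT)

addrs[["before"]] == outer_addr   # TRUE: same object as the caller's DT
addrs[["after"]] == outer_addr    # FALSE: the local DT now points elsewhere
nrow(DT)                          # 3: the caller's rows are untouched
haskey(DT)                        # TRUE: the key was set by reference
```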

2 Answers

You can do it by reference, since you can join and assign columns by reference based on the joined values without saving the joined table back. However, you need to explicitly pick the columns you're after:

join_test <- function(DT) {
    test_dt     = data.table(a = c('a', 'b'), c = c('x', 'y'))
    # update join: add column c to DT in place, matching rows on a
    DT[test_dt, c := c, on = 'a']
}
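
As a usage sketch of the update join above: it writes `c := i.c` rather than `c := c`, which is my own choice to make explicit that the value comes from `test_dt`; the rest follows the answer. The table keeps all three rows, the unmatched row gets `NA`, and the object's address does not change.

```r
library(data.table)

join_test <- function(DT) {
    test_dt <- data.table(a = c('a', 'b'), c = c('x', 'y'))
    # i.c refers to column c of test_dt (the table in the i position)
    DT[test_dt, c := i.c, on = 'a']
    invisible(DT)
}

DT <- data.table(a = c("a", "b", "c"), b = 1:3)
addr_before <- address(DT)
join_test(DT)

DT$c                                  # "x" "y" NA
nrow(DT)                              # 3: no rows were dropped
identical(addr_before, address(DT))   # TRUE: same object, modified in place
```

To end up with only the matched rows, a non-reference step is still needed afterwards, for example `DT <- DT[!is.na(c)]` at the call site.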
Shape
  • Hi, although this seems like a 'data.table' way of doing things, it doesn't work at all. Even when I add the necessary setkey commands inside the function, I get this error: ``Error in `[.data.table`(DT, test_dt, by = "a", `:=`(c, c)): invalid type/length (builtin/3) in vector allocation``. It also fails if I swap the by and := expressions into the usual order. – user3423201 Oct 07 '16 at 14:04
  • Although using `DT[test_dt, c := c, on = 'a']` works better - this joins the additional column, but doesn't enforce matching on `a`, so I get all rows but some with NA in the `c` column. – user3423201 Oct 07 '16 at 14:09
  • @user3423201 Quite right about the `on`; I added `by` at the end (merge syntax) without testing. You should expect to see NA in the c column, since you cannot add or remove rows by reference. – Shape Oct 07 '16 at 14:30
  • That rows cannot be added or removed by reference is the critical thing here - thanks. – user3423201 Oct 07 '16 at 14:36
  • Link for no row changes by reference: http://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table – user3423201 Oct 07 '16 at 14:37

Having your function return the data table and storing the result in DT will get you what you want.

join_test <- function(DT) {
  test_dt     = data.table(a = c('a', 'b'), c = c('x', 'y'))
  setkey(test_dt, 'a')
  setkey(DT, 'a')

  DT  <- DT[test_dt]

  return(DT)
}

DT  = data.table(a = c("a","b","c"), b = 1:3)

DT <- join_test(DT)
print(DT)
#    a b c
# 1: a 1 x
# 2: b 2 y
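
A possible variation, in case you would rather not have the function key the caller's table as a side effect: join with `on =` instead of `setkey`. The name `join_test_on` is made up for this sketch, which assumes a data.table version that supports `on =`.

```r
library(data.table)

# Hypothetical variant of join_test: no setkey(), so the caller's DT is not modified
join_test_on <- function(DT) {
    test_dt <- data.table(a = c('a', 'b'), c = c('x', 'y'))
    DT[test_dt, on = 'a']   # returns a new table with columns a, b, c
}

DT <- data.table(a = c("a", "b", "c"), b = 1:3)
res <- join_test_on(DT)

res$c        # "x" "y"
nrow(res)    # 2
haskey(DT)   # FALSE: the original table was not keyed
nrow(DT)     # 3: the original table keeps all its rows
```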
MeetMrMet
  • Yes, that does solve the problem, thank you! But it doesn't help me understand how I could do a join within a function by reference, which is really why I was asking the question. It feels like there should be a more 'data.table' way of doing this. – user3423201 Oct 07 '16 at 13:47