I need to assign a "second" id that groups some of the values of my original id. This is my sample data:

dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                   period = c("start", "end", "start", "end", "start", "end"),
                   date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
              class = c("data.table", "data.frame"),
              .Names = c("id", "period", "date"),
              sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-05
3: aaas  start 2012-08-21
4: aaas    end 2013-02-25
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11

The id column needs to be grouped (rows in the same group get the same value in a new column, say id2) according to this list:

> groups
[[1]]
[1] "aaaa" "aaas"

[[2]]
[1] "bbbb"

I used the following code, which seems to work but gives the following warning:

    > dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    Warning message:
    In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x,  :
      Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
    > dt
         id period       date id2
    1: aaaa  start 2012-03-02   1
    2: aaaa    end 2012-03-02   1
    3: aaas  start 2012-08-29   1
    4: aaas    end 2013-02-26   1
    5: bbbb  start 2012-03-31   2
    6: bbbb    end 2013-02-11   2

Could someone briefly explain the nature of this warning and its implications, if any, for the final results? Thanks.

EDIT:

The following code shows where dt is created and how it is passed to the function that gives the warning:

f.main <- function(){
  f2 <- function(x){
    groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
    x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}

f1 <- function(){
  dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}

> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x,  :
  Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
Michele
  • The warning says it: "created manually using structure() or similar". Create your data.table using the `data.table` function. However, it's only a warning and you shouldn't run into major problems (other than slower performance). Also, you can substitute `.BY[[1]]` with `id`. – Roland Jun 16 '13 at 13:04
  • @Roland thanks for the reply, but in the real case the table is not created via `structure`. That is just the (modified) output from `print(dput(x))`, which I used to see what was happening to the table inside my program. I just double-checked: `dt` is generated via `data.table` in a function, return()ed to the main function, which passes it to another function as a parameter, and that is where the `warning` happens – Michele Jun 16 '13 at 13:20
  • Well, make your code representative of your real problem. Show us how you pass the DT between functions. – Roland Jun 16 '13 at 14:03
  • @Roland it's probably related to `list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects), use reflist() instead if needed (to be implemented)`, but I'm not sure I understand that sentence (i.e. how putting `dt` in a list creates an invalid reference) – Michele Jun 16 '13 at 15:10
  • fwiw, a shorter expression to achieve the above is `dt[melt(groups)]`, using `reshape2::melt` – eddi Jun 16 '13 at 17:54
  • it's worth a great thank you :-). I'm a "fanatic" of `melt`, but somehow I missed to see its usage in there this morning. – Michele Jun 16 '13 at 18:23
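
A join-based alternative in the spirit of the `melt(groups)` suggestion can be sketched without reshape2. This is only an illustration, assuming data.table 1.9.6 or later (for `on=`) and R 3.2 or later (for `lengths()`). The lookup table name `map` is made up here, and the date column is left out for brevity:

```r
library(data.table)

dt <- data.table(
  id     = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
  period = c("start", "end", "start", "end", "start", "end")
)
groups <- list(c("aaaa", "aaas"), "bbbb")

# One row per id, tagged with the index of the group it belongs to.
# `map` is a hypothetical name, not something from the original post.
map <- data.table(
  id  = unlist(groups),
  id2 = rep(seq_along(groups), lengths(groups))
)

# Update join: add id2 to dt by reference, matching rows on id.
dt[map, on = "id", id2 := i.id2]
print(dt)
```

The grouping list is flattened once into a small table, so the per-group `vapply` call is not needed.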

3 Answers

Yes, the problem is the list. Here is a simple example:

DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning

mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning

You should avoid copying a data.table into a list (and to be on the safe side I would avoid having a DT in a list at all). Try this:

f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}
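
For completeness, here is a rough sketch of how the rest of the pipeline could look with `f1` written this way. It assumes data.table is installed, passes `groups` in as an argument (the original generates it inside `f2`), drops the date column for brevity, and uses `id` in place of `.BY[[1]]` as suggested in the comments:

```r
library(data.table)

f1 <- function() {
  # Build the data.table inside list() so that no named object is copied.
  list(
    res = data.table(
      id     = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
      period = c("start", "end", "start", "end", "start", "end")
    ),
    other_results = ""
  )
}

f2 <- function(x, groups) {
  # Inside a grouped j, the by-variable `id` has length 1,
  # so `id %in% g` yields a single TRUE/FALSE per group in `groups`.
  x[, id2 := which(vapply(groups, function(g) id %in% g, logical(1))), by = id]
  x
}

groups <- list(c("aaaa", "aaas"), "bbbb")
out <- f2(f1()[["res"]], groups)
print(out)
```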
Roland
  • thanks! it's also possible to remove the `warning` doing: `dt <- copy(mylist1[[1]])` – Michele Jun 16 '13 at 16:41
  • Of course, but with data.tables, which are usually large, the goal is to avoid copies. That's one of the main benefits of the package. – Roland Jun 16 '13 at 16:43
  • I know but `dt` is no more than 10 rows. My program checks lots of different data flows (each of them in a `dt`) performs operations, alerts, job creations, for each customers by sub-setting these `data.table`s and calling different functions (like that one above, `group` is the result of a fun checking for typos) depending on the flow type. I use `dt` mainly because at a certain point these small tables will join with a huge one(50M+) and I like doing `joins` with `data.table`! in particular for the `roll` option, which I use all the time! – Michele Jun 16 '13 at 16:53
  • Do you mean memory? In each run of `main` there are probably 15 to 20 of these copies. But for the next customer there is a new function call (`main` is called by `adply` against a cust.detail table, once per row), so I thought that (besides the value `main` returns) everything is discarded and no copies are left around. Am I right? – Michele Jun 16 '13 at 17:03
  • No, I mean speed. Making copies costs time. If that is relevant, depends on your use case. – Roland Jun 16 '13 at 17:08
  • ok thanks again. Speed is definitely important, I'll try to 'reshape' some function calls. you've been more than helpful!! thanks a million! – Michele Jun 16 '13 at 17:15
  • I tried running the simple example. Strangely, neither part generated a warning message. – Ye Tian Sep 04 '17 at 18:28
  • This answer is several years old. The package has been continuously developed and improved. – Roland Sep 04 '17 at 19:28

You could make a "shallow copy" while creating the list, so that 1) you don't make a full memory copy (speed isn't affected) and 2) you don't get the internal selfref warning (thanks to @mnel for this trick).

Creating data:

set.seed(45)
ss <- function() {
    tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)

How you should go about creating the list (shallow copy):

system.time( {
    ll <- list(d1 = { # shallow copy here...
        data.table:::settruelength(tt, 0)
        invisible(alloc.col(tt))
    }, "a")
})
   user  system elapsed
      0       0       0
> system.time(tt[, bla := 2])
   user  system elapsed
  0.012   0.000   0.013
> system.time(ll[[1]][, bla :=2 ])
   user  system elapsed
  0.008   0.000   0.010

So you don't compromise in speed and you don't get a warning followed by a full copy. Hope this helps.
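
Whether `list()` copies at all seems to depend on the R and data.table versions in use (a later comment on another answer reports no warning in 2017). A small diagnostic sketch, assuming only the exported helpers `address()` and `truelength()` from data.table, can show what happens on a given setup. It deliberately does not assume either outcome:

```r
library(data.table)

tt <- data.table(a = 1:5)
ll <- list(d1 = tt, "a")

# Same address means the list element is the original object, not a copy.
same_object <- identical(address(tt), address(ll$d1))
print(same_object)

# := needs spare (over-allocated) column slots to add a column by reference;
# comparing length() with truelength() shows whether they are still there.
print(c(length = length(ll$d1), truelength = truelength(ll$d1)))
```

If `same_object` comes out `FALSE` or the true length is not larger than the length, the list element was copied and the first `:=` on it is the point where the warning would be expected.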

Arun
  • Even though I had already the answer (the reason of the warning) this provides the best and more general (since I may need to create `dt` and _then_ put it in a list) way to work with `data.table` objects inside and outside `list`s. I wish I could upvote more than once :-) – Michele Jun 20 '13 at 09:45

"Invalid .internal.selfref detected and fixed by taking a copy..."

There is no need to make a copy when assigning id2 within f2(). You can add the column directly by changing:

# From:

      x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]

# To something along the lines of:
      x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)

Then you can continue to use your 'x' data.table as normal without incurring a warning.
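
To see why the `findInterval` + `match` expression gives the group index, here is the same computation broken into steps in base R. The vector names `ids`, `pos` and `starts` are only for illustration, and `lengths()` (R 3.2 or later) stands in for `sapply(groups, length)`:

```r
groups <- list(c("aaaa", "aaas"), "bbbb")
ids    <- c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb")

# Position of each id in the flattened group list: aaaa = 1, aaas = 2, bbbb = 3.
pos <- match(ids, unlist(groups))

# Flattened position at which each group starts, plus one past the end: 1, 3, 4.
starts <- cumsum(c(0, lengths(groups))) + 1

# For each position, findInterval() returns the index of the last start
# that does not exceed it, which is the index of the containing group.
id2 <- findInterval(pos, starts)
print(id2)
```

Both `match` and `findInterval` are vectorised over the whole column, so no per-group function call is involved.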

Also, to simply suppress the warning you can use suppressWarnings() around the f2(x[["res"]]) call.

Even on small tables there can be a substantial performance difference:

Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100
Thell
  • thanks for this option. I'll check the performance of your method. – Michele Jun 16 '13 at 17:22
  • interesting, so `findInterval`+`match` is 2X faster than `vapply`+`==`. That was helpful, a lot. Thanks! – Michele Jun 16 '13 at 18:15
  • Credit where credit is due - upvote the answer where I first learned of this: http://stackoverflow.com/a/11002456/173985 – Thell Jun 16 '13 at 18:25
  • already upvoted your answer :-) I accepted the one of @Roland because it answered my actual question: why did I get the warning. – Michele Jun 16 '13 at 18:30