I have a data.table which I want to split into two. I do this as follows:

dt <- data.table(a=c(1,2,3,3),b=c(1,1,2,2))
sdt <- split(dt,dt$b==2)

but if I want to do something like this as a next step

sdt[[1]][,c:=.N,by=a]

I get the following warning message.

Warning message: In [.data.table(sdt[[1]], , :=(c, .N), by = a) : Invalid .internal.selfref detected and fixed by taking a copy of the whole table, so that := can add this new column by reference. At an earlier point, this data.table has been copied by R. Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects), use reflist() instead if needed (to be implemented). If this message doesn't help, please report to datatable-help so the root cause can be fixed.

Just wondering if there is a better way of splitting the table so that it would be more efficient (and would not get this message)?

jamborta

3 Answers

This works in v1.8.7 (and may work in v1.8.6 too):

> sdt = lapply(split(1:nrow(dt), dt$b==2), function(x)dt[x])
> sdt
$`FALSE`
   a b
1: 1 1
2: 2 1

$`TRUE`
   a b
1: 3 2
2: 3 2

> sdt[[1]][,c:=.N,by=a]     # now no warning
> sdt
$`FALSE`
   a b c
1: 1 1 1
2: 2 1 1

$`TRUE`
   a b
1: 3 2
2: 3 2

But, as @mnel said, that's inefficient. Please avoid splitting if possible.
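One way to avoid the split for the operation in the question is to select the rows in `i` and group in `by` within a single call. This is only a minimal sketch, assuming the goal is to add the per-group count `c` to the rows where `b` is not 2; the remaining rows are left as `NA`.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Subset in i, group in by, assign by reference in j.
# Rows with b == 2 are not touched, so their value of c stays NA.
dt[b != 2, c := .N, by = a]
```

Because `:=` works on `dt` itself, no copy is taken and the `.internal.selfref` warning does not come up.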

Matt Dowle
  • I don't quite understand why it says `invalid .internal.selfref`, as when I do `attributes(sdt[[1]])$.internal.selfref`, the value seems to be the same as the one for `dt` (and the same on `dt2 <- copy(dt)`). Any thoughts? – Arun Feb 20 '13 at 11:29
  • @Arun Exactly, that's why it's invalid. It's supposed to point to itself when valid. If you look at `.Internal(inspect(sdt[[1]]))` you should see its pointer address is different (a copy was taken). That's what `.internal.selfref` is designed to detect. The problem with the copy isn't so much the copy per se, but that when R does that copy it doesn't maintain the over-allocated vector of column pointers. Hence the warning when `:=` tries to add a new column (it has to over-allocate again) in case you have two bindings to the same object. All correct and intended. – Matt Dowle Feb 20 '13 at 11:38
  • @Arun So the warning is trying to say: don't use `base::split`; find some other way, such as my answer, to do the split. – Matt Dowle Feb 20 '13 at 11:45
  • @Arun There is also `data.table:::selfrefok(sdt[[1]])`, which checks whether `.internal.selfref` is valid or not. It returns 0/1. Deliberately not exported, as it's just intended for debugging/inspecting. – Matt Dowle Feb 20 '13 at 11:48

I was looking for a way to do a split in data.table when I came across this old question.

Sometimes a split is what you want to do, and the data.table "by" approach is not convenient.

Actually, you can easily do the split by hand with data.table-only instructions, and it works very efficiently:

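# Splits dt into a list of data.tables, one per run of consecutive equal
# values in column `attr`. dt should be sorted by `attr` beforehand;
# otherwise a value that reappears later starts a separate piece.
SplitDataTable <- function(dt, attr) {
  if (nrow(dt) == 0L) return(list())
  x <- dt[[attr]]
  # Last row of each run: positions where the value changes, plus nrow(dt)
  boundaries <- c(0, which(head(x, -1) != tail(x, -1)), nrow(dt))
  mapply(
    function(start, end) dt[start:end, ],
    head(boundaries, -1) + 1,
    tail(boundaries, -1),
    SIMPLIFY = FALSE)
}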
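Note that the boundary trick above cuts wherever the value changes, so it gives one piece per value only when the table is already ordered by the splitting column. The sketch below shows one way to handle unordered input by sorting a copy first. The helper name `split_by_runs` and the sample data are assumptions for illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical helper: sort a copy by `attr`, then cut at value changes.
split_by_runs <- function(dt, attr) {
  sorted <- setorderv(copy(dt), attr)   # copy() leaves the caller's table untouched
  n <- nrow(sorted)
  if (n == 0L) return(list())
  x <- sorted[[attr]]
  ends <- c(which(x[-n] != x[-1L]), n)    # last row of each run
  starts <- c(1L, head(ends, -1L) + 1L)   # first row of each run
  Map(function(s, e) sorted[s:e], starts, ends)
}

# b is deliberately not sorted here
dt <- data.table(a = c(1, 2, 3, 3), b = c(2, 1, 2, 1))
pieces <- split_by_runs(dt, "b")
```

Sorting costs extra time and changes the row order within the result, so it is only worth doing when the input order cannot be relied on.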
haltux

As mentioned above (@jangorecki), the data.table package already has its own function for splitting. In this simplified case we can use:

> dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))
> split(dt, by = "b")
$`1`
   a b
1: 1 1
2: 2 1

$`2`
   a b
1: 3 2
2: 3 2

For more difficult/concrete cases, I would recommend creating a new variable in the data.table using the by-reference functions := or set, and then calling split on it. If you care about performance, make sure to always stay within data.table, e.g., dt[, SplitCriteria := (...)], rather than computing the splitting variable externally.
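As a rough sketch of that recommendation applied to the question's condition (`b == 2`): the column name `is_two` is made up for illustration, and this assumes a data.table version that ships the `split` method with a `by` argument.

```r
library(data.table)

dt <- data.table(a = c(1, 2, 3, 3), b = c(1, 1, 2, 2))

# Add the splitting variable by reference (is_two is an illustrative name).
dt[, is_two := b == 2]

# data.table's split method returns a named list of data.tables;
# keep.by = FALSE drops the helper column from each piece.
sdt <- split(dt, by = "is_two", keep.by = FALSE)

# Each piece can then be modified by reference.
sdt[["FALSE"]][, c := .N, by = a]
```

The list is named after the values of the splitting column, here "FALSE" and "TRUE", which matches the shape of the result in the question.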