12

Ok, this is a weird one. I suspect this is a bug inside data.table, but it would be useful if anyone can explain why this is happening - what is update doing exactly?

I'm using the list(list()) trick inside data.table to store fitted models. When you create a sequence of lm objects each for different groupings, and then update those models, the model data for all models becomes that of the last grouping. This seems like a reference is hanging around somewhere where a copy should have been made, but I can't find where and I can't reproduce this outside of lm and update.

Concrete example:

Starting with the iris data, first make the three species different sample sizes, then fit an lm model to each species, the update those models:

set.seed(3)
DT = data.table(iris)
DT = DT[rnorm(150) < 0.9]
fit = DT[, list(list(lm(Sepal.Length ~ Sepal.Width + Petal.Length))),
          by = Species]
fit2 = fit[, list(list(update(V1[[1]], ~.-Sepal.Length))), by = Species]

The original data table has different numbers of each species

DT[,.N, by = Species]
#       Species  N
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

And the first fit confirms thsi:

fit[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 41
# 2: versicolor 39
# 3:  virginica 42

But the updated second fit is showing 42 for all models

fit2[, nobs(V1[[1]]), by = Species]
#       Species V1
# 1:     setosa 42
# 2: versicolor 42
# 3:  virginica 42

We can also look at the model attribute which contains the data used for fitting, and see that all the model are indeed using the final groups data. The question is how has this happened?

head(fit$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7
head(fit$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[1]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
head(fit2$V1[[3]]$model)
#   Sepal.Length Sepal.Width Petal.Length
# 1          6.3         3.3          6.0
# 2          5.8         2.7          5.1
# 3          6.3         2.9          5.6
# 4          7.6         3.0          6.6
# 5          4.9         2.5          4.5
# 6          7.3         2.9          6.3
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Corvus
  • 7,548
  • 9
  • 42
  • 68
  • 1
    This behavior can be recreated a little more simply with this: `m1 <- fit$V1[[2]]; m2 <- update(m1, .~.)`; `m1` and `m2` are then different. Perhaps this will help in figuring it out. – Aaron left Stack Overflow Feb 26 '13 at 19:37
  • Thanks, I was trying for ages to simplify this a little – Corvus Feb 26 '13 at 20:12
  • 3
    Have you read http://stackoverflow.com/questions/13690184 - it may be related. – hadley Feb 26 '13 at 23:25
  • @hadley thanks - it definitely looks related. If update is looking for the objects in the "global" scope then I wonder where it is looking when it is inside a `data.table`. It is notable that it is NOT capturing the entire data.table column - only that of the final group. – Corvus Feb 27 '13 at 09:30
  • 1
    I've had a detailed look but I'm baffled too. Have filed it as [bug#2590](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2590&group_id=240&atid=975) for now and will come back to it. Well spotted by the way, and great question layout. – Matt Dowle Mar 01 '13 at 13:58
  • @MatthewDowle -- I've detailed some findings in an answer of sorts. Not sure about a data.table side solution though. – mnel Mar 13 '13 at 04:10

1 Answers1

6

This is not an answer, but is too long for a comment

The .Environment for the terms component is identical for each resulting model

e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE

It appears that data.table is using the same bit of memory (my non-technical term) for each evaluation of j by group (which is efficient). However when update is called, it is using this to refit the model. This will contain the values from the last group.

So, if you fudge this, it will work

fit = DT[, { xx <-list2env(copy(.SD))

             mymodel <-lm(Sepal.Length ~ Sepal.Width + Petal.Length)
             attr(mymodel$terms, '.Environment') <- xx
             list(list(mymodel))}, by= 'Species']





lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
V1 V2 V3
1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
      Species V1
1:     setosa 41
2: versicolor 39
3:  virginica 42

not a long term solution, but at least a workaround.

mnel
  • 113,303
  • 27
  • 265
  • 254