R - using glm inside a data.table

Question

I'm trying to do some glm's inside a data.table to produce modelled results split by key factors.

I've been doing this sucessfully for:

High level glm

glm(modellingDF,formula=Outcome~IntCol + DecCol,family=binomial(link=logit))
Scoped glm with single columns

modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol ,family=binomial(link=logit))$fitted ), by=variable]
Scoped glm with two integer columns

modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol + IntCol2 ,family=binomial(link=logit))$fitted ), by=variable]

But, when I try and do the high level glm inside the scope with my decimal column, it produces this error

Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x,  : 
  variable lengths differ (found for 'DecCol')

I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:

library("data.table")

testing<-data.table(letters=sample(rep(LETTERS,5000),5000),
                    letters2=sample(rep(LETTERS[1:5],10000),5000), 
                    cont.var=rnorm(5000),
                    cont.var2=round(rnorm(5000)*1000,0),
                    outcome=rbinom(5000,1,0.8)
                    ,key="letters")
testing.glm<-testing[,list(outcome,
                  fitted=glm(x,formula=outcome~cont.var+cont.var2,family=binomial(link=logit))$fitted)
        ),by=list(letters)]

But this did not have the error. I thought maybe it was due to NAs or something but a summary of the data.table modellingDF gives no indication that there should be any issues:

DecCol
Min.   :0.0416
1st Qu.:0.6122
Median :0.7220
Mean   :0.6794
3rd Qu.:0.7840
Max.   :0.9495

nrow(modellingDF[is.na(DecCol),])   # results in 0

modellingDF[,list(len=.N,DecCollen=length(DecCol),IntCollen=length
(IntCol ),Outcomelen=length(Outcome)),by=Bracket]

  Bracket  len DecCollen IntCollen Outcomelen
1:     3-6 39184  39184       39184      39184
2:     1-2 19909  19909       19909      19909
3:       0  9912   9912        9912       9912

Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?

NAs? [R Variable Length Differ when build linear model for residuals](http://stackoverflow.com/questions/14924541/r-variable-length-differ-when-build-linear-model-for-residuals) — zx8754, Sep 25 '13 at 10:18
I considered it, but `sapply(modellingDF, function(x) all(is.na(x)))` returns FALSE for every column — Steph Locke, Sep 25 '13 at 10:22
Can you make a reproducible example that produces the error? You've shown the error which is good, but not what produces it, iiuc. — Matt Dowle, Sep 25 '13 at 10:33
I was just trying to add a reproducible example with some dput results for you and noticed something rather strange - when I change the columns name away from the actual column name it actually works. `modellingDF[sample(1:nrow(modellingDF),200),list(IntCol=Age,IntCol2=Score,Outcome,abc=LTV),key=Bracket]` works, but `modellingDF[sample(1:nrow(modellingDF),200),list(IntCol=Age,IntCol2=Score,Outcome,LTV),key=Bracket]` won't. I thought perhaps I had a variable called LTV but nope, plus data.table should take internal variables in preference — Steph Locke, Sep 25 '13 at 10:56
What is `x` in your example? (ie `glm(x, formula = ...)`. Generally you need to reference `.SD` as the `data` argument for the correct environment to be used. — mnel, Sep 25 '13 at 11:04
I thought x was the carry through alias for the subsets of data.table i.e. used here for putting the data into the glm for each Bracket value. I might be wholly wrong though! — Steph Locke, Sep 25 '13 at 11:13
ah, `.SD` will go away and try that as, like you say, there may be some issues with the X value. Cheers for the clear up! — Steph Locke, Sep 25 '13 at 11:20
@mnel, if you add an answer to the effect that I should be using .SD as opposed to x, to correctly pass values through then I'll mark your answer as fixing the issue — Steph Locke, Sep 25 '13 at 11:26

score 8 · Accepted Answer · edited May 23 '17 at 11:52

You need to correctly specify the data argument within glm. Inside a data.table (using [), this is referenced by .SD. (see create a formula in a data.table environment in R for related question)

So

modellingDF[,list(Outcome, fitted = glm(data = .SD, 
  formula = Outcome ~ IntCol ,family = binomial(link = logit))$fitted),
 by=variable]

will work.

While in this case (simply extracting the fitted values and moving on), this approach is sound, using data.table and .SD can get in a mess of environments if you are saving the whole model and then attempting to update it (see Why is using update on a lm inside a grouped data.table losing its model data?)

This answer is a bit outdated. `modellingDF[,.(Outcome, fitted = glm(Outcome ~ IntCol ,family = binomial)$fitted), by=variable]` should work and is much cleaner. — MichaelChirico, Dec 21 '15 at 04:23

score 0 · Answer 2 · answered Oct 22 '20 at 09:50

In addition to @mnel's answer, you can avoid problems with NAs in your data by using the appropriate function to extract fitted values and specifying the appropriate na.action in glm:

modellingDF[, list(Outcome, fitted = 
   fitted(glm(data = .SD, 
       formula = Outcome ~ IntCol ,
       family = binomial(link = logit),
       na.action=na.exclude)
   ), by=variable]

This will return an object with fitted values of the same size as the original data, preserving the NAs but excluding them from the model estimation.

R - using glm inside a data.table

2 Answers2

Linked