6

I'm trying to do some glm's inside a data.table to produce modelled results split by key factors.

I've been doing this sucessfully for:

  • High level glm

    glm(modellingDF,formula=Outcome~IntCol + DecCol,family=binomial(link=logit))

  • Scoped glm with single columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol ,family=binomial(link=logit))$fitted ), by=variable]

  • Scoped glm with two integer columns

    modellingDF[,list(Outcome, fitted=glm(x,formula=Outcome~IntCol + IntCol2 ,family=binomial(link=logit))$fitted ), by=variable]

But, when I try and do the high level glm inside the scope with my decimal column, it produces this error

Error in model.frame.default(formula = Outcome ~ IntCol + DecCol, data = x,  : 
  variable lengths differ (found for 'DecCol')

I thought perhaps it was due to variable lengths of the partitions, so I tested with a reproducible example:

library("data.table")

testing<-data.table(letters=sample(rep(LETTERS,5000),5000),
                    letters2=sample(rep(LETTERS[1:5],10000),5000), 
                    cont.var=rnorm(5000),
                    cont.var2=round(rnorm(5000)*1000,0),
                    outcome=rbinom(5000,1,0.8)
                    ,key="letters")
testing.glm<-testing[,list(outcome,
                  fitted=glm(x,formula=outcome~cont.var+cont.var2,family=binomial(link=logit))$fitted)
        ),by=list(letters)]

But this did not have the error. I thought maybe it was due to NAs or something but a summary of the data.table modellingDF gives no indication that there should be any issues:

DecCol
Min.   :0.0416
1st Qu.:0.6122
Median :0.7220
Mean   :0.6794
3rd Qu.:0.7840
Max.   :0.9495

nrow(modellingDF[is.na(DecCol),])   # results in 0

modellingDF[,list(len=.N,DecCollen=length(DecCol),IntCollen=length
(IntCol ),Outcomelen=length(Outcome)),by=Bracket]

  Bracket  len DecCollen IntCollen Outcomelen
1:     3-6 39184  39184       39184      39184
2:     1-2 19909  19909       19909      19909
3:       0  9912   9912        9912       9912

Perhaps I'm having a dozy day, but could anyone suggest a solution or a means for digging into this issue further?

Steph Locke
  • 5,951
  • 4
  • 39
  • 77
  • NAs? [R Variable Length Differ when build linear model for residuals](http://stackoverflow.com/questions/14924541/r-variable-length-differ-when-build-linear-model-for-residuals) – zx8754 Sep 25 '13 at 10:18
  • 1
    I considered it, but `sapply(modellingDF, function(x) all(is.na(x)))` returns FALSE for every column – Steph Locke Sep 25 '13 at 10:22
  • Can you make a reproducible example that produces the error? You've shown the error which is good, but not what produces it, iiuc. – Matt Dowle Sep 25 '13 at 10:33
  • I was just trying to add a reproducible example with some dput results for you and noticed something rather strange - when I change the columns name away from the actual column name it actually works. `modellingDF[sample(1:nrow(modellingDF),200),list(IntCol=Age,IntCol2=Score,Outcome,abc=LTV),key=Bracket]` works, but `modellingDF[sample(1:nrow(modellingDF),200),list(IntCol=Age,IntCol2=Score,Outcome,LTV),key=Bracket]` won't. I thought perhaps I had a variable called LTV but nope, plus data.table should take internal variables in preference – Steph Locke Sep 25 '13 at 10:56
  • 1
    What is `x` in your example? (ie `glm(x, formula = ...)`. Generally you need to reference `.SD` as the `data` argument for the correct environment to be used. – mnel Sep 25 '13 at 11:04
  • Or more pertinently what is `x` in your global environment? – mnel Sep 25 '13 at 11:11
  • I thought x was the carry through alias for the subsets of data.table i.e. used here for putting the data into the glm for each Bracket value. I might be wholly wrong though! – Steph Locke Sep 25 '13 at 11:13
  • ah, `.SD` will go away and try that as, like you say, there may be some issues with the X value. Cheers for the clear up! – Steph Locke Sep 25 '13 at 11:20
  • @mnel, if you add an answer to the effect that I should be using .SD as opposed to x, to correctly pass values through then I'll mark your answer as fixing the issue – Steph Locke Sep 25 '13 at 11:26

2 Answers2

8

You need to correctly specify the data argument within glm. Inside a data.table (using [), this is referenced by .SD. (see create a formula in a data.table environment in R for related question)

So

modellingDF[,list(Outcome, fitted = glm(data = .SD, 
  formula = Outcome ~ IntCol ,family = binomial(link = logit))$fitted),
 by=variable]

will work.

While in this case (simply extracting the fitted values and moving on), this approach is sound, using data.table and .SD can get in a mess of environments if you are saving the whole model and then attempting to update it (see Why is using update on a lm inside a grouped data.table losing its model data?)

Community
  • 1
  • 1
mnel
  • 113,303
  • 27
  • 265
  • 254
  • 2
    This answer is a bit outdated. `modellingDF[,.(Outcome, fitted = glm(Outcome ~ IntCol ,family = binomial)$fitted), by=variable]` should work and is much cleaner. – MichaelChirico Dec 21 '15 at 04:23
0

In addition to @mnel's answer, you can avoid problems with NAs in your data by using the appropriate function to extract fitted values and specifying the appropriate na.action in glm:

modellingDF[, list(Outcome, fitted = 
   fitted(glm(data = .SD, 
       formula = Outcome ~ IntCol ,
       family = binomial(link = logit),
       na.action=na.exclude)
   ), by=variable]

This will return an object with fitted values of the same size as the original data, preserving the NAs but excluding them from the model estimation.

altabq
  • 1,322
  • 1
  • 20
  • 33