15

I am currenlty computing glm models off a huge data data set. Both glm and even speedglm take days to compute.

I currently have around 3M observations and altogether 400 variables, only some of which are used for the regression. In my regression I use 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as factor (iv5), the interaction term (x * y, where x is an integer and y is a binary dummy variable as factor). Finally, I have fixed effects along years ff1 and company ids ff2. I have 15 years and 3000 conmpanies. I have introduced the fixed effects by adding them as factors. I observe that especially the 3000 company fixed effects make the computation very slow in stats glm and also speedglm.

I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as this can address more threads and processor cores. Indeed, the speed of analysis is a lot faster. Also, I compared the results for a sub-sample to the one of standard glm and they matched.

I used the following function:

mod1 <- rxGlm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat,
                    dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)

However, I am facing a problem when trying to plot the interaction term using the effects package. Upon calling the following function, I am receiving the following error:

> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute

I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm oject, hence likely containing less data (80 MB vs several GB).

I now tried to convert the rxGlm object to glm via as.glm(). Still, the effects() call does not yield a result and results in the following error messages:

Error in dnorm(eta) : 
  Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L,  :
  variable 'x for y' is absent, its contrast will be ignored

If I now compare an original glm to the "converted glm", I find that the converted glm contains a lot less items. E.g., it does not contain effects and for contrasts it states only contr.treatment for each variable.

I am now looking primarily for a way to transpose the rxGlm output object in a format so I can use if with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quick, however, I have not yet found a way how to get typical interaction effect plots out of it. I want to avoid first calculating the full glm model and then plot because this takes very long.

deca
  • 730
  • 1
  • 8
  • 24
  • 1
    How big is the dataset, and do you have a data sample? – Technophobe01 Nov 05 '17 at 23:07
  • @Technophobe01 I have added more information on the data right into the question (second paragraph). Given that my question is more about amount of data and less about a specific data problem, I think posting a sample of my data would make little sense. Posting the data set in its entirity would not be possible due to confidentiality and size of the file. – deca Nov 06 '17 at 16:20
  • If each variable is 40 kilobytes and you have 400 variables and 3,000,000 million observations you're dealing with roughly 48Tb of data. Correct? Bigger or Smaller? My point is your output of 80mb doesn't sound quite right to the dataset size. Caveat - a rule of thumb estimate. – Technophobe01 Nov 06 '17 at 17:08
  • Let me see if I can pull an example together using a sizable public data set. Might be fun to a comparison between MSFT R, Tibco TERR and Open R. – Technophobe01 Nov 06 '17 at 17:08
  • No, the data set is smaller. If loaded into the R workspace the data frame is around 9gb. The stats glm on this data is around 40gb. The rxGlm file only a couple of MB. – deca Nov 06 '17 at 17:19
  • [Smiles] - I'd remove "huge" form you dataset description. Don't take that the wrong way I am chuckling as I say it. Joking aside - thanks for updating the answer. – Technophobe01 Nov 06 '17 at 17:28
  • No worries :) - I’m well aware there are way larger data sets than that. The thing is even in a cloud instance it takes 1.5 weeks to compute the GLM which is why I’m trying to handle somehow – deca Nov 06 '17 at 17:38
  • Quick question what is your OS preference Windows, Ubuntu, Centos, Redhat or MacOS X? – Technophobe01 Nov 06 '17 at 17:42
  • I’m working with Windows 64 bit – deca Nov 06 '17 at 17:45
  • Try h2o package. – Slouei Jul 17 '19 at 17:35

1 Answers1

1

If you can get the coefficients can't you just roll your own? This would not be a dataset size issue

# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE), 
                   iv1 = sample(1:10, size = n, rep = TRUE),
                   iv2 = sample(1:10, size = n, rep = TRUE),
                   iv3 = sample(1:10, size = n, rep = TRUE),
                   iv4 = sample(0:10, size = n, rep = TRUE),
                   iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
                   x = sample(1:100, size = n, rep = TRUE),
                   y = as.factor(sample(0:1, size = n, rep = TRUE)),
                   ff1  = as.factor(sample(1:15, size = n, rep = TRUE)),
                   ff2  = as.factor(sample(1:100, size = n, rep = TRUE))
                   )

mod1 <- glm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat)

# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']

x <- 1:100
a <- x1*x
b <- x1*x + y1 + xy*x

plot(a~x, type= 'line', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'))

here is how to make a reproduceable

user3357177
  • 355
  • 2
  • 9