
I want to use R to estimate a regression with a very large number of fixed effects.

I then want to use that regression to predict on a test data set.

However, this needs to be done very quickly because I want to bootstrap my standard errors and do this many times.

I know the lfe package in R can do this. For example:

library(lfe)
reg <- felm(Y ~ 1 | F1 + F2, data = dat)

where dat is the data and F1, F2 are columns of categorical variables (the fixed effects to be included).

However, predict(reg, dat2) does not work with the lfe package, as has been discussed here.

Unfortunately lm is too slow, as I have a very large number of fixed effects.

wolfsatthedoor
  • What is your question? If you are looking for another package or resource that would do this faster than `lm`, that seems *off-topic* – Kevin Arseneau Jan 23 '18 at 23:37

1 Answer

The way to speed this up is to extract the coefficients and perform the matrix operations manually. E.g.:

xtrain <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
xtest <- data.frame(x1=jitter(1:1000), x2=runif(1000), x3=rnorm(1000))
y <- -(1:1000)
fit <- lm(y ~ x1 + x2 + x3, data=xtrain)

beta <- matrix(coefficients(fit), nrow=1)
xtest_mat <- t(data.matrix(cbind(intercept=1, xtest)))
predictions <- as.vector(beta %*% xtest_mat)

library(microbenchmark)
microbenchmark(as.vector(beta %*% xtest_mat),
               predict(fit, newdata = xtest))

Unit: microseconds
                          expr     min       lq      mean  median      uq      max neval cld
 as.vector(beta %*% xtest_mat)   8.140  10.0690  13.12173  12.372  15.852   26.292   100  a 
 predict(fit, newdata = xtest) 635.413 657.2515 745.94840 673.009 763.166 2363.065   100   b

So you can see that direct matrix multiplication is roughly 50x faster than the predict function (median of about 12 vs. 673 microseconds).

thc
  • Sorry this does not answer the question. F1 and F2 are categorical variables. Fitting the lm with a high number of categorical variables is too costly, that was the point to begin with. – wolfsatthedoor Feb 03 '18 at 03:29
  • You didn't provide a reproducible example. You also missed the point: doing the predict operation manually will speed things up, regardless of whether you use felm or lm. – thc Feb 03 '18 at 05:55
  • You are correct, I did not provide a reproducible example. But you need the fixed effects (factor coefficients) to predict, regardless of whether you do it manually or not. I found a solution which involves getfe(felm_object), but again lm is too slow; that was the whole point. It's not that predict is slow, it's that a regression with a large number of factors is slow. – wolfsatthedoor Feb 03 '18 at 05:57
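The getfe() route mentioned in the last comment can be sketched in base R as a table lookup. This is only an illustration under stated assumptions: fe_tab is a hand-written stand-in for the data frame that lfe's getfe() is understood to return (with columns fe, idx and effect), and dat2 and the helper lookup_fe are hypothetical names, not part of lfe.

```r
# Hypothetical stand-in for the output of getfe(reg): one row per factor level,
# with the factor name (fe), the level (idx) and its estimated effect.
fe_tab <- data.frame(
  fe     = c("F1", "F1", "F1", "F2", "F2"),
  idx    = c("a", "b", "c", "x", "y"),
  effect = c(1.0, 2.0, 3.0, 0.0, 0.5),
  stringsAsFactors = FALSE
)

# Hypothetical test data; level "d" of F1 was never seen in training.
dat2 <- data.frame(
  F1 = c("a", "c", "b", "d"),
  F2 = c("x", "y", "y", "x"),
  stringsAsFactors = FALSE
)

# Look up the estimated effect of one factor for every test row.
# Levels that are missing from the table give NA.
lookup_fe <- function(fe_tab, name, values) {
  sub <- fe_tab[fe_tab$fe == name, ]
  sub$effect[match(as.character(values), as.character(sub$idx))]
}

# For a model like Y ~ 1 | F1 + F2 the prediction is the sum of the effects.
pred <- lookup_fe(fe_tab, "F1", dat2$F1) + lookup_fe(fe_tab, "F2", dat2$F2)
print(pred)
```

Because match() is vectorised, the lookup stays cheap inside a bootstrap loop. If the model also had ordinary covariates, their contribution (the covariate matrix times the fitted coefficients) would presumably need to be added to pred, as in the matrix-multiplication answer above.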