1

I’m trying to write simulation code, that generates data and runs t-test selection (discarding those predictors whose t-test p-value exceeds 0.05, retaining the rest) on it. The simulation is largely an adaptation of Applied Econometrics with R by Kleiber and Zeileis (2008, pp. 183–189).

When running the code, it usually fails. Yet with certain seeds (e.g. 1534) it produces plausible output. If it does not produce output (e.g. 1911), it fails due to: "Error in x[, ii] : subscript out of bounds", which traces back to na.omit.data.frame(). So, for some reason, the way I attempt to handle the NAs seems to fail, but I'm unable to figure out in how so.

  coef <- rep(coef[,3], length.out = pdim+1)
  err <- as.vector(rnorm(nobs, sd = sd))
  uX <- c(rep(1, times = nobs))
  pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
  X <- cbind(uX, pX)
  y <- coef %*% t(X) + err
  y <- matrix(y)

  tTp <- (summary(lm(y ~ pX)))$coefficients[,4]  
  tTp <- tTp[2:length(tTp)]
  TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp))))

  tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
  for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
  tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs)
  TTR <- lm(y ~ tX)

The first block is unlikely to the cause of the error. It merely generates the data and works well on its own and with other methods, like PCA, as well. The second block pulls the p-values from the regression output; removes the p-value of the intercept (beta_0); and fills the vector with as many 7s as necessary to have the same length as the number of variables, to ensure the same dimension for matrix calculations. Seven is arbitrary and could be any number larger than 0.05 to not pass the test of the loop. This becomes – I believe – necessary, if R discards predictors due to multicollinearity.

The final block creates an empty matrix of the original dimensions; inserts the original data, if the t-test p-value is lower than 0.05, else retains the NA; while the penultimate line removes all columns containing NAs ((exclusively NA or one NA is the same here) taken from mnel’s answer to Remove columns from dataframe where ALL values are NA); lastly, the modified data is again put in the shape of a linear regression.

Does anyone know what causes this behavior or how it would work as intended? I would expect it to either work or not, but not kind of both. Ideally, the former.

A working version of the code is:

set.seed(1534)
Sim_TTS  <- function(nobs = c(1000, 15000), pdim = pdims, coef = coef100, 
    model = c("MLC", "MHC"), ...){
 DGP_TTS <- function(nobs = 1000, model = c("MLC", "MHC"), coef = coef100, 
     sd = 1, pdim = pdims, ALPHA = 0.05)
 {
  model <- match.arg(model)
  if(model == "MLC") {
   coef <- rep(coef[,1], length.out = pdim+1)
   err <- as.vector(rnorm(nobs, sd = sd))
   uX <- c(rep(1, times = nobs))
   pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
   X <- cbind(uX, pX)
   y <- coef %*% t(X) + err
   y <- matrix(y)

   tTp <- (summary(lm(y ~ pX)))$coefficients[,4]  
   tTp <- tTp[2:length(tTp)]
   TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp)))) 

   tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX)) 
   for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
   tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs) 
   TTR <- lm(y ~ tX) 
   } else {
   coef <- rep(coef[,2], length.out = pdim+1)
   err <- as.vector(rnorm(nobs, sd = sd))
   uX <- c(rep(1, times = nobs))
   pX <- matrix(scale(rnorm(nobs)), byrow = TRUE, ncol = pdim, nrow = nobs)
   X <- cbind(uX, pX)
   y <- coef %*% t(X) + err
   y <- matrix(y)

   tTp <- (summary(lm(y ~ pX)))$coefficients[,4]  
   tTp <- tTp[2:length(tTp)]
   TTT <- matrix(c(tTp, rep(.7, ncol(pX)-length(tTp))))

   tX <- matrix(NA, ncol = ncol(pX), nrow = nrow(pX))
   for(i in 1:ncol(pX)) {ifelse(TTT[i,] < ALPHA, tX[,i] <- pX[,i], NA)}
   tX <- matrix(Filter(function(x)!all(is.na(x)), tX), nrow = nobs)
   TTR <- lm(y ~ tX)
   }
  return(TTR)
  }
  PG_TTS <- function(nrep = 1, ...)
  {
   rsq <- matrix(rep(NA, nrep), ncol = 1)
   rsqad <- matrix(rep(NA, nrep), ncol = 1)
   pastr <- matrix(rep(NA, nrep), ncol = 1)
   vmat <- cbind(rsq, rsqad, pastr)
   colnames(vmat) <- c("R sq.", "adj. R sq.", "p*")
   for(i in 1:nrep) {
     vmat[i,1] <- summary(DGP_TTS(...))$r.squared
     vmat[i,2] <- summary(DGP_TTS(...))$adj.r.squared
     vmat[i,3] <- length(DGP_TTS(...)$coefficients)-1
     }
   return(c(mean(vmat[,1]), mean(vmat[,2]), round(mean(vmat[,3]))))
  }
  SIM_TTS <- function(...)
  {
   prs <- expand.grid(pdim = pdim, nobs = nobs, model = model)
   nprs <- nrow(prs)

   pow <- matrix(rep(NA, 3 * nprs), ncol = 3)
   for(i in 1:nprs) pow[i,] <- PG_TTS(pdim = prs[i,1],
       nobs = prs[i,2], model = as.character(prs[i,3]), ...)

   rval <- rbind(prs, prs, prs)
   rval$stat <- factor(rep(1:3, c(nprs, nprs, nprs)),
       labels = c("R sq.", "adj. R sq.", "p*"))
   rval$power <- c(pow[,1], pow[,2], pow[,3])
   rval$nobs <- factor(rval$nobs)
   return(rval)
  }

 psim_TTS <- SIM_TTS()
 tab_TTS <- xtabs(power ~ pdim + stat + model + nobs, data = psim_TTS)
 ftable(tab_TTS, row.vars = c("model", "nobs", "stat"), col.vars = "pdim")}

 FO_TTS <- Sim_TTS()
 FO_TTS
}

Preceeded by:

pdims <- seq(12, 100, 4)
coefLC12 <- c(0, rep(0.2, 4), rep(0.1, 4), rep(0, 4))/1.3
rtL <- c(0.2, rep(0, 3))/1.3
coefLC100 <- c(coefLC12, rep(rtL, 22))
coefHC12 <- c(0, rep(0.8, 4), rep(0.4, 4), rep(0, 4))/1.1
rtH <- c(0.8, rep(0, 3))/1.1
coefHC100 <- c(coefHC12, rep(rtH, 22))
coef100 <- cbind(coefLC100, coefHC100)

I’m aware that model selection via the significance of individual predictors is not recommended, but that is the whole point – it is meant to be compared to more sophisticated methods.

r7cxs
  • 11
  • 3
  • Are you aware of `MuMIn::dredge`? – jsta Mar 22 '18 at 13:29
  • The error seems to be caused by `tX` being a zero-column matrix in `lm(y ~ tX)` when none of the predictors pass the test, i.e. `all(TTT > ALPHA)`. A minimal reprex: `lm(runif(5) ~ matrix(logical(0), nrow = 5))`. You should handle this case separately. – Mikko Marttila Mar 22 '18 at 14:37
  • @jsta: I'm not. Also, from the documentation, it is not clear to me how the dredge function would help. Is it to say that there is an automated way to perform t-test selection via some kind of parameter that would implicitly do what my code is supposed to? – r7cxs Mar 22 '18 at 18:57
  • @MikkoMarttila: It appears improbable that every predictor would fail to pass that test, for the bar is rather low and every model has at the very least four predictors for which it is virtually impossible to fail the attain a p-value of less than 0.05. So, maybe if it would happen rarely, but it happens with most seeds. Nonetheless, thanks for the minimal example that reproduces the error. – r7cxs Mar 22 '18 at 18:57
  • Perhaps my wording was overly careful: that _is_ what's causing the error, at least with seed 1911. You can see that by using `tryCatch` to trigger a browser when the error occurs in the `lm` call and checking `tX`. If you're not expecting this, perhaps there is some other issue with the function that's causing the behaviour? – Mikko Marttila Mar 22 '18 at 20:20
  • @MikkoMarttila Then the problem is larger than I expected, which remains strange since some seeds produce the output one would expect. I guess, I gotta redo the t-test selection part. Thank you for pointing the error out to me. – r7cxs Mar 25 '18 at 07:09

0 Answers0