3

I am trying to run a stepwise regression using AIC (through step) with 1,400 variables, but my computer just freezes. It works if I include <300 variables (after 13 hrs of running).

Is there a way to eliminate some of the variables (if p-value >.7) before I run the stepwise regression?

# Polynomial Regression
REG19 <- lm(R10 ~ poly(M1, 3) + poly(M2, 3) + poly(M3, 3), WorkData)

# Is there a way to get rid of variables with 
# p values >.7 at this point of the code?

# Beginning of stepwise regression
n <- length(resid(REG19))
REG20 <- step(REG19, direction="backward", k=log(n))
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Could you add a sample of your data with `dput(head(df,n))`? Choose `n` as you may find sufficient for reproducibility. – NelsonGon Jun 16 '19 at 05:35
  • Do you want to try something like this: `coef(summary(lm("Petal.Length~Sepal.Length+Petal.Width",data = iris))) %>% as_tibble() %>% filter(bac_tick_here_Pr(>|t|) another_back_tick_here < 0.00000002)`? – NelsonGon Jun 16 '19 at 05:42
  • 1
    I strongly urge you to consult a statistician. Your basic approach is already very dubious but your p-value cut-off makes it even worse. – Roland Jun 16 '19 at 08:42

1 Answers1

3

What you probably want is to exclude anything about the highest polynom where p <= .7 (lower degrees should be kept). Supposed you know what you're doing, you could write a function degAna() that analyzes the degrees of each polynomial and apply it to the coefficients matrix obtained by summary.

REG19 <- lm(R10 ~ poly(M1, 3) + poly(M2, 3) + poly(M3, 3) + poly(M4, 3) +
              poly(M5, 3) + poly(M6, 3) + poly(M7, 3) + poly(M8, 3) + 
              poly(M9, 3) + poly(M10, 3), WorkData)

rr <- summary(REG19)$coefficients

The function that detects highest degree with p <= .7:

degAna <- function(d) {
  out <- as.matrix(rr[grep(paste0(")", d), rownames(rr)), "Pr(>|t|)"] <= .7)
  dimnames(out) <- list(c(gsub("^.*\\((.*)\\,.+", "\\1", rownames(out))), d)
  return(out)
}

lapply degAna to coefficients matrix:

dM <- do.call(cbind, lapply(1:3, degAna))  # max. degree always 3 as in example
#         1     2     3
# M1   TRUE  TRUE  TRUE
# M2   TRUE  TRUE  TRUE
# M3  FALSE  TRUE  TRUE
# M4   TRUE  TRUE  TRUE
# M5   TRUE  TRUE  TRUE
# M6   TRUE FALSE  TRUE
# M7   TRUE FALSE FALSE
# M8   TRUE  TRUE  TRUE
# M9   TRUE  TRUE FALSE
# M10  TRUE FALSE  TRUE

Now we need the last degree of the polynomials where p <= .7:

tM <- apply(dM, 1, function(x) max(which(x != 0)))
tM <- tM[tM > 0]  # excludes polynomes where every p < .7
# M1  M2  M3  M4  M5  M6  M7  M8  M9 M10 
#  3   3   3   3   3   3   1   3   2   3 

(Note, that the apply will throw a warning if a polynomial has completely p <= .7, i.e. row is completely FALSE. Since we throw them out in the next line we can ignore the warning with apply(dM, 1, function(x) suppressWarnings(max(which(x != 0)))).)

With this information we can cobble together a new formula with reformulate,

terms.new <- paste0("poly(", names(tM), ", ", tM, ")")
FO <- reformulate(terms.new, response="R10")
# R10 ~ poly(M1, 3) + poly(M2, 3) + poly(M3, 3) + poly(M4, 3) + 
#     poly(M5, 3) + poly(M6, 3) + poly(M7, 1) + poly(M8, 3) + poly(M9, 
#     2) + poly(M10, 3)

with which we finally can do the desired shortened regression.

REG19.2 <- lm(FO, WorkData)

n <- length(resid(REG19.2))
REG20.2 <- step(REG19.2, direction="backward", k=log(n))
# [...]

Simulated Data

set.seed(42)
M1 <- rnorm(1e3)
M2 <- rnorm(1e3)
M3 <- rnorm(1e3)
M4 <- rnorm(1e3)
M5 <- rnorm(1e3)
M6 <- rnorm(1e3)
M7 <- rnorm(1e3)
M8 <- rnorm(1e3)
M9 <- rnorm(1e3)
M10 <- rnorm(1e3)
R10 <- 6 + 5*M1^3 + 4.5*M2^3 + 4*M3^2 + 3.5*M4 + 3*M5 + 2.5*M6 + 2*M7 + 
  .5*rnorm(1e3, 1, sd=20)
WorkData <- data.frame(M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, R10)
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thank you so much, but I got like 50 rows of this error after this line of code: "tM <- apply(dM, 1, function(x) max(which(x != 0)))"........................"Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -" – RegressionMan Jun 16 '19 at 11:40
  • I did suppressWarnings() and it worked, but is that *legal*? – RegressionMan Jun 16 '19 at 12:21
  • 1
    Yes, I just added this to my answer with some explanation. When you check the output of the concerned line you'll find `-Inf` which indicates that no `x != 0`. We throw this cases out in the next line where we subset with `tM[tM > 0] `. These are just those cases where no polynomial degree had `p<= .7` and will no more be included in the new formula. Ok? – jay.sf Jun 16 '19 at 12:29
  • Thanks, but when one of the variables such as M6 is FALSE for all 3, the reformulate step still puts M6 back in (instead of removing M6 from reformulate step). [link] (https://ibb.co/rv6bBqx). For example, M65=2, but reformate makes M7=2 because M65 is the 7th variable included from the beginning (as the others were removed). – RegressionMan Jun 16 '19 at 13:02
  • 1
    Thanks, that was an error in the `terms.new` line which referred rather back to `dM` instead of `tM`. I've changed it, does it work now? – jay.sf Jun 16 '19 at 13:08
  • 1
    PERFECT! You are THE BEST! – RegressionMan Jun 16 '19 at 14:19