
Example arbitrary df:

Dates        NY       CA       OH       MA
2018-01-01   9073     4564     2342     5645
2018-01-02   2342     4565     3453     5675
2018-01-03   1234     7567     5345     6877
2018-01-04   1231     3545     3453     7686
2018-01-05   4512     4564     3453     6787
.....        ....     ....     ....     ....
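
For a reproducible example, the five rows shown above can be rebuilt directly in R (the remaining rows are omitted):

df <- data.frame(
  Dates = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03",
                    "2018-01-04", "2018-01-05")),
  NY = c(9073, 2342, 1234, 1231, 4512),
  CA = c(4564, 4565, 7567, 3545, 4564),
  OH = c(2342, 3453, 5345, 3453, 3453),
  MA = c(5645, 5675, 6877, 7686, 6787)
)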

I am trying to run an iterative stepwise regression on a df containing >300 variables. I set up an easy df for myself (illustrated above) to practice getting there.

Here is what I wish to achieve:

1) Run a stepwise regression
2) Remove coefficients >= a specified value (choose any for testing, say 0.1)
3) Re-run the stepwise regression
4) Repeat from step 2, or stop once the constraint is satisfied (meaning, stop the loop when all coefficients are within the specified value)

Here are the pieces I have thus far:

1) fit <- step(lm(NY ~ . - Dates, df))

2) names(coef(fit))[which(coef(fit) <= .2)]

So 1) runs the stepwise regression and stores the fitted model, and 2) names the coefficients of that model that are less than or equal to a specified value, say 0.2. How do I combine these so that those variables are removed from the step 1 regression, the stepwise regression is re-run without them, and the process repeats until all coefficients fall within the specified range?

Thank you,

g3lo
    I don't quite understand why you're setting a constraint on the coefficient itself. Shouldn't you have some cutoff for goodness-of-fit criteria instead? What happens if a coefficient is not significant but is greater than 2? Do you still include it? – acylam Jun 11 '18 at 19:23
  • I am just trying to figure out how to do the following. Irrespective of fit, p-value, etc., I want to exclude all coefficients above (or below) some value and repeat the stepwise function until no variable violates the constraint. Are you able to assist? – g3lo Jun 11 '18 at 19:33
  • I understand what you are trying to do. I'm just not convinced that how you're doing it is correct. Since you are trying to minimize the square residuals of your model subject to some (linear) constraints, you might want to look into quadratic programming using the `quadprog` package, instead of stepwise OLS. Check this answer: https://stackoverflow.com/questions/45577591/linear-regression-with-constraints-on-the-coefficients – acylam Jun 11 '18 at 19:54
  • I am relatively new to R and I have no luck in coming close to understanding that code – g3lo Jun 11 '18 at 20:01
  • Perhaps you could provide some context on what you're trying to do and why you think it's a good idea to do it this way? For example, is there any business reason for setting `coef <= 0.2`? What does it mean if a coef is greater than 0.2? – acylam Jun 11 '18 at 20:37
  • Hi, 0.2 is an arbitrary example. We know that data is not explained by such large coefficients and would like to drop them. I want to experiment running a model that would enable the features outlined in my OP. Thank you kindly. – g3lo Jun 11 '18 at 20:39
  • _"We know that data is not explained by such large coefficients"_. How do you know that without running a model in the first place? Is this based on domain knowledge or some models you have previously run? If it's the former, either your theory is wrong, or you have high measurement error in your data. Either way, excluding a variable because it's coefficient has a large magnitude is not the way to go. – acylam Jun 11 '18 at 20:55

1 Answer


Like @useR said in the comments, I don't believe in what you are doing, but if you want an answer with the mechanics of it, here it is.

Note that you do not need `step`; a simple `lm` will produce exactly the same result.

I will use the built-in dataset `mtcars` as an example.

data("mtcars")

response <- "mpg"
fmla <- as.formula(paste(response, ".", sep = "~"))

iter <- 0
fit <- lm(fmla, mtcars)
while(any(abs(coef(fit)) >= 2)){
  iter <- iter + 1                                   # count the refits
  nm <- names(coef(fit))[abs(coef(fit)) < 2]         # variables whose coefficient is small enough to keep
  has_int <- any(grepl("Inter", nm))                 # is the intercept among them?
  nm <- nm[!grepl("Inter", nm)]                      # "(Intercept)" is not a column, so drop it from the term list
  fmla <- if(has_int)
    paste(response, paste(c("1", nm), collapse = "+"), sep = "~")   # "1" keeps the intercept
  else
    paste(response, paste(c("0", nm), collapse = "+"), sep = "~")   # "0" suppresses the intercept
  fmla <- as.formula(fmla)
  fit <- lm(fmla, mtcars)
}

fit
#
#Call:
#  lm(formula = fmla, data = mtcars)
#
#Coefficients:
#  cyl     disp       hp     qsec     carb  
#0.4795  -0.0465   0.0314   1.4152  -0.7596 

If you want to check that `step` is not needed, just replace the `lm` line with

fit <- step(lm(fmla, mtcars))
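
Applied to a data frame shaped like the one in the question (assuming `df` is the example data frame built above, with `NY` as the response, `Dates` excluded from the predictors, and 0.2 as the cutoff), the same loop would look something like this:

df2 <- df[, setdiff(names(df), "Dates")]     # drop the Dates column first
response <- "NY"
fmla <- as.formula(paste(response, ".", sep = "~"))

iter <- 0
fit <- lm(fmla, df2)
while(any(abs(coef(fit)) >= 0.2)){
  iter <- iter + 1
  nm <- names(coef(fit))[abs(coef(fit)) < 0.2]
  has_int <- any(grepl("Inter", nm))
  nm <- nm[!grepl("Inter", nm)]
  fmla <- if(has_int)
    paste(response, paste(c("1", nm), collapse = "+"), sep = "~")
  else
    paste(response, paste(c("0", nm), collapse = "+"), sep = "~")
  fmla <- as.formula(fmla)
  fit <- lm(fmla, df2)
}

iter    # number of times the model was refit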
Rui Barradas
  • @useR Thanks for the edit, it was a last minute change. – Rui Barradas Jun 11 '18 at 20:16
  • Thank you kindly. I have tried it on my data and `nm` is coming up as null. Trying to understand the formula to see what is causing it. I played around with < and >= numbers to no avail – g3lo Jun 11 '18 at 20:24
  • @g3lo See if `which(coef(fit) < 2)` solves it. If not, before assigning to `nm`, try to `print(names(coef(fit)))` and `print(which(coef(fit) < 2))` in order to see what is going on. – Rui Barradas Jun 11 '18 at 20:42
  • fit comes out to be 'no coefficients' – g3lo Jun 11 '18 at 20:45
  • @g3lo That means that `all(coef(fit) < 2)` is `TRUE`. If you want to see in how many iterations, run the edited code. – Rui Barradas Jun 11 '18 at 21:04
  • Hmm . . . it doesn't seem to be working; I tried multiple data sets and the iterations all show 1, with coefficients as large as 1000 – g3lo Jun 12 '18 at 13:35
  • For instance, I tried using it on a large data set to remove coefficients greater than 1000, and they are all still there – g3lo Jun 12 '18 at 13:40
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172998/discussion-between-g3lo-and-rui-barradas). – g3lo Jun 12 '18 at 16:01