I'm using the standard glm function together with the step function on 100k rows and 107 variables. A plain glm fit finished within a minute or two, but when I wrapped it as step(glm(...)) it ran for hours.

I tried to run it as a matrix, but it has been running for about half an hour and I'm not sure it will ever finish. When I ran it on 9 variables it gave me the answers in a few seconds, but with 9 warnings, all of them the same: "Warning messages: 1: glm.fit: fitted probabilities numerically 0 or 1 occurred"

I used the line of code below. Is it wrong? What should I do to improve the running time?

logit1back <- step(glm(IsChurn ~ var1 + var2 + var3 + var4 +
                       var5 + var6 + var7 + var8 + var9,
                       data = tdata, family = 'binomial'))
Ben Bolker
mql4beginner
  • This isn't wrong from a programming perspective, but it is quite bad from a statistical perspective. Start by Googling the warning message, which probably would have led you [here](http://stackoverflow.com/q/8596160/324364) anyway. That should prompt you to look at your data more closely before blindly fitting lots of models. Next Google "stepwise regression bad" and start reading. – joran Mar 25 '14 at 16:40
  • Another phrase to throw into the ring would be "information theoretic approach model selection". – Roman Luštrik Mar 25 '14 at 17:21
  • Assigning `glm` before using `step` might speed it up, meaning `x <- glm(...)`, then `step(x)`. As it is now, you're calling `glm` for every step, which requires R to make more calculations than necessary. Notice in `example(step)` the linear model is assigned prior to calling `step` – Rich Scriven Mar 25 '14 at 18:08
  • I would also suggest using the `scope` argument in `step`. Do you really want to consider all 107 variables? Are there not some that you can rule out as in not particularly meaningful to the problem or collinearity issues? Even though you have 100k observations, that is still fewer than 10 data points per covariate if you are using all 107 variables. Was this the first step to your approach? – rawr Mar 25 '14 at 18:38
  • Thanks for your answers and replies. In my experience (I have built about 50 predictive models in various fields, though not in R), using stepwise selection in logistic regression has helped me a lot to get a stable model. Again, thanks a lot for your feedback. – mql4beginner Mar 26 '14 at 12:54
  • Hey, what should I do if I have more than 20 columns and many of them are categorical? How do I do logistic regression on that, both stepwise and regular? – indra_patil May 05 '14 at 11:57
  • First, check each of them by itself, then do a regular fit, and then stepwise. – mql4beginner May 07 '14 at 07:08
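
For readers who want to try the suggestions from the comments, here is a minimal sketch of fitting the model once, storing it, and then passing it to `step` with a `scope` argument. It is only an illustration under stated assumptions: the data are simulated (the names `tdata`, `IsChurn`, and `var1` to `var9` are borrowed from the question, the values are made up), and forcing `var1` to stay in the model is a hypothetical choice. Storing the fit is mainly convenient for inspecting it before stepping; the sketch does not measure whether it changes running time. The count of extreme fitted values is a rough heuristic related to the "fitted probabilities numerically 0 or 1" warning, not the exact check that `glm.fit` performs.

```r
# Simulated stand-in for the asker's data. The names mirror the question,
# but every value here is made up for illustration.
set.seed(1)
n <- 2000
tdata <- as.data.frame(matrix(rnorm(n * 9), ncol = 9))
names(tdata) <- paste0("var", 1:9)
tdata$IsChurn <- rbinom(n, 1, plogis(0.8 * tdata$var1 - 0.5 * tdata$var2))

# Fit the full model once and keep it, so it can be inspected before stepping.
full <- glm(IsChurn ~ ., data = tdata, family = binomial)

# Rough heuristic (an assumption, not glm.fit's own test): count fitted
# probabilities that are extremely close to 0 or 1.
p_hat <- fitted(full)
n_extreme <- sum(p_hat < 1e-8 | p_hat > 1 - 1e-8)

# Backward search that is never allowed to drop var1.
# The lower bound is a hypothetical choice; trace = 0 silences the step log.
reduced <- step(full,
                scope = list(lower = ~ var1),
                direction = "backward",
                trace = 0)

# Names of the terms that survived the search.
kept <- attr(terms(reduced), "term.labels")
print(kept)
```

With 107 candidate variables, a tighter `scope` and `trace = 0` reduce the amount of work and output per step, but the search still refits many models, so it can remain slow on 100k rows.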
