I am trying to speed up a nested if/else statement that selects a model based on a parameter. The function below receives its input from a previous function through the "model" argument, which can be "glm", "logarithmic", or "squared". This is currently done with an if, else if, else if chain. I have tried reframing it as two ifelse statements, but my benchmarks show no performance increase. My program has to do this many times over for many different datasets, so any increase in performance would be huge.

I am now trying case_when() from dplyr, but having never used it, I'm not sure whether it is appropriate. Any suggestions?

The code below gives me the following: Error: must be list, not lm

Edited to add sample data and previous function for reproducibility

data <- sample(1000)
df <- data.frame(data)
df[2] <- sample(1000)
names(df) <- c("y", "x")

# previous function
produce_model <- function(model, df) {
  if (model == "glm") {
    model <- lm(y ~ x, df)
  } else if (model == "logarithmic") {
    model <- lm(log(y) ~ x, df)
  } else if (model == "squared") {
    model <- lm(y^2 ~ x, df)
  }
  return(model)
}


# Trying to improve with case_when()
library(dplyr)
produce_model <- function(model, df) {
  case_when(
    model == 'glm' ~ lm(y~x,df=df),
    model == 'logarithmic' ~ lm(log(y)~x,df=df),
    model == 'squared' ~ lm(y^2~x,df=df)
  )
  return(model)
}
mitch
  • you could use `switch` for this, i.e. `switch(model, 'glm' = lm(y~x,df), 'logarithmic' = lm(log(y)~x,df), 'squared' = lm(y^2~x,df))` – user20650 Oct 16 '18 at 18:20
  • Thanks for the tip. Unfortunately, it seems to perform about the same as my original if else statement – mitch Oct 16 '18 at 18:41
  • It's likely slightly neater code though. But I'd think any bottleneck comes from estimating the model and not from dispatching to the regression type. I suppose if you have lots of models to run, you could try parallelising it – user20650 Oct 16 '18 at 18:45
  • If you make this a [reproducible question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), someone can likely help. I'd also suggest the ["Many Models"](http://r4ds.had.co.nz/many-models.html) chapter of R for Data Science – camille Oct 16 '18 at 18:54
  • Do you think you could expand on this comment a little? 'But i'd think any bottleneck comes from estimating the model and not dispatching to the regression type' I'm not really sure what you mean by this – mitch Oct 16 '18 at 18:55
  • You say you are trying to "speed up" this code. I just ran a benchmark and it takes my R installation on average 700 nanoseconds to evaluate a condition like `model == "glm"`. Now, you have 3 `==` conditions, so multiplying by 3 we could expect about 0.0000021 seconds being spent on checking those conditions. That's a worst-case scenario: if the first condition is true, then only the first condition will be checked... – Gregor Thomas Oct 16 '18 at 19:13
  • Benchmarking your model on the sample data you provide, a model fit for the simplest `"glm"` case takes 1.2 milliseconds = 0.0012 seconds. So your total code execution time is 0.0012 + 0.0000021 seconds = 0.0012021 seconds. If you could work magic and make it so that the condition-checking was absolutely instantaneous, you would reduce the time to just model fitting, 0.0012 seconds, a whopping 0.1% faster. Basically a rounding error. – Gregor Thomas Oct 16 '18 at 19:17
  • To speed up code effectively, you need to speed up the parts that are slow. What you're asking in this question is how to speed up the fastest part of your code. This is pointless. In your code, the fitting of the model takes up about 99.9% of the processing time, so if you want to speed up your code by more than 0.1%, it's the model fitting part you need to speed up. If you search you can find resources for that, such as the `speedglm` package. – Gregor Thomas Oct 16 '18 at 19:20
  • Thanks for your input @Gregor, I understand this is not truly a bottleneck. However, there are many examples of this exact same type of nested if else statement within my script, and some are many levels deeper. I am trying to understand my options on this one example in order to apply that learning to the others. Essentially, I'm looking at my options when it comes to this setup – mitch Oct 16 '18 at 19:26
  • See also [How to speed up `lm` in R?](https://stackoverflow.com/q/25416413/903061). This is also a good case for parallel processing. – Gregor Thomas Oct 16 '18 at 19:26
  • **Checking an `if` statement is trivial**. Not only is it stupid simple, it is so common that if there were even *tiny* speed gains to be had, they would be quickly incorporated into base R to benefit everyone. You will not get a noticeable increase in speed by trying to optimize your `if` statements unless your code is doing literally nothing else. (And even then, there probably aren't gains to be had.) You should [profile your code](http://adv-r.had.co.nz/Profiling.html) if you want to see where it's actually taking time. – Gregor Thomas Oct 16 '18 at 19:28
  • See also [General strategies for speeding up R code](https://stackoverflow.com/a/8474941/903061), an excellent answer to [Speeding up the loop operation in R](https://stackoverflow.com/a/8474941/903061)? – Gregor Thomas Oct 16 '18 at 19:34
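
To make the suggestions in the comments concrete, here is a minimal sketch, not a definitive implementation. It shows two ways to dispatch on the model name without a nested if/else: a named list of formulas and the `switch` call suggested above, followed by a rough timing of dispatch alone against dispatch plus fitting. The names `model_formulas`, `produce_model_switch`, the seed, and the repetition count `n` are assumptions made for illustration and are not from the question. `I(y^2)` is used only to make the arithmetic meaning of `^` explicit.

```r
# Sample data shaped like the data in the question (the seed is an assumption).
set.seed(1)
df <- data.frame(y = sample(1000), x = sample(1000))

# Hypothetical lookup table: model name -> formula.
model_formulas <- list(
  glm         = y ~ x,
  logarithmic = log(y) ~ x,
  squared     = I(y^2) ~ x
)

# Option 1: look the formula up by name, then fit once.
produce_model <- function(model, df) {
  fit_formula <- model_formulas[[model]]  # NULL if the name is not in the list
  if (is.null(fit_formula)) {
    stop("Unknown model type: ", model)
  }
  lm(fit_formula, data = df)
}

# Option 2: switch(), with an unnamed final argument as the default branch.
produce_model_switch <- function(model, df) {
  switch(model,
    glm         = lm(y ~ x, data = df),
    logarithmic = lm(log(y) ~ x, data = df),
    squared     = lm(I(y^2) ~ x, data = df),
    stop("Unknown model type: ", model)
  )
}

# Rough timing: dispatch alone versus dispatch plus model fitting.
n <- 200
dispatch_only <- system.time(
  for (i in seq_len(n)) model_formulas[["squared"]]
)[["elapsed"]]
dispatch_and_fit <- system.time(
  for (i in seq_len(n)) produce_model("squared", df)
)[["elapsed"]]
print(c(dispatch_only = dispatch_only, dispatch_and_fit = dispatch_and_fit))
```

Neither option should be expected to run noticeably faster than the original if/else chain, for the reasons given in the comments: the time goes into `lm`, not into choosing the branch. The gain is readability, plus an explicit error for an unrecognised model name instead of silently returning the input string. Both are scalar dispatch, whereas `case_when` is vectorised and builds a vector from the right-hand sides, which is probably why it rejects an `lm` object.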
