0

I am using a function in R in which a number of data sets are generated and saved in a list (using lapply). These data sets are then evaluated with the lmer-function of the lme4 package (again using lapply).

This is the main function together with a helper function to estimate the models (= estimate_model) and a function that generates datasets (= generate_data):

estimate_model <- function(est.model,data) {  

    tmp <- lme4::lmer(formula(est.model),data=data)  
    fixeff <- getME(tmp,"fixef")  
    resvar <- getME(tmp,"sigma")^2 
    raneff <- as.data.frame(VarCorr(tmp))$vcov[1:3]  
    res <- c(fixeff,raneff,resvar)  
    res 

}  

generate_data <- function(est.model, no.samples) {

    # get some things  
    data <- est.model@frame  
    data.star <- lapply(1:no.samples,function(x) { res = data; res; })           
    return(data.star)  

} 

main_function <- function(est.model, no.samples) {  

    # generate data  
    data.star = generate_data(est.model,no.samples)  

    # evaluate the datasets  
    b.fit = lapply(data.star,function(x) { estimate_model(est.model,x) })  
    b.fit = t(do.call("rbind",b.fit))  

    # output  
    return(b.fit)  

}

The function calls the generate_data function in which an estimated lme4-object is used to generate a number of new datasets (defined by no.samples) that are saved in a list and each element of the list is then evaluated using the lmer-function that is called in the estimate_model - function.

Now, my problem or question. When I run my code and use system.time to check the speed of the function, I found that my computer needs about 24 s for no.samples = 100. The code would like this

library(nlme)  
library(lme4)  
# some example data
dfs <- Orthodont
fit <- lmer(distance ~ 1 + age + (1 + age|Subject), data = dfs)
# speed of the function
system.time(main_function(fit, no.samples = 100))
... 23.24 seconds

This is slow and due to an accident I observed that when I split my main function into the generate_data function and another function that basically contains the b.fit-lines, for example

call_to_estimate <- function(data.star,est.model) {  

    # fit the model to all B-samples  
    b.fit = lapply(data.star,function(x) { estimate_model(est.model,x) })
    b.fit = t(do.call("rbind",b.fit))  

    return(b.fit)   
}

and then independently call these two functions, then this is a lot faster. For instance, for 100 samples the code and time is

system.time(XX <- generate_data(fit, no.samples = 100))    
...0 seconds

system.time(call_to_estimate(XX,fit))    
.. 4.55 seconds

So, splitting the function into two parts and calling the resulting two functions in succession increases the performance a lot. I have no explanation for this behavior and was wondering whether someone could explain this to me because I very often use functions that call two or more functions within it.

Thanks in advance, Stefan

Stefan
  • 11
  • 2
  • 1
    When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Mar 02 '18 at 19:53
  • I have added a reproducible example. – Stefan Mar 05 '18 at 09:54

0 Answers0