I am using a function in R in which a number of data sets are generated and saved in a list (using lapply). These data sets are then evaluated with the lmer-function of the lme4 package (again using lapply).
This is the main function together with a helper function to estimate the models (= estimate_model) and a function that generates datasets (= generate_data):
estimate_model <- function(est.model,data) {
tmp <- lme4::lmer(formula(est.model),data=data)
fixeff <- getME(tmp,"fixef")
resvar <- getME(tmp,"sigma")^2
raneff <- as.data.frame(VarCorr(tmp))$vcov[1:3]
res <- c(fixeff,raneff,resvar)
res
}
generate_data <- function(est.model, no.samples) {
# get some things
data <- est.model@frame
data.star <- lapply(1:no.samples,function(x) { res = data; res; })
return(data.star)
}
main_function <- function(est.model, no.samples) {
# generate data
data.star = generate_data(est.model,no.samples)
# evaluate the datasets
b.fit = lapply(data.star,function(x) { estimate_model(est.model,x) })
b.fit = t(do.call("rbind",b.fit))
# output
return(b.fit)
}
The function calls the generate_data function in which an estimated lme4-object is used to generate a number of new datasets (defined by no.samples) that are saved in a list and each element of the list is then evaluated using the lmer-function that is called in the estimate_model - function.
Now, my problem or question. When I run my code and use system.time to check the speed of the function, I found that my computer needs about 24 s for no.samples = 100. The code would like this
library(nlme)
library(lme4)
# some example data
dfs <- Orthodont
fit <- lmer(distance ~ 1 + age + (1 + age|Subject), data = dfs)
# speed of the function
system.time(main_function(fit, no.samples = 100))
... 23.24 seconds
This is slow and due to an accident I observed that when I split my main function into the generate_data function and another function that basically contains the b.fit-lines, for example
call_to_estimate <- function(data.star,est.model) {
# fit the model to all B-samples
b.fit = lapply(data.star,function(x) { estimate_model(est.model,x) })
b.fit = t(do.call("rbind",b.fit))
return(b.fit)
}
and then independently call these two functions, then this is a lot faster. For instance, for 100 samples the code and time is
system.time(XX <- generate_data(fit, no.samples = 100))
...0 seconds
system.time(call_to_estimate(XX,fit))
.. 4.55 seconds
So, splitting the function into two parts and calling the resulting two functions in succession increases the performance a lot. I have no explanation for this behavior and was wondering whether someone could explain this to me because I very often use functions that call two or more functions within it.
Thanks in advance, Stefan