You need to understand where the bottlenecks are in your code before you start trying to change it to make it faster. For example:
timer <- function(action, thingofinterest, limits) {
  st <- system.time({            # for the wall time
    Rprof(interval = 0.01)       # start R's sampling profiler
    for(j in 1:1000) {           # 1000 function calls
      test = vector("list")
      for(i in 1:length(limits)) {
        test[[i]] = action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)                  # stop the profiler
  })
  # return wall time and the top profiling results
  list(st, head(summaryRprof()$by.total))
}
action = function(x, y, i) {
  firsttask = cumsum(x[which(x < y[i])])
  secondtask = min(firsttask[which(firsttask > mean(firsttask))])
  thirdtask = mean(firsttask)
  fourthtask = length(firsttask)
  # note: secondtask is the min-over-mean and thirdtask is the mean,
  # so they are labeled accordingly here
  output = list(firsttask, data.frame(average = thirdtask,
                                      min_over_mean = secondtask,
                                      size = fourthtask))
  return(output)
}
timer(action, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 9.720 0.012 9.737
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 9.72 100.00 0.07 0.72
# "timer" 9.72 100.00 0.00 0.00
# "action" 9.65 99.28 0.24 2.47
# "data.frame" 8.53 87.76 0.84 8.64
# "as.data.frame" 5.50 56.58 0.44 4.53
# "force" 4.40 45.27 0.11 1.13
You can see that very little time is spent outside the call to your `action` function. Now, `for` is a special primitive and is therefore not captured by the profiler, but the total time reported by the profiler is very similar to the wall time, so there can't be much time missing from the profiler's accounting.

The thing that takes the most time in your `action` function is the call to `data.frame`. Remove that, and you get an enormous speedup.
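To see how expensive `data.frame()` construction is in isolation, here is a quick standalone check (my own sketch, not part of the original benchmark), comparing it against building an equivalent plain list:

```r
n <- 10000
# repeatedly build a one-row data frame vs. an equivalent list
t_df   <- system.time(for (k in 1:n) data.frame(a = 1, b = 2, c = 3))["elapsed"]
t_list <- system.time(for (k in 1:n) list(a = 1, b = 2, c = 3))["elapsed"]
unname(t_df / t_list)  # data.frame() is typically orders of magnitude slower
```

The gap comes from `data.frame()` doing name checking, coercion via `as.data.frame`, and row-name bookkeeping on every call, none of which `list()` does.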
action1 = function(x, y, i) {
  firsttask = cumsum(x[which(x < y[i])])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[which(firsttask > mean(firsttask))])
  fourthtask = length(firsttask)
  list(task = firsttask, average = secondtask,
       min_over_mean = thirdtask, size = fourthtask)
}
timer(action1, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 1.020 0.000 1.021
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 1.01 100.00 0.06 5.94
# "timer" 1.01 100.00 0.00 0.00
# "action" 0.95 94.06 0.17 16.83
# "which" 0.57 56.44 0.23 22.77
# "mean" 0.25 24.75 0.13 12.87
# "<" 0.20 19.80 0.20 19.80
Now you can also get rid of one of the calls to `mean` and both calls to `which`.
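Dropping `which` is safe here because logical subsetting returns the same result as `which`-based index subsetting when the logical vector contains no `NA`s, which is the case for these inputs. A quick check (my own, not from the original answer):

```r
x <- 1:1000
# logical subsetting vs. which()-based index subsetting
identical(x[x < 50], x[which(x < 50)])  # TRUE here (no NAs in x)
```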
action2 = function(x, y, i) {
  firsttask = cumsum(x[x < y[i]])
  secondtask = mean(firsttask)
  thirdtask = min(firsttask[firsttask > secondtask])
  fourthtask = length(firsttask)
  list(task = firsttask, average = secondtask,
       min_over_mean = thirdtask, size = fourthtask)
}
timer(action2, 1:1000, 50:100)
# [[1]]
# user system elapsed
# 0.808 0.000 0.808
#
# [[2]]
# total.time total.pct self.time self.pct
# "system.time" 0.80 100.00 0.12 15.00
# "timer" 0.80 100.00 0.00 0.00
# "action" 0.68 85.00 0.24 30.00
# "<" 0.20 25.00 0.20 25.00
# "mean" 0.13 16.25 0.08 10.00
# ">" 0.05 6.25 0.05 6.25
Now you can see there's a "significant" amount of time spent on work outside your `action` function. I put "significant" in quotes because it's 15% of the runtime, but only about 120 milliseconds. If your actual code took ~12 hours to run, this new `action` function would finish in ~1 hour.
The results would be marginally better if I pre-allocated the `test` list outside of the `for` loop in the `timer` function, but the call to `data.frame` is the biggest time-consumer.
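For completeness, that pre-allocation would look something like this sketch (`timer_prealloc` is my own name for it): `test` is created once at its full length with `vector("list", length(limits))` instead of being grown from an empty list on every outer iteration. Each pass over `i` overwrites every slot, so the final contents are unchanged.

```r
timer_prealloc <- function(action, thingofinterest, limits) {
  st <- system.time({
    Rprof(interval = 0.01)
    # allocate the result list once, at its full length
    test <- vector("list", length(limits))
    for (j in 1:1000) {
      for (i in seq_along(limits)) {
        test[[i]] <- action(thingofinterest, limits, i)
      }
    }
    Rprof(NULL)
  })
  list(st, head(summaryRprof()$by.total))
}
```

Growing a list one element at a time forces R to copy and extend it repeatedly, which is why pre-allocation helps, though here that cost is small next to `data.frame`.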