I am trying to compare the speeds of different value-replacement functions, but I am having a difficult time setting them onto a level playing field once I include data.table's set() and := functions.
I had set up my microbenchmark to have each function modify the dataframe locally, so that I wouldn't need to recreate a fresh dataframe each time. Like so:
# create replicable dataframe of numeric values with roughly one-third NAs
# (NA is one of three equally likely values in the sample pool)
set.seed(42)
Book1 <- as.data.frame(matrix(sample(c(NA, runif(2, min = 1, max = 4)),
                                     3e6 * 10, replace = TRUE),
                              dimnames = list(NULL, paste0("var", 1:3)),
                              ncol = 3))
Which looks like:
> str(Book1)
'data.frame':   10000000 obs. of  3 variables:
 $ var1: num  NA 3.81 3.74 3.74 3.81 ...
 $ var2: num  3.74 3.81 NA NA 3.81 ...
 $ var3: num  NA 3.74 3.81 3.74 NA ...
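As a quick sanity check on the NA fraction (run on a smaller n here for speed; the proportion is the same), about one third of entries come out NA, since NA is one of three equally likely sampled values:

```r
set.seed(42)
small <- as.data.frame(matrix(sample(c(NA, runif(2, min = 1, max = 4)),
                                     3e4 * 10, replace = TRUE),
                              dimnames = list(NULL, paste0("var", 1:3)),
                              ncol = 3))
# overall fraction of NA entries, approximately 1/3
mean(is.na(as.matrix(small)))
```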
Then, when I wrap the timings in local() or another function, I can avoid rebuilding the raw dataframe anew for each trial of each function that replaces the NAs with 0s. For example:
# Base R for-loop approach (from alexis_laz)
baseR.for.loop = function(x) {
  for (j in seq_len(ncol(x)))
    x[[j]][is.na(x[[j]])] = 0
}
> system.time(local(baseR.for.loop(Book1)))
   user  system elapsed
   0.28    0.14    0.42
and the dataframe is left unchanged.
> str(Book1)
'data.frame':   10000000 obs. of  3 variables:
 $ var1: num  NA 3.81 3.74 3.74 3.81 ...
 $ var2: num  3.74 3.81 NA NA 3.81 ...
 $ var3: num  NA 3.74 3.81 3.74 NA ...
Now, when I run data.table's modify-in-place functions to do the same, the original dataframe gets modified even though I wrap the action in both a local() and a function().
library(data.table)

DT.set.nms = function(DT) {
  for (j in names(DT))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
> system.time(local(DT.set.nms(Book1)))
   user  system elapsed
   0.14    0.00    0.14

> str(Book1)
'data.frame':   10000000 obs. of  3 variables:
 $ var1: num  0 3.81 3.74 3.74 3.81 ...
 $ var2: num  3.74 3.81 0 0 3.81 ...
 $ var3: num  0 3.74 3.81 3.74 0 ...
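For reference, a := analogue of the same replacement (assuming Book1 has first been converted with setDT(Book1), since := requires a data.table, unlike set(), which also accepts data.frames) might look like:

```r
library(data.table)

# loop over columns by name; (j) forces j to be read as a column name,
# and get(j) evaluates the column inside DT's scope
DT.assign.nms = function(DT) {
  for (j in names(DT))
    DT[is.na(get(j)), (j) := 0]   # also modifies DT by reference
}
```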
With some more study I discovered that this behavior is caused by data.table's over-allocation of memory and its modify-in-place semantics. While these are amazingly powerful, particularly as I had not thought this was even possible in R, they are not very helpful for my current microbenchmarking approach.
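The by-reference behavior reproduces on a toy dataframe too (the names here are mine, for illustration): set() updates the underlying column vector in place, so neither local() nor a function wrapper protects the original.

```r
library(data.table)

toy <- data.frame(a = c(NA, 1, NA))
fill0 <- function(x) set(x, which(is.na(x$a)), "a", 0)  # modifies x by reference

local(fill0(toy))
toy$a  # c(0, 1, 0): the original changed despite local() and the wrapper
```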
So my initial question is: how can I run a suite of functions side by side in microbenchmark and get them all to operate on a level playing field?
This is how I had been doing these NA replacement analyses (and how I have seen microbenchmarks typically performed).
library(microbenchmark)

perf_results <- microbenchmark(
  baseR_for        = local(baseR.for.loop(Book1)),
  baseR.replace    = local(replace(Book1, is.na(Book1), 0)),
  baseR.sbst.rssgn = local(Book1[is.na(Book1)] <- 0),
  times = 5L
)
> print(perf_results)
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
        baseR_for  423.1889  464.4622  569.9093  636.3708  648.8386  676.6857     5
    baseR.replace 1113.9829 1204.0874 1215.1211 1212.5199 1214.8138 1330.2012     5
 baseR.sbst.rssgn 1156.9010 1161.4675 1262.6653 1218.1743 1360.4346 1416.3490     5
and the dataframe is left untouched.
> str(Book1)
'data.frame':   10000000 obs. of  3 variables:
 $ var1: num  NA 3.81 3.74 3.74 3.81 ...
 $ var2: num  3.74 3.81 NA NA 3.81 ...
 $ var3: num  NA 3.74 3.81 3.74 NA ...
If I add the data.table functions to the analyses, they 'destroy' the original dataframe the first time it is passed in. To deal with this, I was considering copying the original dataset inside each timed expression, so that every function acts only on the copy. Additionally, I figured I should wrap every approach in its own function, so that the overhead of a function call doesn't become a confounding variable for just some of them. That looks like:
baseR.sbst.rssgn <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace <- function(x) { replace(x, is.na(x), 0)}
perf_results <- microbenchmark(
  baseR.for.loop   = baseR.for.loop(copy(Book1)),
  baseR.replace    = baseR.replace(copy(Book1)),
  baseR.sbst.rssgn = baseR.sbst.rssgn(copy(Book1)),
  DT.set.nms       = DT.set.nms(copy(Book1)),
  times = 5L
)
This seems to work, and the dataframe is left untouched, but I am still left wondering: is there a better approach to performing microbenchmarks on data.table's modify-in-place functions?
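One refinement I considered, sketched here on a small toy dataframe (the names fill0 and set.on.copy are mine): since every timing above now includes the cost of copy(), benchmarking copy() by itself gives a baseline that can be subtracted from the other results.

```r
library(data.table)
library(microbenchmark)

toy <- data.frame(a = c(NA, runif(98), NA))
fill0 <- function(x) { for (j in names(x)) set(x, which(is.na(x[[j]])), j, 0) }

perf <- microbenchmark(
  copy.only   = copy(toy),          # baseline: cost of copying alone
  set.on.copy = fill0(copy(toy)),   # replacement cost, including the copy
  times = 5L
)
# subtracting the copy.only median from set.on.copy approximates the pure set() cost
```

Since fill0() only ever sees a copy, toy keeps its NAs across repetitions, so every iteration does the same amount of replacement work.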