
I am trying to compare the speeds of different value-replacement functions, but I am having a difficult time putting them on a level playing field once I include data.table's `set()` and `:=` functions.

I had set up my microbenchmark to have each function modify the dataframe locally, so that I wouldn't need to recreate a fresh dataframe for each run. Like so:

# create a reproducible dataframe of numeric values with roughly one-third NAs
set.seed(42)
Book1 <- 
    as.data.frame(matrix(sample(c(NA, runif(2, min = 1, max = 4)),   
                         3e6*10, replace=TRUE),  
                  dimnames = list(NULL, paste0("var", 1:3)), ncol=3))

Which looks like:

> str(Book1)
'data.frame': 10000000 obs. of  3 variables:
$ var1: num  NA 3.81 3.74 3.74 3.81 ...
$ var2: num  3.74 3.81 NA NA 3.81 ...
$ var3: num  NA 3.74 3.81 3.74 NA ...
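As a quick sanity check (sampling uniformly from three values, one of which is NA, should leave roughly a third of the cells as NA):

> mean(is.na(as.matrix(Book1)))   # roughly 0.33, i.e. about one in three cells is NA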

Then, when I wrap the timings in local() or in another function, I am spared rebuilding the raw dataframe anew for each trial of each function that works diligently at replacing the NAs with 0s.

For example:

# Base R functions (from alexis_laz)
baseR.for.loop = function(x) { 
    # assignment here touches only the function's local copy of x
    for(j in 1:ncol(x))
        x[[j]][is.na(x[[j]])] = 0
}
> system.time({local(baseR.for.loop(Book1))})
user  system elapsed 
0.28    0.14    0.42 

and the dataframe is left unchanged.

> str(Book1)
'data.frame': 10000000 obs. of  3 variables:
$ var1: num  NA 3.81 3.74 3.74 3.81 ...
$ var2: num  3.74 3.81 NA NA 3.81 ...
$ var3: num  NA 3.74 3.81 3.74 NA ...
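This is base R's copy-on-modify at work: the function only ever changes its own local copy of x. A minimal illustration (f here is just a throwaway helper of my own, not part of the benchmark):

f <- function(x) { x[1, 1] <- 0; x }   # touches only the local copy of x
res <- f(Book1)                        # the modified copy comes back in res
is.na(Book1[1, 1])                     # still TRUE: the caller's Book1 is unchanged
is.na(res[1, 1])                       # FALSE: only the copy was changed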

Now when I run the modify-in-place data.table functions to do the same, the original dataframe gets modified even though I wrap the call in both a local() and a function().

library(data.table)
DT.set.nms = function(DT) {
    # set() assigns by reference, so DT is modified in place even though it is a data.frame
    for (j in names(DT))
        set(DT, which(is.na(DT[[j]])), j, 0)
}
> system.time({local(DT.set.nms(Book1))})
user  system elapsed 
0.14    0.00    0.14 
> str(Book1)
'data.frame': 10000000 obs. of  3 variables:
$ var1: num  0 3.81 3.74 3.74 3.81 ...
$ var2: num  3.74 3.81 0 0 3.81 ...
$ var3: num  0 3.74 3.81 3.74 0 ...
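One way to confirm the in-place behavior (a side check, not part of the benchmark, using data.table's address() helper, which reports where an object lives in memory) is to note that the address does not change across the call:

> address(Book1)      # note the (machine-dependent) address
> DT.set.nms(Book1)
> address(Book1)      # same address as before: Book1 was updated in place, not copied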

With some more study I discovered that the odd behavior was being caused by data.table's over-allocation of memory and its modify-in-place capabilities. While these are in fact amazingly powerful, particularly as I had not thought this was possible in R, they are not very helpful for my current microbenchmarking approach.

So my initial question is:

How can I run a suite of functions side by side in microbenchmark and get the functions to all operate on a level playing field?

This is how I had been doing these NA replacement analyses (and how I have seen microbenchmarks typically performed).

library(microbenchmark)
perf_results <- microbenchmark(
    baseR_for        = local(baseR.for.loop(Book1)),
    baseR.replace    = local(replace(Book1, is.na(Book1), 0)),
    baseR.sbst.rssgn = local(Book1[is.na(Book1)] <- 0),
    times = 5L
)
> print(perf_results)
Unit: milliseconds
             expr       min        lq      mean    median        uq       max neval
        baseR_for  423.1889  464.4622  569.9093  636.3708  648.8386  676.6857     5
    baseR.replace 1113.9829 1204.0874 1215.1211 1212.5199 1214.8138 1330.2012     5
 baseR.sbst.rssgn 1156.9010 1161.4675 1262.6653 1218.1743 1360.4346 1416.3490     5

and the dataframe is left untouched.

> str(Book1)
'data.frame': 10000000 obs. of  3 variables:
$ var1: num  NA 3.81 3.74 3.74 3.81 ...
$ var2: num  3.74 3.81 NA NA 3.81 ...
$ var3: num  NA 3.74 3.81 3.74 NA ...

If I add the data.table functions to the analyses, they 'destroy' the original dataframe the first time it is passed into them. So to deal with this I was considering copying the original dataset each time, so that each function acts only on the copy. Additionally, I was thinking that I would have to wrap every approach in its own custom function, so that being wrapped in a function does not itself become a confounding variable. That looks like this:

baseR.sbst.rssgn <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace <- function(x) { replace(x, is.na(x), 0)}

perf_results <- microbenchmark(
    baseR.for.loop   = baseR.for.loop(copy(Book1)),
    baseR.replace    = baseR.replace(copy(Book1)),
    baseR.sbst.rssgn = baseR.sbst.rssgn(copy(Book1)),
    DT.set.nms       = DT.set.nms(copy(Book1)),
    times = 5L
)

This seems to work, and the dataframe is left untouched, but I am still left wondering...
Is there a better approach to performing microbenchmarks on data.table's modify-in-place functions?
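For what it's worth, one refinement I am considering (just a sketch on my part) is to benchmark copy() on its own, so the overhead of the deep copy can be accounted for when reading the results above:

perf_copy <- microbenchmark(
    copy.only = copy(Book1),   # measures only the deep copy that every expression above pays
    times = 5L
)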

  • Using `copy` is the way to do it. Fwiw, ime, if `system.time`/`times=1` is not good enough to tell you the answer, then you're not optimizing the right thing. – eddi Nov 17 '17 at 17:27
  • Context: https://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-an-r-dataframe/41585689#41585689 – Frank Nov 17 '17 at 20:54
  • @uwe - it is indeed! Not sure why all my searches didn't pull this up. It looks like `copy()` really is the best approach then, and a great idea to benchmark it separately, too. Thank you! – leerssej Nov 18 '17 at 01:37

0 Answers