I've been testing the most efficient ways to replace NAs in data frames.
I started by comparing solutions that replace NAs with 0s on a 1-million-row, 12-column dataset.
Throwing all the pipe-capable ones into microbenchmark, I got the following results.
Question 1: Is there a way to test subset left-assignment statements (e.g. df1[is.na(df1)] <- 0) inside the benchmark function?
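The closest I can think of (an untested sketch, using the df1 built below) is to wrap the assignment in braces and work on a local copy, so df1 itself keeps its NAs between iterations:

# Untested sketch: brace-wrap the assignment and bind a local copy first.
# df1 is never modified, so every iteration starts from a frame that still
# contains NAs; `[<-` forces its own copy anyway, so the extra binding
# should cost almost nothing.
microbenchmark(
  subset_assign = {
    tmp <- df1
    tmp[is.na(tmp)] <- 0
    tmp
  },
  times = 1000L
)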
library(dplyr)
library(tidyr)
library(microbenchmark)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, replace = TRUE),
                            dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
op <- microbenchmark(
mut_all_ifelse = df1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))),
mut_at_ifelse = df1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)),
# df1[is.na(df1)] <- 0 would sit here, but I couldn't make it work directly
# inside this call (my best attempt is the brace-wrapped sketch above)
replace = df1 %>% replace(., is.na(.), 0),
mut_all_replace = df1 %>% mutate_all(funs(replace(., is.na(.), 0))),
mut_at_replace = df1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)),
replace_na = df1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0, var10 = 0, var11 = 0, var12 = 0)),
times = 1000L
)
print(op) #standard data frame of the output
Unit: milliseconds
expr min lq mean median uq max neval
mut_all_ifelse 769.87848 844.5565 871.2476 856.0941 895.4545 1274.5610 1000
mut_at_ifelse 713.48399 847.0322 875.9433 861.3224 899.7102 1006.6767 1000
replace 258.85697 311.9708 334.2291 317.3889 360.6112 455.7596 1000
mut_all_replace 96.81479 164.1745 160.6151 167.5426 170.5497 219.5013 1000
mut_at_replace 96.23975 166.0804 161.9302 169.3984 172.7442 219.0359 1000
replace_na 103.04600 161.2746 156.7804 165.1649 168.3683 210.9531 1000
boxplot(op) #boxplot of output
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()
To test the subset-assignment operator, I had originally run these tests:
> set.seed(24)
> Book1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e8 *12, replace=TRUE),
+ dimnames = list(NULL, paste0("var", 1:12)), ncol=12))
> system.time({
+ Book1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))) })
user system elapsed
52.79 24.66 77.45
>
> system.time({
+ Book1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)) })
user system elapsed
52.74 25.16 77.91
>
> system.time({
+ Book1[is.na(Book1)] <- 0 })
user system elapsed
16.65 7.86 24.51
>
> system.time({
+ Book1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0, var10 = 0, var11 = 0, var12 = 0)) })
user system elapsed
3.54 2.13 5.68
>
> system.time({
+ Book1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)) })
user system elapsed
3.37 2.26 5.63
>
> system.time({
+ Book1 %>% mutate_all(funs(replace(., is.na(.), 0))) })
user system elapsed
3.33 2.26 5.58
>
> system.time({
+ Book1 %>% replace(., is.na(.), 0) })
user system elapsed
3.42 1.09 4.51
In these one-shot tests, base replace() comes in first. In the microbenchmark trials, however, replace falls well back in the ranks while the tidyr replace_na() wins (by a nose).
Running the singular tests repeatedly, and on data frames of different shapes and sizes, always finds base replace() in the lead (see the sketch below for how I repeat them).
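A minimal sketch of how I repeat a one-shot test; replicate() re-evaluates the expression on every pass, so each rep times a full replacement on the unchanged Book1:

# Time the same expression several times and average the elapsed seconds.
elapsed <- replicate(5, system.time(Book1 %>% replace(., is.na(.), 0))["elapsed"])
mean(elapsed)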
Question 2: How can its benchmark performance be the only result that falls so far out of line with the simple test results?
More perplexingly:
Question 3: How can all the mutate_all/mutate_at(replace()) variants run faster than the simple replace()?
Many folks report this (http://datascience.la/dplyr-and-a-very-basic-benchmark/, and all the links in that article), but I still haven't found an explanation for why, beyond the fact that hashing and C++ are used.
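My best guess (an assumption; I haven't verified it against the dplyr source) is that the mutate_* variants replace NAs one atomic column at a time, while replace() on the whole data frame has to go through the much slower data-frame subset-assignment method with a logical matrix index. A column-wise version in plain base R shows the idea:

# Hypothetical base-R analogue of what I assume mutate_all(funs(replace(...)))
# is effectively doing: fix each column (an atomic vector) on its own,
# avoiding the slow [<-.data.frame path.
col_replace <- function(df, value = 0) {
  df[] <- lapply(df, function(x) replace(x, is.na(x), value))
  df
}
system.time(col_replace(Book1))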
With special thanks already to Tyler Rinker (https://www.r-bloggers.com/microbenchmarking-with-r/) and akrun (https://stackoverflow.com/a/41530071/5088194).