I would like to create a function using Rcpp that could outperform the pmax function from R base. I also tried to handle missing values inside of Rcpp function, and this might not be a good idea. All vector must have some missing values and they are all positive values. That is the reason I recoded missing to -1, so I could add it back in case a maximum value does not exists if all values are missing.
This was my first attempt, but no success yet:
library("benchr")
library("Rcpp")
Pmax <- function(...) {
argd_list <- list(...)
cppFunction("
NumericVector cpp_pmax(List args) {
List args0 = args[0];
int n_arg = args.length();
int n_vec = args0.length();
NumericVector out(n_vec);
out = args[0];
for (int i = 1; i < n_arg; ++i) {
NumericVector pa(n_vec);
pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (R_IsNA(out[j])) {
out[j] = -1;
}
if (R_IsNA(pa[j])) {
pa[j] = -1;
}
out[j] = std::max(out[j], pa[j]);
}
}
for (int j = 0; j < n_vec; ++j) {
if (out[j] == -1) {
out[j] = NA_REAL;
}
}
return out;
}
")
output <- cpp_pmax(argd_list)
return(output)
}
n <- 200000
x1 <- sample(0:1, n, replace = TRUE)
y1 <- sample(0:1, n, replace = TRUE)
z1 <- sample(0:1, n, replace = TRUE)
x1[sample(1:n, 90)]<-NA
y1[sample(1:n, 60)]<-NA
z1[sample(1:n, 70)]<-NA
pm1 <- Pmax(x1, y1, z1)
pm2 <- pmax(x1, y1, z1, na.rm = TRUE)
all(pm1 == pm2)
benchr::benchmark(pmax(x1, y1, z1, na.rm = TRUE),
Pmax(x1, y1, z1))
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
pmax(x1, y1, z1, na.rm = TRUE) 100 1.34 1.37 1.39 1.44 1.46 1.74 144 1.00
Pmax(x1, y1, z1) 100 13.30 13.50 13.80 19.90 15.70 67.50 1990 9.88
Edit:
I've removed some loops and just replaced the -1 by NA outside of Rcpp, and it speed up a little bit, but still not outperform the R base pmax.
Although Rcpp::pmax is a nice implementation it only handles two vector and not sure if it can handle missing values. I got different results when there are missing vales.
The second attempt was:
Pmax1 <- function(...) {
args_list <- list(...)
cppFunction("
NumericVector cpp_pmax(List args) {
List args0 = args[0];
int n_arg = args.length();
int n_vec = args0.length();
NumericVector out(n_vec);
out = args[0];
for (int i = 1; i < n_arg; ++i) {
NumericVector pa(n_vec);
pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (R_IsNA(out[j])) {
out[j] = -1;
}
if (R_IsNA(pa[j])) {
pa[j] = -1;
}
out[j] = std::max(out[j], pa[j]);
}
}
return out;
}
")
output <- cpp_pmax(args_list)
output[output == -1] <- NA
return(output)
}
Pmax2 <- function(...) {
args_list <- list(...)
cppFunction("
NumericVector cpp_pmax(List args) {
NumericVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (NumericVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
NumericVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (NumericVector::is_na(pa[j])) pa[j] = -1;
out[j] = std::max(out[j], pa[j]);
}
}
return out;
}
")
output <- cpp_pmax(args_list)
output[output == -1] <- NA
return(output)
}
n <- 200000
x <- sample(0:5, n, replace = TRUE)
y <- sample(0:5, n, replace = TRUE)
z <- sample(0:5, n, replace = TRUE)
w <- sample(0:5, n, replace = TRUE)
x[sample(1:n, 900)]<-NA
y[sample(1:n, 600)]<-NA
z[sample(1:n, 700)]<-NA
z[sample(1:n, 800)]<-NA
benchr::benchmark(pmax(x, y, z, w, na.rm = TRUE),
Pmax1(x, y, z, w),
Pmax2(x, y, z, w))
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
pmax(x, y, z, w, na.rm = TRUE) 100 2.38 2.43 2.46 2.46 2.48 2.6 246 1.00
Pmax1(x, y, z, w) 100 16.00 16.90 17.20 19.40 17.70 61.2 1940 6.98
Pmax2(x, y, z, w) 100 9.44 9.74 9.90 11.30 10.10 45.6 1130 4.02
Does anyone has an idea on how to make it faster than the R base pmax?
The idea was to have a generalized function to handle different number of vectors, all inside Rcpp function.
Update based on @DirkEddelbuettel and @Cole answer
Thank you for helping in optimize the code. Inspired by @DirkEddelbuettel and @Cole answers I just add the Rcpp::pmax to remove one of the loop and it helped also to speed it up.
library("bench")
library("Rcpp")
cppFunction("
IntegerVector cpp_pmax1(List args) {
IntegerVector tmp = args[0];
IntegerVector out = clone(tmp);
int n_arg = args.length();
int n_vec = out.length();
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (pa[j] > out[j]) out[j] = pa[j];
}
}
return out;
}
")
cppFunction("
IntegerVector cpp_pmax2(List args) {
IntegerVector tmp = args[0];
IntegerVector out = clone(tmp);
int n_arg = args.length();
int n_vec = out.length();
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
out = pmax(out, pa);
}
return out;
}
")
Pmax1 <- function(...) {
cpp_pmax1(list(...))
}
Pmax2 <- function(...) {
cpp_pmax2(list(...))
}
n <- 200000
x <- sample(0:5, n, replace = TRUE)
y <- sample(0:5, n, replace = TRUE)
z <- sample(0:5, n, replace = TRUE)
w <- sample(0:5, n, replace = TRUE)
k <- sample(0:5, n, replace = TRUE)
x[sample(1:n, 900)] <- NA
y[sample(1:n, 600)] <- NA
z[sample(1:n, 700)] <- NA
w[sample(1:n, 800)] <- NA
k[sample(1:n, 800)] <- NA
pm0 <- pmax(x, y, z, w, k, na.rm = TRUE)
pm1 <- Pmax1(x, y, z, w, k)
pm2 <- Pmax2(x, y, z, w, k)
benchr::benchmark(pmax(x, y, z, w, k, na.rm = TRUE),
Pmax1(x, y, z, w, k),
Pmax2(x, y, z, w, k))
Benchmark summary:
Time units : microseconds
expr n.eval min lw.qu median mean up.qu max total relative
pmax(x, y, z, w, k, na.rm = TRUE) 100 2880 2900 2920 3050 3080 8870 305000 5.10
Pmax1(x, y, z, w, k) 100 2150 2180 2200 2310 2350 8060 231000 3.85
Pmax2(x, y, z, w, k) 100 527 558 572 812 719 7870 81200 1.00
Thank you!