1

I would like to build a new column in a data.frame myDF which is the value returned for each row by a function getval taking the elements in this row as arguments. getval also uses an external vector v1 as argument. For example:

myn = 1000
a = seq(0, 1, length.out = myn)
b = seq(-1, 1, length.out = myn)
myDF = expand.grid(a=a, b=b)

set.seed(13)
v1 = rnorm(100)

getval = function(a, b, v) {
  return(sum(a*v + b/2*v))
}

myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
#             a  b      val
# 1 0.000000000 -1 3.091267
# 2 0.001001001 -1 3.085078
# 3 0.002002002 -1 3.078889
# 4 0.003003003 -1 3.072700
# 5 0.004004004 -1 3.066512
# 6 0.005005005 -1 3.060323

But this is too slow (here ~4 seconds, but increasing a lot for higher myn).

I am looking for the fastest way to implement this - Contest! ;-)

All solutions (incl. parallelizing?) and packages (dplyr, data.table?) are welcome - I really need something as fast as possible for myn = 5000 for example.


EDIT Actually, getval is not so (easily?) vectorizable...

getval = function(a, b, v) {
  return(sum(a/(a/v +1) + b/(b+2) * v))
}

myDF$val = apply(myDF, 1, function(x) {getval(a=x[1], b=x[2], v=v1)})
head(myDF)
#             a  b      val
# 1 0.000000000 -1 6.182533
# 2 0.001001001 -1 6.282782
# 3 0.002002002 -1 6.383424
# 4 0.003003003 -1 6.484682
# 5 0.004004004 -1 6.586980
# 6 0.005005005 -1 6.691260
ztl
  • 2,512
  • 1
  • 26
  • 40

2 Answers2

6

You should try as hard as possible to avoid looping over rows. For your example:

getval = function(a, b, v) {
  return((a + b / 2) *sum(v))
}

myDF$val1 = getval(myDF$a, myDF$b, v1)
head(myDF)
#            a  b      val     val1
#1 0.000000000 -1 3.091267 3.091267
#2 0.001001001 -1 3.085078 3.085078
#3 0.002002002 -1 3.078889 3.078889
#4 0.003003003 -1 3.072700 3.072700
#5 0.004004004 -1 3.066512 3.066512
#6 0.005005005 -1 3.060323 3.060323

You won't be able to beat performance of such a vectorized solution. If this is not possible in R, implement everything (including the loop) with Rcpp. It's not difficult with such simple functions.

Edit:

Here is an Rcpp function for your second example. It's quite simple because of Rcpp sugar functions such as sum.

library(Rcpp)
cppFunction(
  "
  NumericVector rcpp_geval(const NumericVector a, const NumericVector b, const NumericVector v) {
    const double n = a.length();
    NumericVector res(n);
    for (double i = 0; i < n; ++i) {
      res[i] = sum(a[i]/(a[i]/v +1) + b[i]/(b[i]+2) * v);
    }
    return res;
  }
  "
)

myDF$val1 <- rcpp_geval(myDF$a, myDF$b, v1)

head(myDF)
#            a  b      val     val1
#1 0.000000000 -1 6.182533 6.182533
#2 0.001001001 -1 6.282782 6.282782
#3 0.002002002 -1 6.383424 6.383424
#4 0.003003003 -1 6.484682 6.484682
#5 0.004004004 -1 6.586980 6.586980
#6 0.005005005 -1 6.691260 6.691260
Roland
  • 127,288
  • 10
  • 191
  • 288
  • Wow great! Very impresive, thanks! Unfortunately, my reproducible example was not so good, as the `getval` function is slightly more complicated --> can you still achieve this with the updated function? (Note: a solution illustrating Rcpp would be very helpful too!) – ztl Feb 02 '21 at 09:39
  • Great solution using `Rcpp`!! THANKS so much, I learnt a lot! (in the meantime, following your first advice, I extracted from my real function the non-vectorizable part to vectorize afterwards, encountering the new problem of how to vectorize on a series of vectors. As you seem particularly comfortable with this, here is the question just in case: https://stackoverflow.com/questions/66007991/how-to-vectorize-an-operation-on-a-series-of-vectors-in-r) – ztl Feb 02 '21 at 10:52
  • @ztl I don't see how that is different to your second example here. Just write an Rcpp function where instead of `v` you pass a matrix, which you then subset inside the loop. – Roland Feb 02 '21 at 11:59
  • It is not different - I was not aware of your Rcpp solution yet and attempted to vectorize as per your first advice, which lead me to another problem (hence another question). But your Rcpp solution definitely beats the others! Thanks again! – ztl Feb 02 '21 at 12:10
2

You can use mapply in base R :

myDF$val <- mapply(function(x, y) getval(x, y, v1), myDF$a, myDF$b)

Writing it using data.table :

library(data.table)
setDT(myDF)
myDF[, val := Map(function(x, y) getval(x, y, v1), a, b)]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213