1

I am trying to do the following operation in R for nrow=300,000 simulations (on ncol=30 variables):

down vote

accept

here's my code:

FS_DF <- read.csv("fs.csv", sep = ",")
Y_DF <- read.csv("Y.csv", sep = ",")
CALIBSCENS_DF <- read.csv("calib_scens.csv", sep = ",")
Y_DF$X <- NULL

X_mat <- matrix(1:1, nrow(CALIBSCENS_DF), nrow(FS_DF))

for (irow in 1:nrow(CALIBSCENS_DF)) { 
for (jrow in 1:nrow(FS_DF)) { 
for (krow in 1:ncol(FS_DF)) { 
    X_mat [irow, jrow] <- X_mat[irow, jrow] * (CALIBSCENS_DF[irow, krow] ^ FS_DF[jrow, krow])

}}}

fit <- .lm.fit(X_mat, as.matrix(sapply(Y_DF, as.numeric)))

Its taking forever to fill my X matrix. Can someone suggest a faster approach to do this operation. SCENS_DF, FS_DF are data frames. X_mat is a matrix.

Rishi
  • 19
  • 4
  • 5
    It looks like `X_mat * (CALIBSCENS_DF ^ FS_DF)` should work as these are elementwise operations. – lmo Aug 10 '17 at 13:10
  • Loops are quite slow in R. It's better to use one of the apply functions (which internally also use a loop but implemented in C). A good intro to these functions can be found in this answer: https://stackoverflow.com/questions/4162363/how-to-start-a-for-loop-in-r-programming Additionally your code is slow because the size of the matrix is not predefined which is also slow in R. Create an empty matrix of the expected dimensions an fill them later. – JereB Aug 10 '17 at 13:11
  • thanks Imo and JereB. I do pre-define X_mat as X_mat <- matrix(1:1, nrow(CALIBSCENS_DF), nrow(FS_DF)). Could you please suggest how the apply function wilol apply. I am a bit new to R. Thanks so much. – Rishi Aug 10 '17 at 13:15
  • @JereB Not entirely true. See [The R Inferno](http://www.burns-stat.com/documents/books/the-r-inferno/). – Roman Luštrik Aug 10 '17 at 13:20
  • @RomanLuštrik Thanks for the link. Which part do you refer to in particular? – JereB Aug 10 '17 at 13:22
  • @Imo do you mean take out the [jrow, krow]? how would the correct indices be recognized? – Rishi Aug 10 '17 at 13:32
  • @JereB Ch. 4 "A common reflex is to use a function in the apply family. This is not vectorization, it is loop-hiding. The apply function has a for loop in its definition. The lapply function buries the loop, but execution times tend to be roughly equal to an explicit for loop. (Confusion over this is understandable, as there is a significant difference in execution speed with at least some versions of S+.)" – Benjamin Aug 10 '17 at 13:33
  • @Benjamin, do you mean to say I cannot improve on my code and it is optimum as it is? – Rishi Aug 10 '17 at 13:34
  • 1
    @Rishi, no, I think your code could be written better. I'm trying to figure out what it is your code is doing. A reproducible example would be really helpful. do `CALIBCENS_DF`, `X_mat` and `FS_DF` all have the same dimensions? – Benjamin Aug 10 '17 at 13:36
  • Also, you're converting `CALIBCENS_DF` and `FS_DF` to matrices on each iteration. You could save a lot of time by converting those to matrices before the loop. Though I suspect that the loop could be avoided altogether with some clever linear algebra. – Benjamin Aug 10 '17 at 13:37
  • @Benjamin It is a matrix operation for linear multireg. CALIBSCENS [300k,30], FS_DF[1000,30] and X-mat[300k,1000]. I need the X matrix as X_mat * (CALIBSCENS_DF ^ FS_DF). so I can do lm(Y ~., data=cbind(y,X_mat)). – Rishi Aug 10 '17 at 13:39
  • ok, added my actual code here rather than as an answer. – Rishi Aug 10 '17 at 14:22
  • @JereB try about pre-allocating. – Roman Luštrik Aug 11 '17 at 07:26

3 Answers3

2

If this code is your bottleneck and you use loops, thats always a good sign that cpp might yield good results. We can use Rcpp to make it easier and have the cpp-function within our code.

Below you find my approach using Rcpp and some benchmarks against minem's approach, shaving off roughly 20% of runtime (highly depending on the sizes of the matrices).

library(Rcpp) # load the Rcpp library

# create some data...
CALIBSCENS_DF <- matrix(2:5, nrow = 2)
FS_DF <- matrix(2:5, nrow = 2)

# create the cpp-function, basically the same as yours, just adapted to cpp
cppFunction("
NumericMatrix cpp_fun(NumericMatrix A, NumericMatrix B) {
    NumericMatrix retMax(A.nrow(), B.nrow());

    long double mult;
    for (int irow = 0; irow < A.nrow(); irow++) {
        for (int jrow = 0; jrow < B.nrow(); jrow++) {
            mult = 1;
            for (int krow = 0; krow < B.ncol(); krow++) {
                mult *= pow(A(irow, krow), B(jrow, krow));
            }
            retMax(irow, jrow) = mult;
        }
    }
    return retMax;
}
")
# execute the function called 'cpp_fun' in R
cpp_mat <- cpp_fun(CALIBSCENS_DF, FS_DF)
cpp_mat
# [,1]  [,2]
# [1,] 1024  8192
# [2,] 5625 84375

Compare the function to the result shown by Minem

# for comparison, use Minems function
minem_fun <- function(A_mat, B_mat) {
  X <- matrix(1, ncol = nrow(B_mat), nrow = nrow(A_mat))
  for (irow in 1:nrow(A_mat)) {
    for (jrow in 1:nrow(B_mat)) {
      X [irow, jrow] <- prod(A_mat[irow, ] ^ B_mat[jrow, ])
    }
  }
  return(X)
}
minem_mat <- minem_fun(CALIBSCENS_DF, FS_DF)

identical(cpp_mat, minem_mat)
# [1] TRUE

Speed-benchmark

library(microbenchmark)
# small data
microbenchmark(
  minem = minem_fun(CALIBSCENS_DF, FS_DF),
  cpp = cpp_fun(CALIBSCENS_DF, FS_DF),
  times = 1000
)
# Unit: microseconds
# expr   min     lq      mean median     uq      max neval
# minem 9.386 10.239 11.198179  10.24 10.667   49.915  1000
# cpp 1.707  2.560  3.954538   2.56  2.987 1098.980  1000


# larger data
n <- 200
CALIB_large <- matrix(rnorm(n^2, mean = 100, sd = 10), nrow = n, ncol = n)
FS_large <- matrix(rnorm(n^2, mean = 2, sd = 0.5), nrow = n, ncol = n)

microbenchmark(
  minem = minem_fun(CALIB_large, FS_large),
  cpp = cpp_fun(CALIB_large, FS_large),
  times = 10
)
# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# minem 1.192011 1.197783 1.209692 1.201320 1.230812 1.238446    10
# cpp 1.009908 1.019727 1.023600 1.025791 1.028152 1.029427    10

Does that help you out?

David
  • 9,216
  • 4
  • 45
  • 78
  • Rcpp looks great. I will try it out. Thanks so much everyone. – Rishi Aug 10 '17 at 17:32
  • Help! Get Error: Loading required package: Rcpp Error in cppFunction("\nNumericMatrix cpp_fun(NumericMatrix A, NumericMatrix B) \n{\n\tNumericMatrix retMax(A.nrow(), B.nrow());\n\n long double mult;\n \n\tfor (int irow = 0; irow < A.nrow(); irow++) {\n for (int jrow = 0; jrow < B.nrow(); jrow++) {\n mult = 1;\n for (int krow = 0; krow < B.ncol(); krow++) {\n mult *= pow(A(irow, krow), B(jrow, krow));\n }\n retMax(irow, jrow) = mult;\n }\n }\n return retMax;\n}\n") : could not find function "cppFunction". – Rishi Aug 11 '17 at 07:50
  • Do you have `Rcpp` properly installed? I.e., does `Rcpp::evalCpp("2+2")` return 4? – David Aug 11 '17 at 08:13
  • Correctly spotted. I don't. RStudio tries to install Rtools to build it. But tries to do it in c drive and fails. ERROR: Setup was unable to create directory “C:\RBuildTools”. Error 5: Access Denied. How can I get it to install it in my custom directory? Thanks. – Rishi Aug 23 '17 at 14:54
  • You can try to install it manually by following this [link](https://cran.r-project.org/bin/windows/Rtools/). – David Aug 23 '17 at 15:17
  • thanks. Don't think downloading an exe file is allowed in my company. probably wait for the IT department to look into it. – Rishi Aug 23 '17 at 15:57
1

It looks like we can remove one loop this way:

CALIBSCENS_DF <- matrix(2:5, nrow = 2)
FS_DF <- matrix(2:5, nrow = 2)
X <- matrix(1, ncol = nrow(FS_DF), nrow = nrow(CALIBSCENS_DF))
for (irow in 1:nrow(CALIBSCENS_DF)) { 
  for (jrow in 1:nrow(FS_DF)) { 
      X [irow, jrow] <-
        X[irow, jrow] * prod(CALIBSCENS_DF[irow, ] ^ FS_DF[jrow, ])
    }}
X
#      [,1]  [,2]
# [1,] 1024  8192
# [2,] 5625 84375
minem
  • 3,640
  • 2
  • 15
  • 29
  • this really helped. for 1k simulation, run times fall from 3 mins to 1 min. Thanks so much. – Rishi Aug 10 '17 at 14:27
  • 1
    Given that `X` initially is always 1, you can shorten the code to `X[irow, jrow] <- prod(CALIB...)` – David Aug 10 '17 at 14:41
0

This isn't really an answer to your question yet, but won't fit in a comment. I think we need to take a good look at what you're attempting to do and decide if the for loop is doing what you think it is.

Let's simplify the code a little. Let's have matrices X, C, and F and define the loop

for (i in 1:nrow(C)){
  for (j in 1:nrow(F)){
    for (k in 1:ncol(F)){
      X[i, j] <- X[i, j] * C[i, k] ^ F[j, k]
    }
  }
}

Now let's step through what will happen as the loop iterates

i = 1; j = 1; k = 1      X[1, 1] <- X[1, 1] * C[1, 1] ^ F[1, 1]
i = 1; j = 1; k = 2      X[1, 1] <- X[1, 1] * C[1, 2] ^ F[1, 2]
i = 1; j = 1; k = 3      X[1, 1] <- X[1, 1] * C[1, 3] ^ F[1, 3]
...
i = 1; j = 1; k = 30      X[1, 1] <- X[1, 1] * C[1, 30] ^ F[1, 30]

Ultimately, X[1, 1] is dependent on C[1, 30] and F[1, 30]. You've done 29 iterations that have been overwritten. At this point, the loop will increment j and you'll get

i = 1; j = 2; k = 1      X[1, 2] <- X[1, 2] * C[1, 1] ^ F[2, 1]
i = 1; j = 2; k = 2      X[1, 2] <- X[1, 2] * C[1, 2] ^ F[2, 2]
i = 1; j = 2; k = 3      X[1, 2] <- X[1, 2] * C[1, 3] ^ F[2, 3]
...
i = 1; j = 2; k = 30      X[1, 2] <- X[1, 2] * C[1, 30] ^ F[2, 30]

So X[1, 2] is dependent on C[1, 30] and F[2, 30].

Is this the behavior you are expecting?

Benjamin
  • 16,897
  • 6
  • 45
  • 65
  • Yes I need to iteratively populate X[i,j] with CALIBSCENS[i,k] ^ FS_DF[j,k]. – Rishi Aug 10 '17 at 14:04
  • But is it accurate that you only need the 30th column of `C` and `F`, because that is what your code is doing. – Benjamin Aug 10 '17 at 15:06
  • isn't the second iteration taking the X values from the 1st iteration. so I am using previous C and FS for later iterations. – Rishi Aug 10 '17 at 17:29