-2

I have a double loop in R. It works well, but it runs slowly on big data frames. So I would like to move the loop to C++ through the Rcpp package, while still calling an R function inside the loop. The R loop is:

> output2=list()
> for (j in r){
+   for (i in 1:nrow(DF)){
+     output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+   }
+ }

And the output is going to be a list. An example of DF and r is:

 > r
 [1] "A==A[i] & B==B[i] "           "A==A[i] & C==C[i] "          
 [3] "B==B[i] & C==C[i] "           "A==A[i] & B==B[i] & C==C[i] "
 > DF
    A  B  C
 1 11 22 88
 2 11 22 47
 3  2 30 21
 4  3 30 21

My question is how I can put the expression in the C++ code. Another question is whether this approach is better than writing the entire code in C++. I would be grateful if someone could help me with this issue. Regards,

Dirk Eddelbuettel
Citizen
  • have you tried `lapply` or `foreach`? – Sixiang.Hu Jan 03 '18 at 12:34
  • I seriously doubt that writing your loop in C++ is going to help (also using `lapply` and friends is not going to help). Please show a complete code example, so that we can give proper advice (https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Using `nrow(subset(...))` for example, seems quite suspicious. – Jan van der Laan Jan 03 '18 at 12:37
  • j runs through a vector and i runs through the data frame rows, and `lapply` did not work well for me, but I have not tried `foreach`; I do not know how that function works. Thanks. – Citizen Jan 03 '18 at 12:40
  • Ok, slightly better. Please look at the link I gave you, and also show us (an example) of `r` and `DF`. – Jan van der Laan Jan 03 '18 at 12:45
  • I'll remove the Rcpp tag. This has nothing to do with Rcpp, apart from wishing for a free pony. – Dirk Eddelbuettel Jan 03 '18 at 13:48

2 Answers

1

For loops aren't necessarily slow in R. What can be slow is calling a set of functions a very large number of times (and with more recent versions of R, even that isn't as slow as it was). However, for loops can often be avoided completely by using vectorised code, which is many times faster.
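
As a minimal sketch of what "vectorised" means here, the snippet below counts, for each element of a vector, how many elements share its value: once with a loop and once with a table lookup. The toy vector and the variable names are illustrative assumptions, not part of the original question.

```r
# Toy vector (illustrative only)
A <- c(11, 11, 2, 3)

# Loop version: one full comparison pass over A for every element
count_loop <- integer(length(A))
for (i in seq_along(A)) {
  count_loop[i] <- sum(A == A[i])
}

# Vectorised version: tabulate each distinct value once, then look it up
idx <- match(A, unique(A))        # position of each value among the distinct values
count_vec <- tabulate(idx)[idx]   # frequency of that value, repeated per element

stopifnot(identical(count_loop, count_vec))
```

The loop does work proportional to the square of the vector length, while the lookup version passes over the data a fixed number of times.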

In general, using eval and parse is not needed and usually indicates a suboptimal solution. In this case (without knowing the complete problem), I am not completely sure how to avoid them. However, writing the loops more efficiently gives a speed-up of more than a factor of 20 without using Rcpp.

Generate data

r <- c("A==A[i] & B==B[i]", "A==A[i] & C==C[i] ", "B==B[i] & C==C[i] ",
  "A==A[i] & B==B[i] & C==C[i] ")

DF <- read.table(textConnection(" A  B  C
1 11 22 88
2 11 22 47
3  2 30 21
4  3 30 21"))
DF <- DF[sample(nrow(DF), 1E3, replace=TRUE), ]

Measure time of initial implementation

> system.time({
+   output2=list()
+   for (j in r){
+    for (i in 1:nrow(DF)){
+      output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+    }
+   }
+ })
   user  system elapsed 
  1.120   0.007   1.127 

Preallocate result; doesn't help much in this case

> system.time({
+   output2=vector(length(r), mode = "list")
+   names(output2) <- r
+   for (j in r){
+     output2[[j]] <- numeric(nrow(DF))
+      for (i in 1:nrow(DF)){
+        output2[[j]][i]=nrow(subset(DF,eval(parse(text=j))))
+      }
+   }
+ })
   user  system elapsed 
  1.116   0.000   1.116 

subset is not needed, as we only need the number of rows. subset creates a completely new data.frame, which generates overhead

> system.time({
+   output2=vector(length(r), mode = "list")
+   names(output2) <- r
+   for (j in r){
+     output2[[j]] <- numeric(nrow(DF))
+      for (i in 1:nrow(DF)){
+        output2[[j]][i]=sum(eval(parse(text=j), envir = DF))
+      }
+   }
+ })
   user  system elapsed 
  0.622   0.003   0.626 

Parsing each element of r takes time and is repeated nrow(DF) times; move it out of the inner loop

> system.time({
+   output2=vector(length(r), mode = "list")
+   names(output2) <- r
+   for (j in r){
+     output2[[j]] <- numeric(nrow(DF))
+     expr <- parse(text=j)
+      for (i in 1:nrow(DF)){
+        output2[[j]][i]=sum(eval(expr, envir = DF))
+      }
+   }
+ })
   user  system elapsed 
  0.054   0.000   0.054 
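
A quick sanity check, as a sketch: the script below runs the original loop and the parse-once loop on the four-row example data and confirms that they give the same counts. The names `slow`, `fast` and `counts` are illustrative, and the conditions are the ones from the question with trailing spaces removed.

```r
# Small copy of the example data from the question
DF <- data.frame(A = c(11, 11, 2, 3),
                 B = c(22, 22, 30, 30),
                 C = c(88, 47, 21, 21))
r <- c("A==A[i] & B==B[i]", "A==A[i] & C==C[i]",
       "B==B[i] & C==C[i]", "A==A[i] & B==B[i] & C==C[i]")
n <- nrow(DF)

# Original approach: parse the text and build a subset on every iteration
slow <- list()
for (j in r) {
  for (i in seq_len(n)) {
    slow[[j]][i] <- nrow(subset(DF, eval(parse(text = j))))
  }
}

# Faster approach: parse once per condition, count TRUE values with sum()
fast <- setNames(vector("list", length(r)), r)
for (j in r) {
  expr <- parse(text = j)
  counts <- integer(n)
  for (i in seq_len(n)) {
    counts[i] <- sum(eval(expr, envir = DF))
  }
  fast[[j]] <- counts
}

# Both approaches should agree element by element
stopifnot(isTRUE(all.equal(slow, fast)))
```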

A more readable and even faster implementation using dplyr

> library(dplyr)
> system.time({
+ output3 <- DF %>% group_by(A,B) %>% mutate(a = n()) %>%
+   group_by(A,C) %>% mutate(b = n()) %>%
+   group_by(B,C) %>% mutate(c = n()) %>%
+   group_by(A,B,C) %>% mutate(d = n()) 
+ })
   user  system elapsed 
  0.010   0.000   0.009 
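
If the set of conditions varies, the same grouping idea can be written once in base R and applied to any list of column sets. This is only a sketch under two assumptions: every condition has the form "these columns equal the values in row i", and the data contain no NA values. The helper `count_matches` and the list `col_sets` are hypothetical names, not part of dplyr or of the original code.

```r
# Hypothetical helper: for every row, count the rows that share its values
# in the columns named by `cols` (same result as "A==A[i] & B==B[i]" etc.)
count_matches <- function(df, cols) {
  key <- interaction(df[cols], drop = TRUE)   # one group label per row
  idx <- as.integer(key)                      # group label as an integer code
  tabulate(idx, nbins = nlevels(key))[idx]    # group size, looked up per row
}

DF <- data.frame(A = c(11, 11, 2, 3),
                 B = c(22, 22, 30, 30),
                 C = c(88, 47, 21, 21))

# Each condition is described by the columns it compares, not by text to parse
col_sets <- list(c("A", "B"), c("A", "C"), c("B", "C"), c("A", "B", "C"))
names(col_sets) <- vapply(col_sets, paste, character(1), collapse = "&")

output <- lapply(col_sets, count_matches, df = DF)
```

With this layout a longer list of conditions only means a longer `col_sets`; no code has to change per condition, and neither eval nor parse is involved.
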
Jan van der Laan
  • Thank you very much. r and DF are vector and data frame examples, so r can be bigger. If I use dplyr, I think I would have to change the code for each r, that is why I think it is not practical. – Citizen Jan 04 '18 at 08:02
  • I have run the previous code, with length(r)=120 and dim(DF)=11540 7 , and it takes much time. That is why, I believe the fastest way could be with Rcpp. – Citizen Jan 04 '18 at 08:16
0

I would have preferred to post this as a comment, since it doesn't fully answer the question, but I don't have enough reputation to do so.

R is an interpreted language whereas C is a compiled one. Loops are slow in R, but your expression output2[[j]][i]=nrow(subset(DF,eval(parse(text=j)))) accounts for at least 99% of the execution time. Therefore, mixing the two languages won't help. I advise you to keep everything in R and find a way to speed up the process (maybe a single loop with a different expression?), or to find a way to translate your expression into C. I know that a lot of basic R functions are coded in C (as you can see here); maybe that is already the case for nrow, subset and parse.
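
One way to check this for a given function, as a rough sketch, is `is.primitive()`, which reports whether a function is a primitive implemented directly in C. The selection of functions below and the name `is_prim` are just for illustration.

```r
# TRUE means the function is a primitive implemented directly in C
fns <- list(sum = sum, length = length, nrow = nrow,
            subset = subset, parse = parse)
is_prim <- vapply(fns, is.primitive, logical(1))
print(is_prim)

# Printing a non-primitive function shows its R-level source
print(nrow)
```

In base R, `sum` and `length` are primitives, while `nrow`, `subset` and `parse` are ordinary R closures. A closure can still hand its work to C internally, so this check is only a first indication.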

You can also use a LAPACK/BLAS library to speed up some R functions:

LAPACK/BLAS handles matrix math in R. If that's all you need, you can find libraries that are much faster than the vanilla ones in R (you can use some of them in R too to improve performance!).

quoted from this Stack Overflow topic
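
For what it's worth, the count can be phrased as matrix algebra, which is the kind of work BLAS handles. The sketch below is illustrative only, with assumed names `g`, `Z` and `counts`: it builds a 0/1 indicator matrix with one column per distinct (A, B) combination and multiplies it by its column sums. For large data this matrix can get big, so it is not necessarily faster than a grouped count.

```r
A <- c(11, 11, 2, 3)
B <- c(22, 22, 30, 30)

# One group label per row for the condition "A==A[i] & B==B[i]"
g <- interaction(A, B, drop = TRUE)

# n x k indicator matrix: Z[i, m] is 1 when row i belongs to group m
Z <- outer(as.integer(g), seq_len(nlevels(g)), "==") * 1

# colSums(Z) holds the group sizes; Z %*% sizes picks each row's group size
counts <- drop(Z %*% colSums(Z))
```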

txemsukr
  • 1
    I could try to do the entire code in C++, but the issue is how I can translate nrow, subset, eval and parse in an easy way. Thank you. – Citizen Jan 03 '18 at 12:58
  • Added LAPACK/BLAS to my answer, maybe it can help you – txemsukr Jan 03 '18 at 13:12
  • LAPACK or BLAS are not relevant for performance if you don't do matrix algebra. – Roland Jan 03 '18 at 13:48
  • I am familiar with matrix algebra, but not with LAPACK/BLAS. My question is whether there is a way to do the same loop using matrix algebra. Thanks. – Citizen Jan 04 '18 at 08:22