
Whenever I replace a for loop with an apply statement, my R scripts run faster, but here's an exception. I'm still inexperienced in using the apply family correctly, so what can I do to the apply statements to make them outperform (i.e. run faster than) the for loop?

Example data:

vc<-as.character(c("120,129,129,114","103,67,67,67,67,10,10,10,12","2,1,1,1,2,4,3,1,1,1,3,2,1,1","1,3,1,1,1,1,1,4",NA,"5","1,1,99","2,2,2,16,11,11,11,11,11,29,29,26,26,26,26,26,26,26,26,26,26,31,24,29,29,29,29,40,24,23,3,3,3,6,6,4,5,4,4,3,3,4,4,6,8,8,6,6,6,5,3,3,4,4,5,5,4,4,4,4,6,11,10,11,10,14,2,2,22,22,22,22,24,24,24,23,24,24,24,23,24,23,23,23,24,25,27,27,24,24,26,24,25,25,24,25,26,29,31,32,32,32,32,33,32,35,35,35,52,44,37,26","20,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,19,19,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,17,1,1,1,12,10","67,63,73,70,75,135,94,94,96,94,95,96,96,97,94,94,94,94,24,24,24,24,24,24,24,24,24,24,24,1,1,1"))

The goal is to populate a numeric matrix m.res where each row contains the top3 values of each element in vc. Here's the for loop:

fx.test1 <- function(vc) 
{
    m.res <- matrix(ncol = 3, nrow = length(vc))
    for (j in 1:length(vc)) {
        vn <- as.numeric(unlist(strsplit(vc[j], split = ",")))
        vn[is.na(vn)] <- 0
        vn2 <- rev(sort(vn))
        m.res[j, ] <- vn2[1:3]
    }
    return(m.res)
}

And below is my "apply solution". Why is it slower? How can I make it faster? Thank you!

fx.test2 <- function(vc) 
{
    m.res <- matrix(ncol = 3, nrow = length(vc))
    vc[is.na(vc)] <- "0"
    ls.vc <- sapply(vc, function(x) tail(sort(as.numeric(unlist(strsplit(x, split = ",")))), 3), simplify = TRUE)
    # pad elements that have fewer than 3 values with zeros
    ls.vc2 <- lapply(ls.vc, function(x) c(as.numeric(x), rep(0, times = 3 - length(x))))
    m.res <- as.matrix(t(as.data.frame(ls.vc2)))
    return(m.res)
}

system.time(m.res<-fx.test1(vc))
#   user  system elapsed 
#  0.001   0.000   0.001 

system.time(m.res<-fx.test2(vc))
#   user  system elapsed 
#  0.003   0.000   0.003

UPDATE: I followed the suggestions of @John and generated two trimmed and truly equivalent functions. Indeed, I was able to speed up the lapply function somewhat, but it's still SLOWER than the for loop. If you happen to have any ideas for how to optimize these functions for speed, please let me know. Thank you all.

fx.test3 <- function(vc) 
{
    L <- strsplit(vc, split = ",")
    m.res <- matrix(ncol = 3, nrow = length(vc))
    for (j in 1:length(vc)) {
        m.res[j, ] <- sort(c(as.numeric(L[[j]]), rep(0, 3)), decreasing = TRUE)[1:3]
    }
    return(m.res)
}



fx.test4 <- function(vc) 
{
    L <- strsplit(vc, split = ",")
    D <- t(as.data.frame(lapply(L, function(X) sort(c(as.numeric(X), rep(0, 3)), decreasing = TRUE)[1:3])))
    row.names(D) <- NULL
    m.res <- as.matrix(D)
    return(m.res)
}

system.time(fx.test3(vc))
#   user  system elapsed 
#  0.001   0.000   0.001

system.time(fx.test4(vc))
#   user  system elapsed 
#  0.002   0.000   0.002 
reviewer3
  • `apply` family commands are rarely much if any faster than a loop. You're pre-allocating, which is good. Can you rewrite using truly vectorized functions (`sum`, `rowSums`, etc.)? Have you seen this post? http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r/8474941#8474941 – Ari B. Friedman Nov 02 '13 at 00:52
  • @AriB.Friedman I don't really agree with that. `*apply` can often be *much* faster, if used in the right context (but let's remember they are still loops themselves). Conversely, `for` loops can actually be more efficient and *meaningful* if used in the right context (i.e. when you expect a side-effect not a return value - I hate seeing people use `lapply` to write a bunch of `data.frame`'s to file for instance). – Simon O'Hanlon Nov 02 '13 at 11:27
  • @SimonO101 I agree that sometimes they can be faster, but in most cases they're not. C.f. http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar . It's a common misconception that because `*apply` are more "R-like" and other R-like ways of programming (e.g. true vectorization with `sum`, etc.) are faster, that `*apply` is always much faster. And it doesn't seem to typically be true. – Ari B. Friedman Nov 02 '13 at 12:50
  • @SimonO101 Just to clarify: I don't think you hold that misconception--you're coming at it from a place of having seen the exceptions. But most people don't know the rule, nevermind the exceptions... – Ari B. Friedman Nov 02 '13 at 13:05
  • @AriB.Friedman lol - no worries! Actually I do (*did*) hold that misconception. I mean I would never use `apply` to e.g. do `rowMeans` but a `for` loop is faster than `apply(m,1,mean)` which I never would've thought! And I am wondering why someone decided to down vote a perfectly well laid out question, with sample data and timing information!!? – Simon O'Hanlon Nov 02 '13 at 13:09
  • @SimonO101, thank you, I wanted to make sure I posted something that was reproducible. So much more for me to learn about R! – reviewer3 Nov 03 '13 at 00:23
    Thank you, both - this is exactly why I posted the question: I had the expectation that apply would be faster and when it wasn't, I wanted to know why. @AriB.Friedman, those 2 posts are very helpful - I'm going through and experimenting. – reviewer3 Nov 03 '13 at 00:23

3 Answers


UPDATE2 & potential answer:

I have now simplified fx.test4 as follows, and it is now equivalent in speed to the for loop. So it was the extra conversion steps that made the lapply solution slower, as @John pointed out. In addition, the assumption that *apply HAD to be faster was faulty, as discussed by @AriB.Friedman and @SimonO101. Thank you all!

fx.test5 <- function(vc) 
{
    L <- strsplit(vc, split = ",")
    m.res <- t(sapply(seq_along(L), function(X) sort(c(as.numeric(L[[X]]), rep(0, 3)), decreasing = TRUE)[1:3]))
    return(m.res)
}

fx.test5(vc)
      [,1] [,2] [,3]
 [1,]  129  129  120
 [2,]  103   67   67
 [3,]    4    3    3
 [4,]    4    3    1
 [5,]    0    0    0
 [6,]    5    0    0
 [7,]   99    1    1
 [8,]   52   44   40
 [9,]   20   19   19
[10,]  135   97   96

system.time(fx.test5(vc))
   user  system elapsed 
  0.001   0.000   0.001 

UPDATE3: Indeed, on a longer example, the *apply function is faster (by a hair).
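(vc2 isn't shown above. A comparably long input can be built along these lines; the stand-in data and the repetition factor are illustrative, not the actual vc2:)

```r
# Illustrative stand-in only -- the actual vc2 used for the timings below is not shown.
# Recycle the example vector vc into a much longer one:
if (!exists("vc")) vc <- c("120,129,129,114", NA, "5", "1,1,99")  # minimal stand-in
vc2 <- rep(vc, times = 100000)
```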

system.time(fx.test3(vc2))
#   user  system elapsed 
#  3.596   0.006   3.601 
system.time(fx.test5(vc2))
#   user  system elapsed 
#  3.355   0.006   3.359
reviewer3

Your problem can be solved using the `concat.split` function from the `splitstackshape` package:

library(splitstackshape)
kk<-data.frame(vc)
nn<-concat.split(kk,split.col="vc",sep=",")
head(nn[1:10,1:4])
                           vc vc_1 vc_2 vc_3
1             120,129,129,114  120  129  129
2 103,67,67,67,67,10,10,10,12  103   67   67
3 2,1,1,1,2,4,3,1,1,1,3,2,1,1    2    1    1
4             1,3,1,1,1,1,1,4    1    3    1
5                        <NA>   NA   NA   NA
6                           5    5   NA   NA

You can manipulate the nn dataframe to get the columns with max value.
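For instance, the top-3 matrix could then be pulled out of nn with a row-wise sort (a sketch; it assumes concat.split's default vc_1, vc_2, ... column names, and the tiny stand-in nn is only there so the snippet runs on its own):

```r
# Tiny stand-in so this snippet runs on its own; normally nn comes from concat.split above
if (!exists("nn")) nn <- data.frame(vc = c("3,1", "5"), vc_1 = c(3, 5), vc_2 = c(1, NA), vc_3 = c(NA, NA))

vals <- as.matrix(nn[, grep("^vc_", names(nn))])  # keep only the split numeric columns
vals[is.na(vals)] <- 0                            # treat missing entries as 0
m.res <- t(apply(vals, 1, function(r) sort(r, decreasing = TRUE)[1:3]))
```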

Metrics
  • Thank you, @Metrics. I'm trying to actually understand the apply family better and R itself, but it's good to know there's a convenient package for this purpose. – reviewer3 Nov 03 '13 at 00:27

You're doing lots of stuff in your loops, apply or for, that shouldn't be there. The main feature of apply is not so much that it is faster than for but that it encourages an expression style that keeps things vectorized as much as possible (i.e. as little inside your loops as possible). The thing R is particularly slow at is interpreting a function call, and each time through the loop it needs to interpret every function call it encounters. Sometimes loops are unavoidable, but they should be made as small as possible.

Your strsplit can just be called once, outside the first sapply. That way you call it only one time. Then you also don't need unlist before as.numeric. You can also sort with decreasing = TRUE and take the first three elements instead of additionally calling tail (although maybe that's no faster than a [1:3] selector). All of that saves function interpretation from being done over and over inside your loop.

You don't have to pre-allocate your matrix because you're going to generate the values all at once and shape them into a matrix.

See if following that advice speeds things up.
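Putting that advice together might look something like this (a sketch, not your final code; top3 is just an illustrative name, and vapply is used since each result has fixed length 3):

```r
# Sketch of the advice: split once outside the loop, keep the loop body tiny
top3 <- function(vc) {
    L <- strsplit(vc, split = ",")  # called once, not once per element
    # sort() drops NAs, and the padded zeros guarantee at least 3 values per row
    t(vapply(L, function(x) sort(c(as.numeric(x), 0, 0, 0), decreasing = TRUE)[1:3],
             numeric(3)))
}

top3(c("120,129,129,114", NA, "5", "1,1,99"))
```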

John
  • Thank you, @John, it's true I simply don't "get" the apply family yet. And yes, I didn't have to pre-allocate in fx2, i just did it to be consistent. How can I do the strsplit outside? Thank you for the other pointers as well! – reviewer3 Nov 03 '13 at 00:26
  • As Yoda R says, "don't ask, do" (maybe I'll change my name on here and write my answers that way). What does the result of `strsplit(vc)` look like? – John Nov 03 '13 at 00:28
  • omg. That's so much easier - I didn't think it could be that straightforward, so I thought you meant something more complicated. Now I just have to get it out the list, but should be much easier!! Thank you! – reviewer3 Nov 03 '13 at 00:32
  • As an aside, the difference in speed between your particular apply and for loop versions is not down to apply vs. for per se. You make lots of other changes as well, like using rev and [1:3] instead of tail. – John Nov 03 '13 at 00:45
  • yes you're completely right - I had to change the flow a bit to make the apply statements work for me, but then i didn't go back to change the for loop, too. Very inconsistent of me, I should have done that; I've come across other examples where the choice of vectorized function matters a lot. Good point! – reviewer3 Nov 03 '13 at 00:56
  • Well, it's definitely not the rev [1:3] vs. tail (just checked it), but now that I realize I can do strsplit directly, this will speed up everything anyway. My lesson from this is: try the simplest solution first, thank you! – reviewer3 Nov 03 '13 at 01:05