
I am totally convinced that an efficient R program should avoid using loops whenever possible and instead use the big family of apply functions. But this cannot happen without pain. For example, I am facing a problem whose solution involves a sum in the applied function; as a result, the list of results is reduced to a single value, which is not what I want. To be concrete, I will try to simplify my problem. Assume N = 100:

sapply(list(1:N), function(n) (
    choose(n,(floor(n/2)+1):n) * 
    eps^((floor(n/2)+1):n) * 
    (1- eps)^(n-((floor(n/2)+1):n))))

As you can see, the function inside causes the length of the built vector to explode, whereas using `sum` inside would collapse everything to a single value:

sapply(list(1:N), function(n) sum(
    choose(n,(floor(n/2)+1):n) * 
    eps^((floor(n/2)+1):n) * 
    (1- eps)^(n-((floor(n/2)+1):n))))

What I would like to have is a list of length N. So what do you think? How can I repair it?

MrFlick
Shaki
  • I was always wondering who is this guy who went all over the world and convinced everyone that `apply` is not a `for` loop – David Arenburg Nov 09 '14 at 21:26
  • @DavidArenburg did you know data.table is just one big for loop? – rawr Nov 09 '14 at 21:30
  • @rawr, everything is just a big `for` loop, the question is in what language it was written. `for` loops and the whole `*apply` family are essentially the same (`for` loops written in R), while vectorized functions are `C/C++` `for` loops, which is completely different. Not to mention that I didn't express my thoughts against `for` loops, rather I was surprised by this saying: *I am totally convinced that an efficient R program should avoid using loops whenever possible and instead should use the big family of the apply functions* – David Arenburg Nov 09 '14 at 21:32
  • Use loops. Then once you're more comfortable with `R`, use `*apply` **sparingly** as a way to simplify the code, but not necessarily as a way to make your code run faster. The adage about not using loops applies to using **vectorized** methods whenever possible. – Carl Witthoft Nov 09 '14 at 21:46
  • See also: http://stackoverflow.com/q/2908822/1412059 – Roland Nov 10 '14 at 10:36

1 Answer


Your question doesn't contain reproducible code (what's "eps"?), but on the general point about for loops and optimising code:

For loops are not inherently incredibly slow. For loops are incredibly slow when used improperly, because of how memory is assigned to objects. For primitive objects (like vectors), modifying a value in place has a tiny cost - but expanding the *length* of the vector is fairly costly, because what you're actually doing is creating an entirely new object, finding space for that object, copying the contents over, removing the old object, etc. For non-primitive objects (say, data frames), it's even more costly, because every modification, even if it doesn't alter the length of the data.frame, triggers this process.
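
As a rough illustration of this cost (this snippet is not part of the original answer, and the exact timings will vary by machine), compare growing a vector inside a loop with pre-allocating it; `grow` and `prealloc` are just hypothetical names for the two patterns:

# Growing the vector: every c() call creates a new, larger vector and copies
# the old contents into it
grow <- function(n) {
    x <- numeric(0)
    for (i in 1:n) x <- c(x, i^2)
    x
}

# Pre-allocating: the vector is created once and values are written in place
prealloc <- function(n) {
    x <- numeric(n)
    for (i in 1:n) x[i] <- i^2
    x
}

system.time(grow(1e5))      # noticeably slower
system.time(prealloc(1e5))  # near-instant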

But: there are ways to optimise for loops and make them run quickly. The easiest guidelines are:

  1. Do not run a for loop that writes to a data.frame. Use plyr or dplyr, or data.table, depending on your preference (see the sketch after this list).
  2. If you are using a vector and know the length of the output in advance, it will work a lot faster: specify the size of the output object before writing to it.
  3. Do not twist yourself into knots avoiding for loops.
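
As a rough sketch of the first guideline (this example is not from the original answer; `slow_df` and `fast_df` are hypothetical names), compare growing a data.frame row by row with building the columns as vectors and assembling the data.frame once:

# Growing a data.frame row by row copies the whole data.frame on every iteration
slow_df <- function(n) {
    df <- data.frame(x = numeric(0), y = numeric(0))
    for (i in 1:n) df <- rbind(df, data.frame(x = i, y = i^2))
    df
}

# Build the columns as plain vectors (or use data.table/dplyr) and assemble once
fast_df <- function(n) {
    x <- seq_len(n)
    data.frame(x = x, y = x^2)
}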

So in this case - if you're only producing a single value for each element of N, you can make that work perfectly nicely with a vector:

# Create the output object. We're specifying the length in advance so that
# writing to it is cheap.
output <- numeric(length = length(N))

# Start the for loop
for(i in seq_along(output)){
    output[i] <- your_computations_go_here(N[i])
}

This isn't actually particularly slow - because you're writing to a vector and you've specified the length in advance. And since data.frames are actually lists of equally-sized vectors, you can even work around some issues with running for loops over data.frames using this; if you're only writing to a single column in the data.frame, just create it as a vector and then write it to the data.frame via df$new_col <- output. You'll get the same output as if you had looped through the data.frame, but it'll work faster because you'll only have had to modify it once.
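
Applied to the question itself, a minimal sketch (the question doesn't define `eps`, so a value is assumed here; `output` and `output2` are just illustrative names): put the `sum()` inside the function and iterate over `1:N` rather than `list(1:N)`, so you get one value per n:

eps <- 0.1   # assumed value; the question doesn't define eps
N <- 100

# Pre-allocated for loop version
output <- numeric(N)
for (n in seq_len(N)) {
    k <- (floor(n/2) + 1):n
    output[n] <- sum(choose(n, k) * eps^k * (1 - eps)^(n - k))
}

# Equivalent sapply version: iterate over 1:N, not list(1:N)
output2 <- sapply(1:N, function(n) {
    k <- (floor(n/2) + 1):n
    sum(choose(n, k) * eps^k * (1 - eps)^(n - k))
})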

Oliver Keyes
  • In the last sentence I think you mean "faster" – John Paul Nov 09 '14 at 21:32
  • You can safely take `plyr` out of your first bullet as it usually scales worse than base R functions – David Arenburg Nov 09 '14 at 22:48
  • This would be an even better answer if it also mentioned briefly the difference between interpreted code (R `for` loops) and compiled code (R vectorized functions, usually C `for` loops). – Roland Nov 10 '14 at 10:33
  • Indeed, but Rcpp/C code is...you know; a next step. It's acceptable to treat the concept of "vectorisation" as voodoo, if you're just going to be an R-side user, but understanding how memory works is key. – Oliver Keyes Nov 10 '14 at 16:27