In R, I create a data frame in the following way:

data <- data.frame(dummy=rep('dummy',10000))
data$number = 0
data$another = 1

When I run a for loop that assigns values to the data frame (iterating through rows), my code runs extremely slowly:

calculation <- function() {2}
somethingElse <- function() {3}

system.time(
 for (i in 1:10000) {
   data[i,2]=calculation()
   data[i,3]=somethingElse()
 }
)

The above snippet runs in 20 seconds on my laptop. In other languages like C or Java, this finishes instantly. Why is it so slow in R? I remember reading that R stores matrices column by column (unlike C, for example, where it's row by row). But still, I'm puzzled about why it takes so much time. Shouldn't my data.frame fit comfortably in memory (avoiding slow disk writes)?

As a follow-up to my question, I'd like to ask for a quick way to fill my data frame by row, if one exists.

EDIT: Please note that I'm not trying to assign the constants 2 and 3 to my data frame. In the actual problem that I was trying to solve, calculation() and somethingElse() are a bit more complicated and depend on another data frame. My question is about efficient insertion into a data frame in a loop (and I'm also curious about why this is so slow).

Davor
  • [Initialize your data structures, then fill them in, rather than expanding them each time.](http://stackoverflow.com/a/8474941/636656). Illustration of why this is so [bad](http://menugget.blogspot.com/2011/11/another-aspect-of-speeding-up-loops-in.html#more). – Ari B. Friedman May 23 '13 at 20:35
  • If you are unable to provide an example that actually matches your situation, no one will be able to help you. – joran May 23 '13 at 20:38
  • In the first snippet, I do initialize the data frame. If you do a str(data) after the first snippet, it is "10000 obs. of 3 variables". – Davor May 23 '13 at 20:42
  • Look, the short answer is that R isn't C, and so techniques that are fast in C may be slow in R. If I had to guess, the real solution to your problem would be to totally rethink how you're calculating the values being inserted. But we obviously can't help with that, because you've provided us no information on that topic. – joran May 23 '13 at 20:45
  • Once again, the reason it is slow is that R is at heart functional, which means functions don't (usually) have side effects, which in turn necessitates a certain amount of copying of arguments. In this case, R is likely minimizing the copying as much as it can, but there are limits within which it has to operate in that regard. But as I said, the solution is to completely rethink how the functions calculation() and somethingElse() work, such that they calculate more than one value at a time. – joran May 23 '13 at 20:55
  • @joran I'm quite clear that R is not C. If you know why this is so slow, please elaborate in more detail. What kind of argument copying is taking place here? Why is reading from the data frame in the same way fast, but writing is not (even though the data frame is pre-allocated)? – Davor May 23 '13 at 21:09
  • take a look at this - http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r – eddi May 23 '13 at 21:24
  • I'm not enough of an expert at R internals to explain what's going on at the C level, but try removing the `system.time`, add `tracemem(data)` before the for loop, and reduce the loop to only 5-10 iterations. You'll see all the copying taking place. – joran May 23 '13 at 21:37
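The `tracemem` suggestion in the last comment can be tried with a sketch like the following. It uses a 10-row version of the question's data frame and only five iterations so the output stays readable. It assumes an R build with memory profiling enabled (checked here with `capabilities("profmem")`); the exact number of copies reported depends on the R version.

```r
# Small version of the data frame from the question
data <- data.frame(dummy = rep("dummy", 10))
data$number <- 0
data$another <- 1

# tracemem() only works if R was built with memory profiling support
can_trace <- isTRUE(capabilities("profmem"))
if (can_trace) tracemem(data)

# Each row-wise assignment goes through `[<-.data.frame`;
# when tracing is on, R reports every time `data` is duplicated
for (i in 1:5) {
  data[i, 2] <- 2
}

if (can_trace) untracemem(data)
```

If tracing is available, each reported `tracemem[...]` line corresponds to a duplication of the data frame, which is the copying joran refers to.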

1 Answer

The answer is vectorization:

data[,2] = 2
data[,3] = 3

finishes instantly for me. For loops in interpreted languages like R are veeeeery slow. Performing this kind of operation by assigning a vector directly (i.e. vectorized) is much, much faster.

Programming in a new language requires a new mindset. Your approach reflects a compiled-language way of thinking; in R there is no need for the for loop here.
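When the values are not constants, the same idea still applies as long as each function returns a whole vector with one value per row. The sketch below is only an illustration: the data frame `other` and the bodies of `calculation()` and `somethingElse()` are hypothetical stand-ins, since the question does not show the real computations.

```r
n <- 10000

# Hypothetical second data frame that the calculations depend on
other <- data.frame(x = seq_len(n), y = rev(seq_len(n)))

# Hypothetical vectorized stand-ins: each returns one value per row of `other`
calculation <- function(df) df$x * 2
somethingElse <- function(df) df$x + df$y

data <- data.frame(dummy = rep("dummy", n))

# One assignment per column instead of one assignment per cell
data$number <- calculation(other)
data$another <- somethingElse(other)
```

Each column is written once, so the data frame is modified twice in total rather than 20,000 times.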

Paul Hiemstra
  • In my actual problem, I assign to this data frame something that I've calculated (from some other data). I know that I can assign identical values using what you wrote above, but my values are all different and I need to assign them one by one. – Davor May 23 '13 at 20:33
  • The for loop isn't slow, it's the copying induced by the assignment. – joran May 23 '13 at 20:35
  • @Davor if you want an accurate answer, please extend your example above. This answers the question you ask above, although this is not your real question. – Paul Hiemstra May 23 '13 at 20:39
  • I've edited my question so that it's more clear about what I'm actually asking. – Davor May 23 '13 at 20:52
  • As long as the result of `calculation` is a vector of the same length as `data[,2]`, this will still work fine. – Paul Hiemstra May 23 '13 at 20:54
  • @joran I know that the for loop isn't slow, it's the writing to the data frame. What's puzzling is that reading from the data frame in the same way finishes within one second. – Davor May 23 '13 at 20:54
  • @PaulHiemstra I've tried declaring two helper vectors (both numeric) of size 10000, which I fill in with the results of calculation() and somethingElse(). Then I assign these two vectors to my data frame and it finishes in no time (less than a second). So that is actually a solution to one part of my question. The other part I'm still unclear about: why is it so slow in the first place? :-/ – Davor May 23 '13 at 21:06
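The helper-vector workaround described in the last comment might look roughly like this. It is a sketch under assumptions: the per-row functions here take the row index `i` and have made-up bodies, because the real `calculation()` and `somethingElse()` are not shown in the thread.

```r
# Hypothetical per-row computations (the real ones are not shown in the thread)
calculation <- function(i) i * 2
somethingElse <- function(i) i + 3

n <- 10000
data <- data.frame(dummy = rep("dummy", n))

# Pre-allocate plain numeric vectors; element assignment on these is cheap
number <- numeric(n)
another <- numeric(n)

for (i in seq_len(n)) {
  number[i] <- calculation(i)
  another[i] <- somethingElse(i)
}

# Touch the data frame only twice, once per column
data$number <- number
data$another <- another
```

The loop itself is unchanged; only its target differs. Writing into an atomic vector avoids the per-iteration `[<-.data.frame` call, which is where the copying in the original loop comes from.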