
I have the following code running and it's taking a long time to finish. How do I know whether it's still doing its job or it got stuck somewhere?

noise4 <- NULL
for (i in 1:length(noise3)) {
    if (is.na(noise3[i])) {
        next
    } else {
        noise4 <- c(noise4, noise3[i])
    }
}

noise3 is a vector with 2418233 data points.

Concerned_Citizen
  • a for loop with 2 million iterations isn't going to show R in a very good light. – David Heffernan Sep 06 '11 at 16:33
  • @David Heffernan: especially when you grow an object instead of pre-allocating. @GTyler: it looks like you could just use `noise4 <- na.omit(noise3)`. – Joshua Ulrich Sep 06 '11 at 16:34
  • What do you mean? Is it just R not good with that many data points? – Concerned_Citizen Sep 06 '11 at 16:34
  • So will pre-allocating the size of the vector help? I think I know the resulting dimension of that vector. – Concerned_Citizen Sep 06 '11 at 16:35
  • @GTyler: large for loops don't have good performance characteristics in R due to its interpreted nature. I believe that recent developments have improved things. All the same, you should always look for a version that avoids for loops if possible. – David Heffernan Sep 06 '11 at 16:41
  • @GTyler: read this question and answer to see why `for` loops are slow in this case and how to avoid them: http://stackoverflow.com/q/6502444/602276 – Andrie Sep 06 '11 at 16:42
  • @GTyler: R works fine with millions of data points. But `for` loops are one of the least efficient ways of using R. Most R functions iterate through vectors naturally. Have a look at this article: http://yihui.name/en/2010/10/on-the-gory-loops-in-r/ It explains how the R approach differs from "normal" programming. – dnagirl Sep 06 '11 at 16:44

5 Answers


You just want to remove the NA values. Do it like this:

noise4 <- noise3[!is.na(noise3)]

This will be pretty much instant.

Or as Joshua suggests, a more readable alternative:

noise4 <- na.omit(noise3)
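
One small difference between the two, for what it's worth: `na.omit` records which positions it dropped in an `na.action` attribute, while logical subsetting returns a plain vector. A toy illustration (not your data):

x <- c(1, NA, 3)
x[!is.na(x)]  # 1 3, a plain vector
na.omit(x)    # 1 3, plus an "na.action" attribute recording position 2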

Your code was slow because:

  1. It uses an explicit loop, and loops tend to be slow under the R interpreter.
  2. It grows `noise4` with `c()` on every iteration, which reallocates the vector and copies all of its existing elements each time.

The repeated reallocation is probably the bigger handicap.
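
To see the reallocation cost in isolation, here is a minimal sketch (the function names and the length are mine, purely for illustration; the next answer times the real data):

grow <- function(n) {
  x <- NULL
  for (i in 1:n) x <- c(x, i)  # copies the whole vector on every append
  x
}

prealloc <- function(n) {
  x <- numeric(n)            # allocate once
  for (i in 1:n) x[i] <- i   # write in place
  x
}

system.time(grow(1e5))      # roughly quadratic work: noticeably slow
system.time(prealloc(1e5))  # linear work: fast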

David Heffernan

I wanted to illustrate the benefits of pre-allocation, so I tried to run your code... but I killed it after ~5 minutes. I recommend you use noise4 <- na.omit(noise3) as I said in my comments. This code is solely for illustrative purposes.

# Create some random data
set.seed(21)
noise3 <- rnorm(2418233)
noise3[sample(2418233, 100)] <- NA

noise <- function(noise3) {
  # Pre-allocate the result to its final length
  noise4 <- vector("numeric", sum(!is.na(noise3)))
  j <- 1  # separate index into the result, since NA positions are skipped
  for (i in seq_along(noise3)) {
    if (!is.na(noise3[i])) {
      noise4[j] <- noise3[i]
      j <- j + 1
    }
  }
  noise4
}

system.time(noise(noise3)) # MUCH less than 5+ minutes
#    user  system elapsed 
#    9.50    0.44    9.94 

# Let's see what we gain from compiling
library(compiler)
cnoise <- cmpfun(noise)
system.time(cnoise(noise3))  # a decent reduction
#    user  system elapsed 
#    3.46    0.49    3.96 
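
As an aside (not part of the timing run above): moving the loop to C++, as the comments below suggest, would be faster still. A minimal sketch assuming the Rcpp package is installed; `dropNA` is a made-up name:

library(Rcpp)

cppFunction('
NumericVector dropNA(NumericVector x) {
  std::vector<double> out;
  out.reserve(x.size());
  for (int i = 0; i < x.size(); ++i) {
    if (!R_IsNA(x[i])) out.push_back(x[i]);  // R_IsNA tests for NA_real_
  }
  return wrap(out);
}')

noise4 <- dropNA(noise3)  # same values as na.omit(noise3), minus attributes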
Joshua Ulrich
  • +1 I am starting to feel bad about having uncompiled functions. – Iterator Sep 06 '11 at 16:55
  • @Iterator: don't feel bad. They mostly help "looping" constructs where there are several repeated function calls. You won't see much gain if you're already using vectorized functions. – Joshua Ulrich Sep 06 '11 at 16:57
  • @Iterator: Actually I think the difference between interpreted and compiled shows that the interpreter does a wonderful job. I am impressed. – David Heffernan Sep 06 '11 at 17:05
  • @JoshuaUllrich: So, what you're saying is that these help on functions that are iterated a lot, eh? Guess what my code does. :) (Admittedly, I vectorize a lot.) – Iterator Sep 06 '11 at 17:11
  • @David Heffernan: while the byte code compiler helps, moving the loop to C/C++ using the inline and/or Rcpp packages would provide near instantaneous results. – Joshua Ulrich Sep 06 '11 at 17:31
  • As long as you're being pedagogical, why not show the timing for `na.omit(noise3)` as well? (0.5 secs on my computer) – Ben Bolker Sep 06 '11 at 18:14

The other answers have given you much, much better ways to do the task you actually set out to achieve (removing the NA values in your data), but an answer to the specific question you asked ("how do I know if R is actually working, or if it has instead gotten stuck?") is to introduce some output (`cat`) statements in your loop, as follows:

rpt <- 10000  ## reporting interval
noise4 <- NULL
for (i in 1:length(noise3)) {
    if (i %% rpt == 0) cat(i, "\n")
    if (is.na(noise3[i])) {
        next
    } else {
        noise4 <- c(noise4, noise3[i])
    }
}

If you run this code you can immediately see that it slows down radically as it gets farther into the loop (a consequence of the failure to pre-allocate space) ...
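
Base R also ships a progress bar that does the same job with less noise; a minimal sketch using `utils::txtProgressBar` (reusing `rpt` from above, since updating the bar on every one of 2.4 million iterations has a cost of its own):

pb <- txtProgressBar(min = 0, max = length(noise3), style = 3)
for (i in seq_along(noise3)) {
    if (i %% rpt == 0) setTxtProgressBar(pb, i)
    ## ... loop body as above ...
}
close(pb)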

Ben Bolker

The others have all given correct ways to solve the actual problem, so you needn't worry about speed. @BenBolker also gave a good pointer regarding regular output.

A different thing to note: if you find yourself stuck in a loop, you can interrupt it and inspect the current value of i. Assuming that restarting from that value of i won't harm things, i.e. that processing that index twice won't be a problem, you can resume from there. Or you can just finish the job as the others have described.

A separate trick: if the loop is slow (and can't be vectorized, or you're not eager to break out of it), AND you don't have any reporting, you can still use an external tool to see whether R is actually consuming cycles on your computer. On Linux, the top command is your best bet. On Windows, the Task Manager will do the trick (I prefer the SysInternals / Microsoft program Process Explorer). top also exists on Macs, though I believe there are other more popular tools there.

One other word of advice: if you have a really long loop to run, I strongly encourage saving the results regularly. I typically create a file with a name like myPrefix_YYYYMMDDHHMMSS.rdat. This way everything can go to hell and you can still restart your loop where you left off.
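
A minimal sketch of that pattern (the prefix and the save interval here are mine, purely for illustration):

for (i in seq_along(noise3)) {
    ## ... work ...
    if (i %% 100000 == 0) {
        fname <- sprintf("myPrefix_%s.rdat", format(Sys.time(), "%Y%m%d%H%M%S"))
        save(noise4, i, file = fname)  # checkpoint the partial result and the index
    }
}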

I don't always iterate, but when I do, I use these tricks. Stay speedy, my friend.

Iterator

In one case I faced, updating all of the packages in use under RStudio resolved the issue.
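
If you'd rather do that from the console than from the RStudio menus, one way (this pulls from your configured repositories and can take a while):

update.packages(ask = FALSE, checkBuilt = TRUE)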

Rola