12

I've noticed that R keeps the index from for loops stored in the global environment, e.g.:

for (ii in 1:5){ }

print(ii)
# [1] 5

Is it common for people to have any need for this index after running the loop?

I never use it, and I'm forced to remember to add `rm(ii)` after every loop I run (first because I'm anal about keeping my namespace clean, and second for memory, because I sometimes loop over lists of data.tables -- in my code right now, I have 357 MB worth of dummy variables wasting space).

Is there an easy way to get around this annoyance? Perfect would be a global option to set (à la `options(keep_for_index = FALSE)`); something like `for(ii in 1:5, keep_index = FALSE)` would be acceptable as well.

MichaelChirico
  • It's not a silly question at all - my initial reaction would be to consider using `lapply` or similar family functions instead. It isn't always possible, but usually preferable for a lot of tasks. – thelatemail Apr 14 '15 at 23:33
  • of course; for example, though, one of my for loops runs for 200 lines. defining a function to do what the loop does just to avoid a for loop seems a bit outlandish – MichaelChirico Apr 14 '15 at 23:35
  • 4
    350 megs of just index variables from loops? how do you explain that? are you programming in r with r language or are you programming in r with c language? – rawr Apr 14 '15 at 23:37
  • I don't know what that second question means. but I'm basically running robustness checks--doing the same statistical analysis on different subsamples--looping over the different datasets seems quite natural to me. – MichaelChirico Apr 14 '15 at 23:42
  • See [here](http://stackoverflow.com/questions/29441469/assignment-via-in-a-for-loop-r-data-table) for a related question of mine covering some related issues and perhaps adding some insight into why I'm looping. – MichaelChirico Apr 14 '15 at 23:46
  • 2
    @MichaelChirico - I can't help but feel if you're working on `dt1` `dt2` `dt3` named `data.tables` that you should be working with a list of `data.table`s and `lapply`ing your function to each part of the list. `:=` will update `data.table`s inside a list by reference as far as I know. – thelatemail Apr 15 '15 at 00:03
  • But I'm not applying a single function. For example, in one loop, I create and save 4 plots. I really don't think it makes sense to define this as a function. Add to this the troubles cited in the question above. – MichaelChirico Apr 15 '15 at 01:31
  • It's still a little surprising (following up on @rawr) that you can have on the order of 10 *million* index variables lying around ... ? – Ben Bolker Apr 15 '15 at 02:34
  • @BenBolker some `for` loops are structured like so: `for(data in list_of_data_tables)`, so that the index `data` is actually a `data.table`. The point about clutter still remains--even if I was running a bunch of simple loops over integers, it would be a pain to have `ii`,`jj`,`kk`,`ll`,`mm`,... laying around (though I usually reuse `ii`) – MichaelChirico Apr 15 '15 at 18:12

2 Answers

7

In order to do what you suggest, R would have to change the scoping rules for for loops. That will likely never happen, because I'm sure there is code out there in packages that relies on the current behavior. You may not use the index after the for loop, but given that a loop can `break` at any time, the final iteration value isn't always known ahead of time, and inspecting the index afterward is the easiest way to find out where the loop stopped. Having this as a global option would likewise break existing code in working packages.

As pointed out, it's far more common to use `sapply` or `lapply` loops in R. Something like

for(i in 1:4) {
   lm(data[, 1] ~ data[, i])
}

becomes

sapply(1:4, function(i) {
   lm(data[, 1] ~ data[, i])
})

You shouldn't be afraid of functions in R. After all, R is a functional language.
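One point worth making explicit: the anonymous function's argument is scoped to that function call, so no index is left behind in the global environment. A minimal check (assuming a fresh session where no `i` is already defined):

```r
# `i` lives only inside the anonymous function's evaluation frame
res <- sapply(1:4, function(i) i^2)
res
# [1]  1  4  9 16
exists("i")
# [1] FALSE -- assuming no `i` was defined beforehand
```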

It's fine to use for loops when you need more control, but then you will have to take care of removing the indexing variable with `rm()` yourself, as you've pointed out. Unless you're using a different indexing variable in each loop, I'm surprised that they are piling up. I'm also surprised that, in your case, they add additional memory if they are data.tables: as far as I know, data.tables don't make deep copies by default, so the only memory "price" you would pay is a simple pointer.
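If you do stay with for loops, one way to avoid the `rm()` bookkeeping (a sketch, not a complete solution for every case) is to wrap the loop in `local()`, which evaluates its body in a throwaway environment:

```r
# local() evaluates its body in a fresh environment, so the loop
# index `ii` (and any temporaries) never reaches the global environment
local({
  for (ii in 1:5) {
    # ... work with ii ...
  }
})
exists("ii")
# [1] FALSE -- provided `ii` wasn't already defined globally
```

The trade-off is that any results you want to keep must be returned as the value of the `local()` call or assigned with `<<-`.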

MrFlick
  • hmm, fair point. I'm basing my memory consumption on `tables()`; not sure if that output gives overlapping or distinct memory usage. – MichaelChirico Apr 15 '15 at 13:21
  • 1
    also, i think you got close to answering my main question, but could you clarify? you're saying the reason R stores the index variable is that `for` loops don't create their own environments like normal functions do, so that the index is initialized and stored in the main environment? – MichaelChirico Apr 15 '15 at 18:10
  • Basically. Scopes are only created when you *define* functions, not when you call them. So the `for` loop isn't that different from any "normal" function. If I did `mean(i <- 1:10)`, the `i` variable would exist after the call to `mean()` even if it didn't exist before the call. – MrFlick Apr 15 '15 at 18:25
  • 1
    So the `for` function is implicitly assigning the index via `in`. Got it! – MichaelChirico Apr 15 '15 at 18:27
3

I agree with the comments above. Even if you have to use a for loop (relying on side effects rather than functions' return values), it would be a good idea to structure your code as several functions and to store your data in lists.

However, there is a way to "hide" the index and all temporary variables inside the loop: call the `for` function in a separate environment:

do.call(`for`, alist(i, 1:3, {
  # ...
  print(i)
  # ... 
}), envir = new.env())

But if you can put your code in a function, there is a more elegant solution:

for_each <- function(x, FUN) {
  for(i in x) {
    FUN(i)
  }
}

for_each(1:3, print)

Note that with a `for_each`-like construct you never even see the index variable.
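If the loop body needs the position as well as the value, a hypothetical variant (the name `for_each_i` is illustrative, not an existing function) can pass both while still keeping the counter inside the helper's environment:

```r
# Hypothetical helper: passes the index and the element to FUN;
# the counter `i` exists only inside for_each_i's evaluation frame
for_each_i <- function(x, FUN) {
  for (i in seq_along(x)) {
    FUN(i, x[[i]])
  }
  invisible(NULL)
}

for_each_i(c("a", "b"), function(i, value) {
  cat(sprintf("%d %s\n", i, value))
})
# 1 a
# 2 b
```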

bergant