
I pass a data.frame as a parameter to a function that needs to alter the data inside:

x <- data.frame(value=c(1,2,3,4))
f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      d$value[i] <- 0
    }
  }
  print(d)
}

When I execute f(x) I can see how the data.frame inside gets modified:

> f(x)
  value
1     1
2     0
3     3
4     0

However, the original data.frame I passed is unmodified:

> x
  value
1     1
2     2
3     3
4     4

Usually I have overcome this by returning the modified one:

f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      d$value[i] <- 0
    }
  }
  d
}

And then calling the function and reassigning the result:

> x <- f(x)
> x
  value
1     1
2     0
3     3
4     0

However, I wonder what the effect of this behaviour is on a very large data.frame: is a new one created for the function execution? What is the R-ish way of doing this?

Is there a way to modify the original one without creating another one in memory?

vtortola
  • R by design copies vs modifies. Use larger scoped objects (using, perhaps, non-standard evaluation to just get the name of the object) and/or give `data.table` a whirl (it has many in-place idioms and was crafted for larger data sets). – hrbrmstr Oct 17 '15 at 12:36
  • the r-ish way is not to use loops. what is "large" to you – rawr Oct 17 '15 at 12:36
  • You might be interested in the [data.table package](https://github.com/Rdatatable/data.table/wiki/Getting-started). – Arun Oct 17 '15 at 21:32

1 Answer


Actually, in R (almost) every modification is performed on a copy of the previous data (copy-on-modify behavior).
So, for example, inside your function, when you do d$value[i] <- 0, some copies are actually created. You usually won't notice because this is well optimized, but you can trace it with the tracemem function.
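As a rough illustration of tracing those copies, here is a minimal sketch. The helper name zero_second is made up for this example, and tracemem only works when R was built with memory profiling, hence the capabilities("profmem") guard. How many copies get reported depends on the R version.

```r
x <- data.frame(value = c(1, 2, 3, 4))

# Hypothetical helper: same idea as f() in the question, reduced to one assignment.
zero_second <- function(d) {
  d$value[2] <- 0   # modifying d triggers a copy; the caller's x is left alone
  d
}

# tracemem() is only available when R was built with memory profiling.
if (capabilities("profmem")) {
  tracemem(x)           # mark x so that duplications are reported
  y <- zero_second(x)   # a "tracemem[...]" line is printed for each copy made
  untracemem(x)
} else {
  y <- zero_second(x)
}

x$value   # unchanged original
y$value   # second element set to 0
```

If no tracemem line shows up for an operation, R managed to avoid duplicating the traced object there.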

That being said, if your data.frame is not really big, you can stick with your function returning the modified object, since it's just one more copy after all.

But if your dataset is really big and making a copy every time would be expensive, you can use data.table, which allows in-place modifications, e.g.:

library(data.table)
d <- data.table(value=c(1,2,3,4))
f <- function(d){
  for(i in 1:nrow(d)) {
    if(d$value[i] %% 2 == 0){
      set(d,i,1L,0) # special function of data.table (see also ?`:=` )
    }
  }
  print(d)
}

f(d)
print(d)

# results :
> f(d)
   value
1:     1
2:     0
3:     3
4:     0
> 
> print(d)
   value
1:     1
2:     0
3:     3
4:     0

N.B.

In this specific case, the loop can be replaced with a "vectorized" and more efficient version, e.g.:

d[d$value %% 2 == 0,'value'] <- 0

but maybe your real loop code is much more convoluted and cannot be vectorized easily.
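For completeness, the vectorized update can also be done by reference with data.table's `:=` operator, so the function does not need to return anything for the caller to see the change. This is only a sketch: the helper name zero_evens is made up for illustration, and it assumes the data.table package is installed.

```r
library(data.table)  # assumes the data.table package is installed

# Hypothetical helper: zero out the even values in place, without a loop.
zero_evens <- function(dt) {
  # `:=` updates the column by reference for the rows selected in `i`
  dt[value %% 2 == 0, value := 0]
  invisible(dt)
}

d <- data.table(value = c(1, 2, 3, 4))
zero_evens(d)     # no reassignment needed
print(d$value)    # 1 0 3 0
```

Because the update happens by reference, every other variable pointing at the same data.table sees the change too; use copy(d) first if the original has to be preserved.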

digEmAll
  • R does some neat tricks to get around copying though. For instance, if your function only adds/removes columns then doing `df = f(df)` probably won’t copy the whole data frame. – Konrad Rudolph Oct 17 '15 at 12:57
  • Yes, that's true in fact it's better to test if actually copies are generated before changing the code. But I traced the loop code and it seems a copy is created every time a record is set to zero. So, if the data.frame is really big I guess data.table ability to modify objects in-place is the way to go... – digEmAll Oct 17 '15 at 13:02
  • Actually in this particular case the *real* solution is not to use a loop at all. That’s terribly inefficient code to begin with. – Konrad Rudolph Oct 17 '15 at 13:07
  • @digEmAll there's no real way now to know if the copy is *shallow* or *deep* though :-(. `df = data.frame(x=1:2, y=3:4); tracemem(df); df$y = 5:6`, but it's being shallow copied here. – Arun Oct 17 '15 at 21:26
  • @Arun: yes, I don't think everytime a full deep copy is performed (it wouldn't be very efficient otherwise). I guess in your example just a new copy of the modified column is created, am I right ? Still, it's a pity we can't know exactly what is copied and what's not... – digEmAll Oct 18 '15 at 08:25