12

I have 12 data.frames to work with. They are similar and I have to do the same processing to each one, so I wrote a function that takes a data.frame, processes it, and then returns a data.frame. This works. But I am afraid that I am passing around a very big structure. I may be making temporary copies (am I?) This can't be efficient. What is the best way to avoid passing a data.frame around?

doSomething <- function(df) {
  // do something with the data frame, df
  return(df)
}
Cath
  • 23,906
  • 5
  • 52
  • 86
Chang Chung
  • 2,307
  • 1
  • 17
  • 16

3 Answers3

10

You are, indeed, passing the object around and using some memory. But I don't think you can do an operation on an object in R without passing the object around. Even if you didn't create a function and did your operations outside of the function, R would behave basically the same.

The best way to see this is to set up an example. If you are in Windows open Windows Task Manager. If you are in Linux open a terminal window and run the top command. I'm going to assume Windows in this example. In R run the following:

col1<-rnorm(1000000,0,1)
col2<-rnorm(1000000,1,2)
myframe<-data.frame(col1,col2)

rm(col1)
rm(col2)
gc()

this creates a couple of vectors called col1 and col2 then combines them into a data frame called myframe. It then drops the vectors and forces garbage collection to run. Watch in your windows task manager at the mem usage for the Rgui.exe task. When I start R it uses about 19 meg of mem. After I run the above commands my machine is using just under 35 meg for R.

Now try this:

myframe<-myframe+1

your memory usage for R should jump to over 144 meg. If you force garbage collection using gc() you will see it drop back to around 35 meg. To try this using a function, you can do the following:

doSomething <- function(df) {
    df<-df+1-1
return(df)
}
myframe<-doSomething(myframe)

when you run the code above, memory usage will jump up to 160 meg or so. Running gc() will drop it back to 35 meg.

So what to make of all this? Well, doing an operation outside of a function is not that much more efficient (in terms of memory) than doing it in a function. Garbage collection cleans things up real nice. Should you force gc() to run? Probably not as it will run automatically as needed, I just ran it above to show how it impacts memory usage.

I hope that helps!

JD Long
  • 59,675
  • 58
  • 202
  • 294
9

I'm no R expert, but most languages use a reference counting scheme for big objects. A copy of the object data will not be made until you modify the copy of the object. If your functions only read the data (i.e. for analysis) then no copy should be made.

0

I came across this question looking for something else, and it's old - so I'll just provide a brief answer for now (leave a comment if you'd like more explanation).

You can pass around environments in R which contain anywhere from 1 to all of your variables. But probably you don't need to worry about it.

[You might also be able to do something similar with classes. I only currently understand how to use classes for polymorphic functions - and note there's more than 1 class system kicking around.]

Dav Clark
  • 1,430
  • 1
  • 13
  • 26