I am working with a very large dataset and I would like to keep the data in H2O as much as possible without bringing it into R.

I noticed that whenever I pass an H2O Frame to a function, any modification I make to the Frame inside the function is not reflected outside of it. Is there a way to pass the Frame by reference?

If not, what's the best way to modify the original Frame inside a function without copying the whole Frame?

Another related question: does passing a Frame to other functions (read-only) make extra copies on the H2O side? My datasets are 30GB - 100GB, so I want to make sure that passing them around does not cause memory issues.

library(h2o)
h2o.init()

# Attempt to overwrite column x inside a function
mod = function(fdx) {
  fdx[,"x"] = -1
}

d = data.frame(x = rnorm(100),y=rnorm(100))
dx = as.h2o(d)
dx[1,]
mod(dx)
dx[1,]  # does not change the original value of x


 > dx[1,]
           x         y
 1 0.3114706 0.9523058

 > dx[1,]
           x         y
 1 0.3114706 0.9523058

Thanks!

Ecognium
  • `data.table` has a similar by-reference mechanism, but I am not sure it applies in your case. You can take a look [here](http://stackoverflow.com/questions/10225098/understanding-exactly-when-a-data-table-is-a-reference-to-vs-a-copy-of-another). – Patric Jan 10 '16 at 00:38

1 Answer

H2O does a classic copy-on-write optimization. Thus:

  • No true copy is made unless you mutate the dataset.
  • Only changed or added columns are actually copied; all other columns are shared by reference.
  • Frames in R are pass-by-value, which H2O mimics.
  • Frames in Python are pass-by-reference, which H2O mimics.

In short, do as you would in R, and you're fine.

Passing a Frame to other functions for read-only use makes no extra copies.
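Since the R binding follows pass-by-value semantics, one way to get the change back into the caller is to return the modified Frame and reassign it. The following is a minimal sketch, assuming a local H2O cluster is reachable through `h2o.init()`; it reuses the `mod`, `d`, and `dx` names from the question. Per the copy-on-write behaviour described above, only column `x` should need to be materialized anew.

```r
library(h2o)
h2o.init()  # assumption: a local H2O cluster can be started or reached

# Return the modified Frame instead of relying on in-place mutation.
mod <- function(fdx) {
  fdx[, "x"] <- -1   # modifies the function's local view of the Frame
  fdx                # hand the updated Frame back to the caller
}

d  <- data.frame(x = rnorm(100), y = rnorm(100))
dx <- as.h2o(d)

dx <- mod(dx)  # rebind the caller's name to the updated Frame
dx[1, ]        # x is now -1 in the first row
```

Reassigning `dx` keeps the data on the H2O side throughout; nothing is pulled into R except the single row printed at the end.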

Jonnus
  • Thanks, @Cliff. To confirm: there is no `data.table`-esque way of mutating the original table, but reads are fine. Right now, to avoid the local copy, I am returning vectors to the caller, which assigns them to the H2O Frame where it is in scope. Another thing I am unsure about is what happens when you do this in the main scope: `dx = dx[dx$col > 5,]`. Are both Frames in memory on the H2O side? – Ecognium Jan 12 '16 at 17:59
  • H2O does the update-in-place optimization on the H2O side in some cases, and generally will rapidly recycle Big Temps in any case. For the row-selector case, yes, both Frames are in memory briefly. At the next R GC the old copy of 'dx' will be reclaimed. – Cliff Click Jan 19 '16 at 22:27