
I am currently reading fairly large data sets into Spark for parsing (a single data frame is over 1 million rows). To make effective use of the h2o.gbm() model, I am concatenating multiple data frames into one larger training set. When I run the following code:

       training2 <- as_h2o_frame(sc, Training, strict_version_check = FALSE)
       Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
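
For reference, here is roughly how the combined training set is built before the conversion (a minimal sketch; the connection setup, file paths, and `df_part1`/`df_part2` are placeholders, and `sdf_bind_rows()` is sparklyr's row-wise union):

       library(sparklyr)
       library(rsparkling)

       sc <- spark_connect(master = "local")

       # each part is a Spark DataFrame with the same schema
       df_part1 <- spark_read_csv(sc, "part1", "data/part1.csv")
       df_part2 <- spark_read_csv(sc, "part2", "data/part2.csv")

       # union the parts into one training frame (this stays inside Spark)
       Training <- sdf_bind_rows(df_part1, df_part2)

       # the conversion to an H2O frame is where the OOM occurs
       training2 <- as_h2o_frame(sc, Training, strict_version_check = FALSE)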

I have tried to give Java more memory by running the following command:

       options(java.parameters = "-Xmx100G")

I am currently running a 32-core VM with 460 GB of memory, with Spark 2.0.2, rsparkling 2.0.10, and H2O 3.10.5.1. The issue does go away when I run the same code on smaller data sets. Any ideas or insight into this issue would be greatly appreciated.
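
As far as I can tell, `options(java.parameters = ...)` only affects JVMs started through rJava inside the R session, not the JVM that sparklyr launches for Spark (which is where Sparkling Water/H2O runs). A minimal sketch of raising the Spark JVM memory through `spark_config()` instead (the sizes below are guesses for this machine, not tested values):

       library(sparklyr)

       conf <- spark_config()
       # in local mode the driver JVM holds all the data, so it needs most of the RAM
       conf$`sparklyr.shell.driver-memory` <- "200g"
       # on a cluster, these are the usual knobs instead
       conf$spark.driver.memory   <- "200g"
       conf$spark.executor.memory <- "32g"

       sc <- spark_connect(master = "local", config = conf)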

MMitch
  • Your only option in that case is to keep less data in memory at once. I don't know whether it's possible here, but you'd have to use something like streams to keep the data below the maximum heap size. Also check that you are using a 64-bit Java so you can use more than 2 GB of RAM. – Christian Jul 20 '17 at 14:37
  • Note one *possible* solution you have not tried is to disable the GC Overhead check; see the linked Q&A and the sketch after these comments. But beware that if the real reason for the problem is that `-Xmx100G` is not enough, then disabling the GC Overhead check will result in your application taking a very long time to die. – Stephen C Jul 20 '17 at 14:51
  • @Christian - if he wasn't using a 64-bit JVM then `-Xmx100G` would fail on startup. – Stephen C Jul 20 '17 at 14:55
  • Try writing the dataset out to a file and parsing it again with h2o.importFile(); a sketch of that route follows these comments. You didn't say how many columns you have (which matters a lot), but in general 1M rows isn't considered large for H2O. – TomKraljevic Jul 26 '17 at 06:05
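
Regarding the GC overhead check mentioned above: a sketch of passing that HotSpot flag to the Spark driver JVM through sparklyr (assuming local mode; `sparklyr.shell.driver-java-options` maps to spark-submit's `--driver-java-options`):

       conf <- spark_config()
       # disable the "GC overhead limit exceeded" check; the OOM may then surface
       # elsewhere, or the JVM may simply grind along very slowly before dying
       conf$`sparklyr.shell.driver-java-options` <- "-XX:-UseGCOverheadLimit"
       sc <- spark_connect(master = "local", config = conf)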
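
And a sketch of the file-based route suggested above (the path is a placeholder; this assumes an H2O cluster is already running via rsparkling's `h2o_context(sc)`):

       library(h2o)

       # write the combined Spark frame to disk as CSV part files
       spark_write_csv(Training, path = "/tmp/training_csv", header = TRUE)

       # let H2O parse the files directly, skipping the in-memory
       # Spark -> H2O conversion that triggers the GC overhead error
       training2 <- h2o.importFile("/tmp/training_csv")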

0 Answers