I have a pandas DataFrame that I need to convert to an H2O frame. I use the following code:

Code:

import time

import h2o

# Convert pandas dataframe to H2O frame
start_time = time.time()
input_data_matrix = h2o.H2OFrame(input_df)
logger.debug("3. Time taken to convert H2O Frame- " + str(time.time() - start_time))

Output:

2019-02-05 04:38:55,238 logger DEBUG 3. Time taken to convert H2O Frame- 9320.119945764542

The data frame (i.e. input_df) is 183K rows x 435 columns, with no null or NaN values.

The conversion takes around 2.5 hours (about 9,320 seconds). Is there a better way to perform this operation?

Anwar Shaikh
  • cross link, potentially duplicate but not sure if good answer: https://stackoverflow.com/q/46971969/1240268 – Andy Hayden Feb 05 '19 at 19:11
  • specifically this comment --> https://stackoverflow.com/questions/46971969/conversion-of-pandas-dataframe-to-h2o-frame-efficiently#comment80994581_46971969 – gold_cy Feb 05 '19 at 19:18
  • @aws_apprentice The comment says if your data frame has NaN or missing values. Which I don't have. – Anwar Shaikh Feb 05 '19 at 19:21
  • The question revolves around having `NaN`, but that comment still applies: you have to write out the whole dataframe from Python to Java memory and then ship it to the cloud. The comment suggests reducing that workload by cutting out the `pandas` to `h2o` step, so I do think it applies. – gold_cy Feb 05 '19 at 19:23

1 Answer

  1. Save the pandas data frame to a csv file. (Skip this step if you loaded it from a csv file in the first place, and haven't done any data munging on it, of course.)

  2. Put that csv file somewhere the h2o server can see it. (If you are running client and server on the same machine, this is already the case.)

  3. Use h2o.import_file() (in preference to h2o.upload_file() or h2o.H2OFrame())

h2o.import_file() is the quickest way to get data into H2O, but the file must be visible to the server. When dealing with a remote cluster, this might mean uploading it to that server's file system, or putting it on a web server, an HDFS cluster, AWS S3, etc.

(The reason h2o.upload_file() is slower is that it will do an HTTP POST of the data, from client to server, and h2o.H2OFrame() is slower because it exports your pandas data to a temp csv file, then uses h2o.upload_file(), then deletes the temp file afterwards.)
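The three steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name `pandas_to_h2o_via_csv` and its `csv_path` and `importer` parameters are made up for this example, and it assumes `h2o.init()` (or `h2o.connect()`) has already been called and that the H2O server can read the path the CSV is written to (true when client and server run on the same machine).

```python
import os
import tempfile


def pandas_to_h2o_via_csv(df, csv_path=None, importer=None):
    """Write df to a CSV file, then load it with h2o.import_file().

    Hypothetical helper. Assumes an H2O cluster is already connected and
    that the server can see csv_path (e.g. everything runs on localhost).
    """
    if csv_path is None:
        # No path given: create a temp file and keep it for reuse.
        fd, csv_path = tempfile.mkstemp(suffix=".csv")
        os.close(fd)

    # Step 1: dump the frame, leaving out the pandas index column.
    df.to_csv(csv_path, index=False)

    # Step 3: let the server read and parse the file itself.
    # h2o is imported lazily so it is only required at this point.
    if importer is None:
        import h2o
        importer = h2o.import_file
    return importer(csv_path)
```

With the question's variable this would be called as `input_data_matrix = pandas_to_h2o_via_csv(input_df, "/tmp/input_df.csv")`, where the path is only an example. The `importer` argument is there so a different loader (for instance `h2o.upload_file` when the server cannot see the file) can be swapped in.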

Darren Cook
  • Thanks for the insights about how `h2o.H2OFrame()` works. Wouldn't the I/O operations (i.e. writing to and reading back from disk) be slower? – Anwar Shaikh Feb 11 '19 at 20:30
  • @EngineeredBrain `h2o.H2OFrame()` will be joint-slowest in the best case, i.e. it is a convenience function that does steps 1, 2 and 3. But when you notice it is a bottleneck, you can usually do better: if you are going to use the csv file 2+ times, doing step 1 yourself is free after the first time; if the server is running on localhost, step 2 can be skipped; and if you are running a multi-node cluster, `import_file()` can be multi-threaded. – Darren Cook Feb 11 '19 at 21:37