
While loading my dataset with Python code in Spyder on an AWS server, I get the following error:

  File "<ipython-input-19-7b2e7b5812b3>", line 1, in <module>
    ffemq12 = load_h2odataframe_returns(femq12) #; ffemq12 = add_fold_column(ffemq12)

  File "D:\Ashwin\do\init_sm.py", line 106, in load_h2odataframe_returns
    fr=h2o.H2OFrame(python_obj=returns)

  File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 106, in __init__
    column_names, column_types, na_strings, skipped_columns)

  File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 147, in _upload_python_object
    self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings, skipped_columns)

  File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 321, in _upload_parse
    ret = h2o.api("POST /3/PostFile", filename=path)

  File "C:\Program Files\Anaconda2\lib\site-packages\h2o\h2o.py", line 104, in api
    return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)

  File "C:\Program Files\Anaconda2\lib\site-packages\h2o\backend\connection.py", line 415, in request
    raise H2OConnectionError("Unexpected HTTP error: %s" % e)

I am running this Python code in Spyder on the AWS server. The code works fine with up to half the dataset (1.5 GB of the 3 GB) but throws this error if I increase the data size. I tried increasing the RAM from 61 GB to 122 GB, but it still gives the same error.

Loading the data file

femq12 = pd.read_csv(r"H:\Ashwin\dta\datafile.csv")    
ffemq12 = load_h2odataframe_returns(femq12)

Initializing h2o

h2o.init(nthreads = -1,max_mem_size="150G")

Loading h2o

Connecting to H2O server at http://127.0.0.1:54321... successful.

H2O cluster uptime:         01 secs
H2O cluster timezone:       UTC
H2O data parsing timezone:  UTC
H2O cluster version:        3.22.1.3
H2O cluster version age:    18 days
H2O cluster total nodes:    1
H2O cluster free memory:    133.3 Gb
H2O cluster total cores:    16
H2O cluster allowed cores:  16
H2O cluster status:         accepting new members, healthy
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         Algos, AutoML, Core V3, Core V4
Python version:             2.7.15 final


I suspect it is a memory issue, but even after increasing the RAM and max_mem_size, the dataset still fails to load.

Any ideas to fix the error would be appreciated. Thank you.

  • can you verify that AWS isn't enforcing a limit to the dataset size you can read into your cluster? thanks! – Lauren Feb 13 '19 at 22:25

1 Answer


Solution: Don't use pd.read_csv() and h2o.H2OFrame(); instead, use h2o.import_file() directly.
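
For example, a minimal sketch reusing the path from your question (this assumes the CSV sits on a disk the H2O server process can see, which is the case here since H2O runs on the same machine):

import h2o

h2o.init(nthreads=-1, max_mem_size="150G")

# Let the H2O cluster read and parse the CSV itself, instead of going
# through pandas, a temp file, and an HTTP upload to the server.
ffemq12 = h2o.import_file(r"H:\Ashwin\dta\datafile.csv")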

The error message is on the POST /3/PostFile REST command, which, as far as I can tell from your code and log snippets, means it is uploading the data over HTTP to localhost. That is horribly inefficient.

(If it is not localhost, i.e. your datafile.csv is on your own computer, outside of AWS, then upload it to S3 first. If you are doing some data munging on your computer, do that, save the result as a new file, and upload that to S3. It doesn't have to be S3: it could be the server's hard disk if you only have a single machine in your H2O cluster.)
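
A rough sketch of the S3 route, assuming boto3 on your local machine and a made-up bucket name (my-h2o-data is a placeholder; the H2O cluster also needs AWS credentials configured before it can read from S3):

import boto3
import h2o

# Upload the (already munged) CSV from the local machine to S3.
# "my-h2o-data" is a placeholder bucket name.
boto3.client("s3").upload_file("datafile.csv", "my-h2o-data", "datafile.csv")

# Then have the H2O cluster pull it straight from S3.
h2o.init(nthreads=-1, max_mem_size="150G")
ffemq12 = h2o.import_file("s3://my-h2o-data/datafile.csv")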

For some background see also my recent answers at https://stackoverflow.com/a/54568511/841830 and https://stackoverflow.com/a/54459577/841830. (I've not marked this as a duplicate because, though the advice is the same in each case, the reason is a bit different; here I wonder if you are hitting a limit on the maximum HTTP POST file size, perhaps at 2 GB? I suppose it could also be running out of disk space from all the temp copies being made.)

Darren Cook