
I have a large dataset (~600 GB) stored in HDF5 format. As this is too large to fit in memory, I would like to convert it to Parquet format and use PySpark to perform some basic data preprocessing (normalization, finding correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.

One thing I thought of was reading the HDF5 file in chunks and writing each chunk incrementally to a Parquet file:

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows//chunksize + 1):
    # convert_to_Parquet() ...

But I can't find any documentation on how to incrementally build up a Parquet file. Any links to further reading would be appreciated.
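For concreteness, here is roughly how the chunked read itself could look (assuming the store was saved in table format under the key 'df', with N standing in for a real chunk size); the incremental Parquet write is the part I can't figure out:

import pandas as pd

test_store = pd.HDFStore('/path/to/myHDFfile.h5', mode='r')
nrows = test_store.get_storer('df').nrows
chunksize = N  # placeholder

for i in range(nrows // chunksize + 1):
    # Pull only rows [i*chunksize, (i+1)*chunksize) into memory
    chunk = test_store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    # ... write `chunk` out to Parquet here (this is the part I'm missing)

test_store.close()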

– denfromufa, Eweler

2 Answers


You can use pyarrow for this!

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
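A minimal usage sketch (the paths here are placeholders, and pd.read_hdf only supports chunksize when the data was written with format='table'); once written, the Parquet file can be read lazily by PySpark:

convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/output.parquet', chunksize=100000)

# The result can then be loaded in PySpark without reading it all into memory, e.g.:
# df = spark.read.parquet('/path/to/output.parquet')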
– ostrokach

Thanks for your answer. I tried calling the below Python script from the CLI, but it neither shows any error nor do I see a converted Parquet file.

And the h5 files are not empty either.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

h5_file = "C:\Users...\tall.h5"
parquet_file = "C:\Users...\my.parquet"


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
– R S
    The pandas `read_hdf` method expects an hdf5 file containing a single table. For hdf5 files containing multiple tables in a custom hierarchy, you need to write custom code to extract each of the tables. Two Python packages that may be useful for this are [h5py](https://github.com/h5py/h5py) and [PyTables](https://github.com/PyTables/PyTables). Also, this should probably be a new question rather than an answer to an existing question. – ostrokach Aug 07 '20 at 13:22
  • Thanks for your reply! But my HDF5 file contains an arbitrary number of tables, and I don't want to list all the tables explicitly in my code; "N" changes for each file. When I tried with h5py, I believe it is mandatory to specify the names. Please suggest otherwise. Let me also start a new thread for the same. – R S Aug 09 '20 at 02:07
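For what it's worth, a minimal sketch of handling an unknown number of pandas-written tables without hard-coding their names, by iterating over HDFStore.keys(); the function name and output layout are illustrative only, and the chunked select() again assumes the tables were stored in table format:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_all_tables(h5_file, out_dir, chunksize=100000):
    # Iterate over every table in the store without hard-coding names
    with pd.HDFStore(h5_file, mode='r') as store:
        for key in store.keys():  # e.g. '/table1', '/group/table2'
            parquet_file = "{}/{}.parquet".format(out_dir, key.strip('/').replace('/', '_'))
            parquet_writer = None
            # Chunked select() requires the table to be stored in table format;
            # fixed-format objects would need store.get(key) and enough memory.
            for i, chunk in enumerate(store.select(key, chunksize=chunksize)):
                if i == 0:
                    # Infer schema and open the Parquet writer on the first chunk
                    parquet_schema = pa.Table.from_pandas(df=chunk).schema
                    parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
                table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                parquet_writer.write_table(table)
            if parquet_writer is not None:
                parquet_writer.close()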