
I have a large dataset (~600 GB) stored in HDF5 format. As this is too large to fit in memory, I would like to convert it to Parquet format and use PySpark to perform some basic data preprocessing (normalization, finding correlation matrices, etc.). However, I am unsure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears that the entire dataset is being read into memory.

One thing I thought of was reading the HDF5 file in chunks and writing each chunk incrementally to a Parquet file:

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows//chunksize + 1):
    # convert_to_Parquet() ...

But I can't find any documentation on how to incrementally build up a Parquet file. Any links to further reading would be appreciated.
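For concreteness, here is roughly how the chunked read itself could look (assuming the store was saved in table format under the key 'df', with N standing in for a real chunk size); the incremental Parquet write is the part I can't figure out:

import pandas as pd

test_store = pd.HDFStore('/path/to/myHDFfile.h5', mode='r')
nrows = test_store.get_storer('df').nrows
chunksize = N  # placeholder

for i in range(nrows // chunksize + 1):
    # Pull only rows [i*chunksize, (i+1)*chunksize) into memory
    chunk = test_store.select('df', start=i * chunksize, stop=(i + 1) * chunksize)
    # ... write `chunk` out to Parquet here (this is the part I'm missing)

test_store.close()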

– denfromufa, Eweler

2 Answers


You can use pyarrow for this!

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
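A minimal usage sketch (the paths here are placeholders, and pd.read_hdf only supports chunksize when the data was written with format='table'); once written, the Parquet file can be read lazily by PySpark:

convert_hdf5_to_parquet('/path/to/myHDFfile.h5', '/path/to/output.parquet', chunksize=100000)

# The result can then be loaded in PySpark without reading it all into memory, e.g.:
# df = spark.read.parquet('/path/to/output.parquet')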
– ostrokach

Thanks for your answer. I tried calling the below Python script from the CLI, but it neither shows any error nor do I see a converted Parquet file.

And the h5 files are not empty either.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

h5_file = "C:\Users...\tall.h5"
parquet_file = "C:\Users...\my.parquet"


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
– R S
    The pandas `read_hdf` method expects an hdf5 file containing a single table. For hdf5 files containing multiple tables in a custom hierarchy, you need to write custom code to extract each of the tables. Two Python packages that may be useful for this are [h5py](https://github.com/h5py/h5py) and [PyTables](https://github.com/PyTables/PyTables). Also, this should probably be a new question rather than an answer to an existing question. – ostrokach Aug 07 '20 at 13:22
  • Thanks for your reply! But my HDF5 file contains an arbitrary number of tables, and I don't want to list all the tables explicitly in my code; "N" changes for each file. When I tried with h5py, I believe it is mandatory to specify the names. Please suggest otherwise. Let me also start a new thread for the same. – R S Aug 09 '20 at 02:07
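For what it's worth, a minimal sketch of handling an unknown number of pandas-written tables without hard-coding their names, by iterating over HDFStore.keys(); the function name and output layout are illustrative only, and the chunked select() again assumes the tables were stored in table format:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_all_tables(h5_file, out_dir, chunksize=100000):
    # Iterate over every table in the store without hard-coding names
    with pd.HDFStore(h5_file, mode='r') as store:
        for key in store.keys():  # e.g. '/table1', '/group/table2'
            parquet_file = "{}/{}.parquet".format(out_dir, key.strip('/').replace('/', '_'))
            parquet_writer = None
            # Chunked select() requires the table to be stored in table format;
            # fixed-format objects would need store.get(key) and enough memory.
            for i, chunk in enumerate(store.select(key, chunksize=chunksize)):
                if i == 0:
                    # Infer schema and open the Parquet writer on the first chunk
                    parquet_schema = pa.Table.from_pandas(df=chunk).schema
                    parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
                table = pa.Table.from_pandas(chunk, schema=parquet_schema)
                parquet_writer.write_table(table)
            if parquet_writer is not None:
                parquet_writer.close()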