
I have 51 massive HDF5 tables, each with enough (well behaved) data that I cannot load even one of them completely into memory. To make life easier for the rest of my team I need to transfer this data into a PostgreSQL database (and delete the HDF5 tables). However, this is easier said than done, mainly because of these hurdles:

  1. pandas.read_hdf() still has a wonky chunksize kwarg: SO Question; Open github issue
  2. pandas.DataFrame.to_sql() is monumentally slow and inefficient: Open github issue (see my post at the bottom of the issue page)
  3. PostgreSQL does not have a native or third party data wrapper to deal with HDF5: PostgreSQL wiki article
  4. HDF5 ODBC driver is still nascent: HDF5 ODBC blog

Basically, going from HDF5 -> Pandas -> PostgreSQL would require surmounting hurdles 1 and 2 with extensive monkey patching, and there seems to be no way to go from HDF5 to PostgreSQL directly. Unless I am missing something.
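
For reference, the naive chunked route I would otherwise be using looks something like this (the key, table name and connection string below are just placeholders), and it is exactly where hurdles 1 and 2 bite:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/mydb')  # placeholder DSN
# hurdle 1: chunksize on read_hdf is unreliable; hurdle 2: to_sql is very slow
for chunk in pd.read_hdf('input.hdf5', key='table', chunksize=100000):
    chunk.to_sql('mytable', engine, if_exists='append', index=False)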

Perhaps one of you fine users can point out something I am missing, share some patchwork you created to surmount a similar issue, or offer any other suggestions or advice...

Kartik
  • May I ask if the reason that you want to move from HDF5 to PostgreSQL is that HDF5 doesn't support the complex queries that PostgreSQL supports? From my experience, HDF5 is much faster at simple queries (for example, querying data for a time range in time-series data) than PostgreSQL. – Tony Jan 17 '19 at 09:15
  • No. PostgreSQL was the storage platform of the company I was working for at that time. They had tools built around reading from PostgreSQL, and didn't want to retool their process flows. I was using HDF5 for the sole reason of speed. But I was required to transfer the data to PostgreSQL. – Kartik Jan 17 '19 at 10:02

1 Answer


You could convert to CSV with something like the following:

import csv
import h5py

with h5py.File('input.hdf5', 'r') as hdf5file:
    # newline='' avoids blank lines on Windows when using csv.writer
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        # iterate the table row by row so nothing is held in memory
        for row in hdf5file['__data__']['table']:
            writer.writerow(row)
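
If writing row by row turns out to be too slow, a variation on the same idea (just a sketch, assuming the same '__data__'/'table' layout) is to read the dataset in slices and write each slice in one call:

import csv
import h5py

chunksize = 100000  # tune to what fits comfortably in memory
with h5py.File('input.hdf5', 'r') as hdf5file:
    table = hdf5file['__data__']['table']
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        for start in range(0, table.shape[0], chunksize):
            # pull one contiguous slice off disk, then write it in bulk
            writer.writerows(table[start:start + chunksize])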

And then import into postgres with psql:

create table mytable (col1 bigint, col2 float, col3 float);
\copy mytable from 'output.csv' CSV
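
The same \copy can also be run non-interactively, which is handy if you want to script it per file, e.g. psql -d mydb -c "\copy mytable from 'output.csv' CSV".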

Depending on the complexity of your data, you could probably do something clever to get the schema out of the hdf5 file and use that to make the CREATE TABLE statement.
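
For example, here is a rough sketch that builds the statement from the table's compound dtype (again assuming the '__data__'/'table' layout, and that every column is an integer or a float; the type mapping is illustrative, not exhaustive):

import h5py

# illustrative mapping from numpy dtype kinds to PostgreSQL column types
PG_TYPES = {'i': 'bigint', 'u': 'bigint', 'f': 'double precision'}

with h5py.File('input.hdf5', 'r') as hdf5file:
    dtype = hdf5file['__data__']['table'].dtype
    columns = ', '.join(
        '{} {}'.format(name, PG_TYPES[dtype[name].kind])
        for name in dtype.names
    )
    print('CREATE TABLE mytable ({});'.format(columns))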

Alternatively, you could try writing your own INSERT statements in your Python script; this will probably be slower than using COPY, but could be a simpler solution:

import h5py
import psycopg2
from itertools import islice

with h5py.File('input.hdf5', 'r') as hdf5file:
    with psycopg2.connect("dbname=mydb user=postgres") as conn:
        cur = conn.cursor()
        chunksize = 50
        t = iter(hdf5file['__data__']['table'])
        rows = list(islice(t, chunksize))
        while rows:
            # one multi-row INSERT per chunk; mogrify handles the quoting,
            # and .item() turns numpy scalars into plain Python values
            placeholders = '(' + ','.join(['%s'] * len(rows[0])) + ')'
            values = ','.join(
                cur.mogrify(placeholders, tuple(v.item() for v in row)).decode()
                for row in rows
            )
            cur.execute("INSERT INTO mytable VALUES " + values)
            rows = list(islice(t, chunksize))
        conn.commit()
maxymoo
  • This is an alternative that I had not considered. Thank you for pointing this out. However, on a hdf file with roughly 11M rows and 200 columns, don't you think the loop copy will take an astronomical amount of time? I can easily get the file schema and create the create table statement. Also all files have the same structure, so that needs to be done only once (and I can do it manually, if push comes to shove). Still, it is the write row-by-row that is making me not accept this answer... – Kartik Aug 17 '16 at 00:07
  • try it out, i think the write should be very fast, and looping through the hdf5 table like this will use an iterator so it won't use memory. – maxymoo Aug 17 '16 at 00:17
  • just ran a test, on my laptop writing 200 float columns to csv is around 3000 rows/sec – maxymoo Aug 17 '16 at 00:28
  • I will. Test it out over a range of files. Thank you. – Kartik Aug 17 '16 at 00:30
  • I've added another idea that you could try as well – maxymoo Aug 17 '16 at 00:38
  • In your alternative, is there a way to read a bunch of rows together to do a batch insert? I can read 100k lines from each of 51 files (5.1M rows total) comfortably before maxing out the memory. I've never used h5py, sorry if this is a very basic question. – Kartik Aug 17 '16 at 00:45
  • i've edited my answer to group the inserts, it will still be much slower than using `\copy` though – maxymoo Aug 17 '16 at 02:25
  • Took about 12 hours using your first method on 12 parallel processes. (Note I had to write each HDF file to its own table, so despite the disk IO bottleneck, reads and writes did not conflict with each other.) Thanks a bunch for saving me time. I was speed and memory testing a few other alternatives I had. Yours pretty much turned out to be simple and fast. I even tried to construct the SQL query in Python and pipe it to `psql`, the program that loads table backups stored as SQL files; that got a bit messy... – Kartik Aug 18 '16 at 19:02