I have the following questions about HDF5 performance and concurrency:

  1. Does HDF5 support concurrent write access?
  2. Concurrency considerations aside, how is HDF5 in terms of I/O performance (and do compression rates affect it)?
  3. Since I use HDF5 with Python, how does its performance compare to Sqlite?

asked by Amelio Vazquez-Reina (edited by Brad Solomon)
    I think this question is probably not constructive rather than off topic, but I can't think of how it should be phrased to be on topic (I tried and gave up). Will happily vote to reopen and upvote if tweaked. – Andy Hayden May 19 '13 at 11:04

2 Answers

Updated to use pandas 0.13.1

  1. No. HDF5 does not support concurrent write access; see the pandas caveats at http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to work around this, e.g. have your different threads/processes write out their computation results, then have a single process combine them.

  2. Depending on the type of data you store, how you store it, and how you want to retrieve it, HDF5 can offer vastly better performance. Float data stored in an HDFStore as a single array, compressed (in other words, not stored in a format that allows querying), will be written and read amazingly fast. Even the table format (which slows down write performance) offers quite good write speed. You can look at http://www.pytables.org/ for some detailed comparisons (PyTables is what HDFStore uses under the hood).

(The benchmark plot that originally appeared here was from the PyTables project and compared query performance against PostgreSQL.)

Since PyTables 2.3 the queries are now indexed, so performance is actually MUCH better than this.

To answer your question, if you want any kind of performance, HDF5 is the way to go.
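One way to work around the lack of concurrent writes (point 1 above) is the compute-then-combine pattern: let workers produce results in parallel, then have a single writer persist them. A minimal sketch using threads, where a plain list stands in for `HDFStore.append` and `compute_chunk` is a hypothetical stand-in for your real computation:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_chunk(chunk_id):
    # stand-in for the real per-chunk computation; returns (chunk_id, rows)
    return chunk_id, [chunk_id * 10 + i for i in range(3)]

# fan the computation out across workers...
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(compute_chunk, range(5)))

# ...then let a single writer combine the pieces in a deterministic order
combined = []
for chunk_id, rows in sorted(results):
    combined.extend(rows)  # in real code: store.append('results', rows)
```

The same shape works with processes instead of threads; the key point is that exactly one writer ever touches the HDF5 file.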

Writing:

In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop

Reading:

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop

And here's the code (pandas 0.13.1-era API; the `sql.write_frame`/`sql.read_frame` helpers were later replaced by `DataFrame.to_sql`/`pd.read_sql`):

import sqlite3
import os
import pandas as pd
from pandas import DataFrame
from numpy.random import randn
from pandas.io import sql

In [3]: df = DataFrame(randn(1000000, 2), columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null values
B    1000000 non-null values
dtypes: float64(2)
   
def test_sql_write(df):
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()
    
def test_hdf_fixed_write(df):
    df.to_hdf('test_fixed.hdf','test',mode='w')

def test_csv_read():
    pd.read_csv('test.csv',index_col=0)

def test_csv_write(df):
    df.to_csv('test.csv',mode='w')    

def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf','test')

def test_hdf_table_write(df):
    df.to_hdf('test_table.hdf','test',format='table',mode='w')

def test_hdf_table_read():
    pd.read_hdf('test_table.hdf','test')

Of course YMMV.
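One caveat about the methodology: `test_sql_write`/`test_sql_read` above open and close the sqlite connection (and check/delete the file) inside the timed function, which inflates the sqlite numbers for workloads that keep a connection alive. A sketch of a fairer read benchmark that reuses one connection, using an in-memory database as a stand-in for `test.sql`:

```python
import sqlite3

# reuse one connection so only the query itself would be timed
conn = sqlite3.connect(':memory:')  # in-memory stand-in for test.sql
conn.execute('CREATE TABLE test_table (A REAL, B REAL)')
conn.executemany('INSERT INTO test_table VALUES (?, ?)',
                 [(float(i), float(i) * 2) for i in range(1000)])
conn.commit()

def read_all(conn):
    # the connection is passed in, not opened per call
    return conn.execute('SELECT * FROM test_table').fetchall()

rows = read_all(conn)
```

Whether this matters depends on your use case: a long-lived process would reuse the connection, while a request-per-call server might genuinely reconnect each time.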

answered by Jeff (edited by lane)
  • Which RDBMS are you using? The plot indicates PostgreSQL, but the code says sqlite3. – Aryeh Leib Taurog Apr 08 '14 at 17:18
  • the timings were from a local ``sqlite`` file-based database (included in python). The plot was originally done by the ``PyTables`` guys and I think used ``postgresql`` (that was a REAL study, very detailed). – Jeff Apr 08 '14 at 17:23
  • 3
    Those sqlite timings are wrong: opening and closing the connection is typically not done at each query and should not be part of the measured time. (Not to mention the system calls to check the presence and delete the previous sqlite file). It may be `timeit`-friendly that way, but it is incorrect. – P-Gn Jan 22 '18 at 13:52
  • 3
    Also, one could argue that a fair(er) comparison of sqlite to hdf5 is not dumping into `BLOB`s rather than "standard" tables. Sqlite is actually pretty good and often better than HDF5 at reading and writing binary data. – P-Gn Jan 22 '18 at 13:58
  • The other modes are handling the connection opening and closing within the same function call as well; plus in use cases like mine (passive server waiting for requests from multiple users) sqlite won't stay perpetually 'connected' and will have to be connected and closed with each function call anyways. So IMHO this comparison is on the right track for the use cases where one-off actions have to happen. Hoping to see more such benchmark tests. – Nikhil VJ May 22 '18 at 08:49
  • Does not include string data. Wonder how it would change then. – Gnudiff Jun 13 '18 at 10:57
Look into PyTables; they might have already done a lot of this legwork for you.

That said, I am not fully clear on how to compare HDF5 and sqlite: HDF5 is a general-purpose hierarchical data file format plus libraries, while sqlite is a relational database.

HDF5 does support parallel I/O at the C level, but I am not sure how much of that h5py wraps, or whether it will play nicely with NFS.

If you really want a highly concurrent relational database, why not just use a real SQL server?
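Before reaching for a full SQL server, note that SQLite itself handles moderate concurrency: any number of readers can work in parallel, and WAL mode lets readers proceed alongside a single writer. A minimal stdlib sketch (the file path and table name are illustrative):

```python
import os
import sqlite3
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), 'demo.db')

# one writer connection; WAL mode lets readers run while a writer is active
conn = sqlite3.connect(path)
conn.execute('PRAGMA journal_mode=WAL')
conn.execute('CREATE TABLE t (x INTEGER)')
conn.executemany('INSERT INTO t VALUES (?)', [(i,) for i in range(100)])
conn.commit()

counts = []

def reader():
    # each reader thread opens its own connection to the same file
    c = sqlite3.connect(path)
    counts.append(c.execute('SELECT COUNT(*) FROM t').fetchone()[0])
    c.close()

threads = [threading.Thread(target=reader) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Concurrent *writers* still serialize on the database lock, so a client/server RDBMS remains the right tool once you need many simultaneous writers.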

answered by tacaswell