I asked myself the same question, so I will try to summarize what I have found.
Assuming this is similar to your dataset:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 10 columns):
0    100000 non-null float64
1    100000 non-null float64
2    100000 non-null float64
3    100000 non-null float64
4    100000 non-null float64
5    100000 non-null float64
6    100000 non-null float64
7    100000 non-null float64
8    100000 non-null float64
9    100000 non-null float64
dtypes: float64(10)
memory usage: 8.4 MB
```
Your "basic" dataset will need ≈ 10 MB
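You can check that estimate directly; a minimal sketch reproducing a DataFrame of the same shape (the random values are just filler):

```python
import numpy as np
import pandas as pd

# 100,000 rows x 10 float64 columns, mirroring the df.info() output above
df = pd.DataFrame(np.random.rand(100000, 10))

# Each float64 takes 8 bytes: 100,000 * 10 * 8 = 8,000,000 bytes (~7.6 MiB).
# Older pandas versions also stored an Int64Index (8 more bytes per row),
# which is where the 8.4 MB figure above comes from.
print(df.memory_usage(index=True).sum())
```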
General view
So HDF5 is that rare product which excels in two fields: archiving and
sharing data according to strict standardized conventions, and also
ad-hoc, highly flexible and iterative use for local data analysis.
https://hdfgroup.org/wp/2015/03/hdf5-as-a-zero-configuration-ad-hoc-scientific-database-for-python/
Generally, HDF5 works better with numeric data than with strings, which seems to match your case. Perhaps its main limitation is concurrency:
Of greater concern are recent postings on a mailing list discussing
use of netCDF and HDF5 in high performance computing applications with
thousands of processors using parallel I/O, which warn of the danger
of file corruption during parallel I/O if a client dies at a
particular time. The HDF Group is aware of this problem and is
addressing it.
https://earthdata.nasa.gov/standards/hdf5
Looking at import efficiency
Example 1:
As shown in the result, the time of data import from HDF5 is the
shortest, only ~50% of import time from CSV and ~25% of import time
from SQLITE.
https://statcompute.wordpress.com/tag/hdf5/
Example 2:
```
In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop
```
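The read functions themselves are not shown above; a minimal reconstruction of what such a comparison might look like (file names, table name, and row count are assumptions, not the original benchmark):

```python
import sqlite3
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10000, 10),
                  columns=[f"c{i}" for i in range(10)])

# SQL: store and read back through SQLite
con = sqlite3.connect("bench.db")
df.to_sql("data", con, if_exists="replace", index=False)

def test_sql_read():
    return pd.read_sql("SELECT * FROM data", con)

# HDF5 "fixed" format: fastest, but not queryable on disk
df.to_hdf("bench_fixed.h5", key="data", format="fixed")

def test_hdf_fixed_read():
    return pd.read_hdf("bench_fixed.h5", "data")

# HDF5 "table" format: slower, but supports on-disk queries
df.to_hdf("bench_table.h5", key="data", format="table")

def test_hdf_table_read():
    return pd.read_hdf("bench_table.h5", "data")

# Rough timing comparison (3 runs each)
for fn in (test_sql_read, test_hdf_fixed_read, test_hdf_table_read):
    print(fn.__name__, timeit.timeit(fn, number=3))
```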
HDF5 - concurrency, compression & I/O performance
SQL
Good for:
– Your data is structured (you have a ‘Schema’)
– Relational (tables of rows and columns)
– Mid-Size, <=several GB in total
– Transactional operations (ensuring DB is consistent)
https://www.nersc.gov/assets/Uploads/09-Databases-Wahid2.pdf
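The "transactional operations" point is exactly what HDF5 lacks; with SQLite, for example, a failed batch can be rolled back so the database stays consistent (the table and values below are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("INSERT INTO readings (value) VALUES (?)", (1.0,))
        raise RuntimeError("simulated crash mid-batch")
except RuntimeError:
    pass

# The partial insert was rolled back: the table is still empty
count = con.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)  # 0
```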
In a previous SO discussion:
SQLite has features to support database queries using SQL. HDF5 has
features to support large scientific datasets.
Evaluating HDF5: What limitations/features does HDF5 provide for modelling data?
But pandas (HDF5 through PyTables) has rich query capabilities; SQL-like queries were added recently.
```
In [171]: df.query('(a < b) & (b < c)')
Out[171]:
          a         b         c
3  0.011763  0.022921  0.244186
8  0.116822  0.364564  0.454607
```
http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method-experimental
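The same style of condition can also be pushed down to data on disk when it is stored in HDF5 "table" format, so only the matching rows are read; a minimal sketch (file name and column names are assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 3), columns=list("abc"))

# data_columns=True makes the individual columns queryable on disk
df.to_hdf("store.h5", key="df", format="table", data_columns=True)

# The "where" condition is evaluated inside PyTables, not in memory
subset = pd.read_hdf("store.h5", "df", where="a < 0.1")
print(len(subset))
```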
NOSQL
Some insights about HDF5 vs NOSQL
- HDF5 is not a database. MongoDB has ACID properties, HDF5 doesn't (this might be important).
- HDF5 (unless you are using the MPI version) does not support concurrent write access (read access is possible).
What is a better approach of storing and querying a big dataset of meteorological data
Conclusions
Indeed, for some applications a database management system just isn’t
justified:
- If all the datasets are small,
- If data storage and access requirements aren’t going to change (flexibility isn’t needed).
http://www.barrodale.com/docs/Why%20don't%20scientists%20use%20databases.pdf
Hope it helps.