
Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'

Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

Save also to CSV:

mydf.to_csv('myfile.csv', sep=':')
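
For reference, the resulting sizes can be checked programmatically; a minimal sketch using os.path.getsize on the file names above:

import os

# Report the on-disk size of each file written above, in megabytes
for path in ('myfile.csv', 'myfile.h5'):
    print(path, round(os.path.getsize(path) / 1e6, 1), 'MB')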

The result is:

  • myfile.csv is 5.6 MB
  • myfile.h5 is 11 MB

The difference grows bigger as the datasets get larger.

I have tried other compression methods and levels. Is this a bug? (I am using pandas 0.11 and the latest stable versions of HDF5 and Python.)

Amelio Vazquez-Reina
  • afaik HDF5 is designed to optimize lookup times, not file size – Joran Beasley May 19 '13 at 22:01
  • the string columns also use a fixed size per column (that is, the max size of all of your columns). Read about min_itemsize in the docs. – Jeff May 19 '13 at 22:10
  • See this reference for my comment above: http://pandas.pydata.org/pandas-docs/dev/io.html#string-columns. In addition, your test is just showing a bit of overhead in HDF5. This is very structured data; try the same thing with 1M rows of floats and you will be surprised. – Jeff May 19 '13 at 22:47
  • Thanks a lot @Jeff. I am still at odds with this. The matrix I created has 100 x 3000 = **300,000 float64** values and only **100** strings, all of them holding the same value, which I would imagine should benefit from compression. Still, it takes *more* space than the CSV file, even though I am using the highest compression scheme possible. Is this expected? – Amelio Vazquez-Reina May 19 '13 at 23:19
  • @user815423426 my experience has been the same as yours - hdf5 takes up much more space than csv. – Jeff Tratner May 19 '13 at 23:32
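
A minimal sketch of the min_itemsize suggestion from the comments above (this assumes the table format and a small hypothetical frame; the 20-character width is arbitrary):

import pandas as pd

# Hypothetical frame with one string column
df = pd.DataFrame({'A': range(5), 'something': ['hello_world'] * 5})

store = pd.HDFStore('strings.h5')
# With the table format, strings are stored at a fixed width per column;
# min_itemsize reserves 20 characters for 'something' up front
store.append('df', df, min_itemsize={'something': 20})
store.close()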

1 Answer


Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651

Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are really more efficiently represented in binary than as a text representation.

In addition, HDF5 is row based. You get MUCH more efficiency from tables that are not too wide but are fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)
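
A minimal sketch of that transposed layout, using a hypothetical purely numeric frame (transposing the mixed-dtype frame from the question would upcast everything to object, so only the float block is shown):

import numpy as np
import pandas as pd

# Hypothetical wide frame: 100 rows x 3000 columns
wide = pd.DataFrame(np.random.random((100, 3000)))

store = pd.HDFStore('transposed.h5')
# Long and narrow (3000 rows x 100 columns) suits HDF5's row-based layout better
store['wide_T'] = wide.T
store.close()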

I routinely have tables that are 10M+ rows, and query times can be in the ms. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy guys, for whom 10 GB+ is a few seconds' worth of data!).

-rw-rw-r--  1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r--  1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = pd.DataFrame(np.random.randn(1000000, 10))

In [9]: df
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop

I really wouldn't worry about the size (I suspect you are not, but are merely interested, which is fine). The point of HDF5 is that disk is cheap and CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.
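
A minimal sketch of that chunked, query-on-disk pattern with the table format (assuming a reasonably recent pandas; the column names and threshold are illustrative):

import numpy as np
import pandas as pd

store = pd.HDFStore('big.h5')

# Append in chunks so the full dataset never has to sit in memory at once
for _ in range(10):
    chunk = pd.DataFrame(np.random.randn(100000, 10), columns=list('abcdefghij'))
    store.append('df', chunk, data_columns=['a'])

# Query on disk: only the rows where column 'a' is positive are read back
subset = store.select('df', where='a > 0')
store.close()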

Jeff