
I am experimenting with different pandas-friendly storage schemes for tick data. The fastest (in terms of reading and writing) so far has been using an HDFStore with blosc compression and the "fixed" format.

import pandas as pd

# one node per ticker symbol, fixed format, blosc-compressed
store = pd.HDFStore(path, complevel=9, complib='blosc')
store.put(symbol, df)
store.close()

I'm indexing by ticker symbol since that is my common access pattern. However, this scheme adds about 1 MB of space per symbol. That is, if the data frame for a microcap stock contains just a thousand ticks for that day, the file will still increase by a megabyte in size. So for a large universe of small stocks, the .h5 file quickly becomes unwieldy.

Is there a way to keep the performance benefits of blosc/fixed format but get the size down? I have tried the "table" format, which requires about 285 KB per symbol.

# table format with queryable columns (roughly 285 KB per symbol)
store.append(symbol, df, data_columns=True)

However, this format is dramatically slower to read and write.
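
For reference, here is roughly how I have been comparing the two formats (a throwaway sketch; the synthetic frame below just stands in for one symbol's ticks, and timings will vary by machine):

import time
import numpy as np
import pandas as pd

# synthetic stand-in for one symbol's tick data
n = 100000
df = pd.DataFrame({
    'exchtime': pd.date_range('2014-02-07', periods=n, freq='s'),
    'bid': np.random.rand(n),
    'bidsize': np.random.randint(1, 100, n),
    'ask': np.random.rand(n),
    'asksize': np.random.randint(1, 100, n),
})

for fmt in ('fixed', 'table'):
    path = 'bench_%s.h5' % fmt
    start = time.time()
    store = pd.HDFStore(path, complevel=9, complib='blosc')
    store.put('SYM', df, format=fmt)
    store.close()
    middle = time.time()
    pd.read_hdf(path, 'SYM')
    end = time.time()
    print(fmt, 'write %.3fs, read %.3fs' % (middle - start, end - middle))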

In case it helps, here is what my data frame looks like:

exchtime     datetime64[ns]
localtime    datetime64[ns]
symbol               object
country               int64
exch                 object
currency              int64
indicator             int64
bid                 float64
bidsize               int64
bidexch              object
ask                 float64
asksize               int64
askexch              object

The blosc compression itself works pretty well since the resulting .h5 file requires only 30--35 bytes per row. So right now my main concern is decreasing the size penalty per node in HDFStore.
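
For what it's worth, here is the kind of rough check I used to estimate the per-node overhead (again synthetic data; the exact numbers will differ):

import os
import numpy as np
import pandas as pd

# ten tiny frames, one node each, to see how much the file grows per node
tiny = pd.DataFrame({'bid': np.random.rand(1000), 'ask': np.random.rand(1000)})

path = 'overhead_test.h5'
if os.path.exists(path):
    os.remove(path)

store = pd.HDFStore(path, complevel=9, complib='blosc')
for i in range(10):
    store.put('SYM%d' % i, tiny)
store.close()

print('bytes per node:', os.path.getsize(path) / 10.0)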

chrisaycock
  • AFAIK there is a certain minimum block size in PyTables; you can have a look at the various options to `ptrepack` the file. A 1MB minimum is prob reasonable IMHO. You can also try writing in the `Table` format: instead of setting `data_columns=True`, just pass `format='table'`. It will write the table format (but you won't be able to query except by index); it stores as a single block and so should be almost as fast as fixed (but somewhat more space efficient). – Jeff Feb 07 '14 at 18:46
  • @Jeff Any options in particular I should pass to `ptrepack`? If I don't give any options, the resulting file is the same size. – chrisaycock Feb 07 '14 at 18:59
  • You can try `chunkshape`; I don't know if this will change the size though. – Jeff Feb 07 '14 at 19:06
  • @Jeff `chunkshape=auto` shrank the file! I'm going to experiment with this and see how it goes. – chrisaycock Feb 07 '14 at 20:35
  • really? that's great. FYI, there is also a new blosc filter in PyTables 3.1 (just released), see here: http://pytables.github.io/release-notes/RELEASE_NOTES_v3.1.x.html; not sure what the updated blosc will do (I *think* that pandas will pass the argument straight through; if it doesn't work, pls file a bug report - currently pandas doesn't validate the compressor) – Jeff Feb 07 '14 at 20:41
  • @Jeff If you write an answer containing the ptrepack and chunkshape suggestions, I'll accept it. – chrisaycock Feb 07 '14 at 21:02
  • I've also found that using the "fixed" format with compression can lead to corrupted data. It also doesn't appear to be supported: if you try to pass this in the put statement you will get an error. – Stuart Mitchell Feb 13 '18 at 03:40

2 Answers


AFAIK there is a certain minimum for a block size in PyTables.

Here are some suggestions:

  • You can ptrepack the file, using the option chunkshape='auto'. This repacks the file with a chunkshape that is computed from the actual data, which can give a more efficient block size and therefore smaller file sizes. The reason this helps is that PyTables needs to be informed up front about the expected number of rows of the final array/table. (See the sketch after this list.)

  • You can achieve an optimal chunksize in the Table format up front by passing expectedrows= (and only doing a single append). However, ptrepacking will STILL have a benefit here.

  • You can also try writing in the Table format: instead of setting data_columns=True, just pass format='table'. It will write the table format (but you won't be able to query except by index); it stores as a single block and so should be almost as fast as fixed (but somewhat more space efficient).

  • In PyTables 3.1 (just released), there is a new blosc filter, which might reduce file sizes. See the release notes: http://pytables.github.io/release-notes/RELEASE_NOTES_v3.1.x.html
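
For example, something along these lines should combine the suggestions above (a sketch only: the file names, the key, and the synthetic frame are placeholders, and ptrepack must be on your PATH):

import subprocess
import numpy as np
import pandas as pd

# placeholder frame standing in for one symbol's ticks
df = pd.DataFrame({'bid': np.random.rand(1000), 'ask': np.random.rand(1000)})

store = pd.HDFStore('ticks.h5', complevel=9, complib='blosc')
# a single append in table format, without data_columns: stored as one
# block, queryable only by the index; expectedrows lets PyTables pick a
# sensible chunkshape up front
store.append('SYM', df, format='table', expectedrows=len(df))
store.close()

# repack so PyTables can recompute the chunkshape from the actual data
# (ptrepack is a command-line tool installed with PyTables)
subprocess.check_call(['ptrepack', '--chunkshape=auto',
                       '--complevel=9', '--complib=blosc',
                       'ticks.h5', 'ticks.repacked.h5'])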

Jeff
  • For those lacking the background: ptrepack is a command-line utility that comes with [PyTables](https://www.pytables.org/index.html). Note that the Python package is installed as `tables`: `python -m pip show tables`. It's an optional dependency of pandas, so it may already be present. Sample call for repacking with compression: `ptrepack --chunkshape=auto --complevel=9 --complib=blosc ` – normanius Apr 12 '21 at 14:05
  • Note that for PyTables 3.6.1, only enabling compression led to an actual filesize reduction in my case (60x10 table with str and datetime; original .csv: 17kB, original .h5: 1.1MB, repacked/compressed .h5: 25kB). Using `chunkshape='auto'` alone did not help. – normanius Apr 12 '21 at 14:15

This augments the previous answer by some examples and explanations. For my version of Pandas (1.2.3) and PyTables (3.6.1), I see the following behavior when writing to an HDF store:

import pandas as pd
df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])

# Create a store with fixed format: considerable file-size overhead!
# File size store1.h5: 1.1MB
store = pd.HDFStore("store1.h5")
store.put(key="some/key", value=df, format="fixed")
store.close()

# Better: create a store with table format.
# File size store2.h5: 86kB!
store = pd.HDFStore("store2.h5")
store.put(key="some/key", value=df, format="table")
store.close()

Note: Instead of going through the store object, you can call DataFrame.to_hdf() directly:

df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
df.to_hdf("store1.h5", key="some/key", format="fixed")
df.to_hdf("store2.h5", key="some/key", format="table")

In this example, the storage overhead is drastically reduced with the second approach (store2.h5). In more realistic situations, this overhead becomes less significant with larger amounts of data. A fixed-format store allows for fast read/write operations, while the table format is more flexible (see docs for details). For instance, the table format can handle mixed data types (per column) better than the fixed format. See, for instance, what happens if you use df.T.to_hdf(...) in the above examples: the fixed format issues the PerformanceWarning below (see this post on SO, or this pandas issue), while the table format works just fine.

PerformanceWarning: your performance may suffer as PyTables will pickle 
object types that it cannot map directly to c-types
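
A quick way to surface the warning explicitly (store3.h5 is just a placeholder file name):

import warnings
import pandas as pd

df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])

# df.T has mixed int/str content per column, i.e. object dtype, which the
# fixed format can only store by pickling -- hence the PerformanceWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.T.to_hdf("store3.h5", key="some/key", format="fixed")

print([str(w.message) for w in caught])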

ptrepack is a command-line utility that comes with PyTables (the package is named tables). To see the current version of PyTables: python -m pip show tables.

Using ptrepack, I can further reduce the file sizes of my dummy examples by applying some compression. (Using the option --chunkshape=auto did not have a noticeable effect here.)

# store1.repack.h5: 1.1MB -> 22kB
ptrepack --complevel=9 --complib=blosc "store1.h5" "store1.repack.h5"
# store2.repack.h5: 86kB -> 9kB
ptrepack --complevel=9 --complib=blosc "store2.h5" "store2.repack.h5"

In summary, saving the data frame in table format and repacking the resulting store with compression reduces the store's storage footprint. Whether it is worth minimizing the storage overhead of an HDF store depends on your application.
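
As a sanity check, the repacked store can be read back and compared with the original (paths as in the example above):

import pandas as pd

# the repacked file should contain exactly the same data
original = pd.read_hdf("store2.h5", key="some/key")
repacked = pd.read_hdf("store2.repack.h5", key="some/key")
assert original.equals(repacked)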

normanius