6

I am trying to ptrepack a HDF file that was created with pandas HDFStore pytables interface. The main index of the dataframe was time but I made some more columns data_columns so that I can filter for data on-disk via these data_columns.

Now I would like to sort the HDF file by one of those columns (because the selection is too slow for my taste, 84 GB file), using ptrepack with the sortby option like so:

()[maye@luna4 .../nominal]$ ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc --sortby=clat C9.h5 C9_sorted.h5

and I get the error message:

()[maye@luna4 .../nominal]$ Problems doing the copy from 'C9.h5:/' to 'C9_sorted.h5:/' The error was --> : Field clat must have associated a 'full' index in table /df/table (Table(390557601,)) ''. The destination file looks like: C9_sorted.h5 (File) '' Last modif.: 'Fri Jul 26 18:17:56 2013' Object Tree: / (RootGroup) '' /df (Group) '' /df/table (Table(0,), shuffle, blosc(9)) ''

Traceback (most recent call last): File "/usr/local/epd/bin/ptrepack", line 10, in sys.exit(main()) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 480, in main upgradeflavors=upgradeflavors) File "/usr/local/epd/lib/python2.7/site-packages/tables/scripts/ptrepack.py", line 225, in copyChildren raise RuntimeError("Please check that the node names are not " RuntimeError: Please check that the node names are not duplicated in destination, and if so, add the --overwrite-nodes flag if desired. In particular, pay attention that rootUEP is not fooling you.

Does this mean, that I can not sort a HDF file by an index column, because they are not 'full' indexes?

K.-Michael Aye
  • 5,465
  • 6
  • 44
  • 56
  • I think that is correct. default indicies are 'light', level 6 IIRC. see http://pandas.pydata.org/pandas-docs/dev/io.html#indexing. change to level 9, 'full' is tantamount to creating a CSI. you can see more info about the table by print store.get_storer('df') – Jeff Jul 27 '13 at 11:28
  • also try leaving out the --propindices which I believe might be incompatible with sortby – Jeff Jul 27 '13 at 11:30
  • But when I leave out propindices I can not on-disk filter for data, which is a must for a 84 GB database file? – K.-Michael Aye Jul 27 '13 at 18:11
  • Oh, and as you saw, I am not checking for CSI, so that requirement should be fulfilled? – K.-Michael Aye Jul 27 '13 at 18:12
  • sortby creates the new indicies that's why u don't need to prop the old indices – Jeff Jul 28 '13 at 00:21
  • I see, that probably means as well, that if I foresee a sortby operation later on, that I should not waste write-time by using data_columns at the moment of store creation. – K.-Michael Aye Jul 30 '13 at 01:53
  • no, you do need them if you *ever* want to sort by that column at all. you also might experiment, with not writing indexes at ALL (when you are appending), pass ``index=False``, and then create it at the end (with sortby and ptrepack. or see this: http://pytables.github.io/usersguide/libref/structured_storage.html#the-table-class, something like ``store.get_storer('df').table.reindex_dirty()`` – Jeff Jul 30 '13 at 02:02

2 Answers2

11

I have tested several of the options Jeff mentions in our chatty discussions above.

Please have a look at this notebook, hopefully it will help you to make relevant decisions for your data storage: https://nbviewer.ipython.org/810bd0720bb1732067ff The gist for the notebook is here: https://gist.github.com/michaelaye/810bd0720bb1732067ff

My main conclusions:

  • Using index=False has several impressive effects:
    1. It reduces the file size of the resulting HDF file.
    2. It creates the HDF file much faster.
    3. Even so ptdump and the storer().group.table print-out does not show any index, the store display still shows indexers and data-columns (That's probably ignorance of the pytables machinery on my side).
  • Creating an index via store.create_table_index() does nothing yet to the speed of data selection via one of the data-columns.
  • This index HAS to be a 'full' index, so that the later ptrepack with --sortby does not bail. But it does NOT have to be a index level 9. The default level 6 is fine, and does not seem to influence the data selection speed significantly. Maybe it would though with many columns?
  • Using --propindexes almost doubles the ptrepacking time with a slight improvement in data selection speed.
  • Using compression and --propindexs is only slightly slower than using --propindex alone, while the data size (at least in this example) does not go down dramatically.
  • The data selection speed does not seem to be hugely different by having used compression.
  • the speedup for this example of 1 mio. lines of 2 columns random data by just using --sortby without --propindexes is approx factor 5 after sorting for the column of selection.

For completion, the ultra-short summary of commands:

df = pd.DataFrame(randn(1e6,2),columns=list('AB')).to_hdf('test.h5','df',
                  data_columns=list('AB'),mode='w',table=True,index=False)
store = pd.HDFStore('test.h5')
store.create_table_index('df',columns=['B'], kind='full')
store.close()

And in the shell:

ptrepack --chunkshape=auto --sortby=B test.h5 test_sorted.h5
raphael
  • 2,159
  • 17
  • 20
K.-Michael Aye
  • 5,465
  • 6
  • 44
  • 56
  • @K"the store display still shows indexers and data-columns (That's probably ignorance of the pytables machinery on my side)" – eldad-a Dec 29 '13 at 15:57
  • 1
    ... - If I understand you correctly the confusion arises from the use of the term "index" - while for pandas this is and indexer of the rows/columns, in pyTables this is an index for queries (so no indexing was done on the pyTables side, while the columns are indexers as you called using `data_columns=...`). What I personally am not sure about yet is the need of the auto-generated "column" titled "index". As far as I understand this enumerates the rows, but I believe this is already builtin in pyTables – eldad-a Dec 29 '13 at 16:04
  • 1
    Very interesting stuff, thank you. To be fair to the compression algorithms, remember that you are using random data (which is hard to compress). With 'real' data, you might get significantly smaller file sizes when using compression. – Jack Kelly Nov 18 '14 at 15:01
8

Here is a complete example.

Create the frame with a data_column. Reset the index to a full index. Use ptrepack to sortby it.

In [16]: df = DataFrame(randn(10,2),columns=list('AB')).to_hdf('test.h5','df',data_columns=['B'],mode='w',table=True)

In [17]: store = pd.HDFStore('test.h5')

In [18]: store
Out[18]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[B])

In [19]: store.get_storer('df').group.table
Out[19]: 
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

In [20]: store.create_table_index('df',columns=['B'],optlevel=9,kind='full')

In [21]: store.get_storer('df').group.table
Out[21]: 
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "B": Index(9, full, shuffle, zlib(1)).is_CSI=True}

 In [22]: store.close()

 In [25]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.0',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['B'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(9, full, shuffle, zlib(1)).is_csi=True}
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [B_dtype := b'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 10,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := ['A']]
  Data dump:
[0] (0, [1.10989047288066], 0.396613633081911)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (2, [-0.2429293157073629], -1.779366453624283)
[3] (3, [0.7305529521507728], 1.243565083939927)
[4] (4, [-0.1480724789512519], 0.5260130757651649)
[5] (5, [1.2560020435792643], 0.5455842491255144)
[6] (6, [1.20129355706986], 0.47930635538027244)
[7] (7, [0.9973598999689721], 0.8602929579025727)
[8] (8, [-0.40070941088441786], 0.7622228032635253)
[9] (9, [0.35865804118145655], 0.29939126149826045)

This is a another way to create a completely sorted index (as opposed to writing it this way)

In [23]: !ptrepack --sortby=B test.h5 test_sorted.h5

In [26]: !ptdump -avd test_sorted.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['B'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := 1,
    nan_rep := b'nan',
    non_index_axes := [(1, ['A', 'B'])],
    pandas_type := b'frame_table',
    pandas_version := b'0.10.1',
    table_type := b'appendable_frame',
    values_cols := ['values_block_0', 'B']]
/df/table (Table(10,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df/table._v_attrs (AttributeSet), 15 attributes:
   [B_dtype := b'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 10,
    TITLE := '',
    VERSION := '2.6',
    index_kind := b'integer',
    values_block_0_dtype := b'float64',
    values_block_0_kind := ['A']]
  Data dump:
[0] (2, [-0.2429293157073629], -1.779366453624283)
[1] (1, [0.0981650001268093], -0.9209780702446433)
[2] (9, [0.35865804118145655], 0.29939126149826045)
[3] (0, [1.10989047288066], 0.396613633081911)
[4] (6, [1.20129355706986], 0.47930635538027244)
[5] (4, [-0.1480724789512519], 0.5260130757651649)
[6] (5, [1.2560020435792643], 0.5455842491255144)
[7] (8, [-0.40070941088441786], 0.7622228032635253)
[8] (7, [0.9973598999689721], 0.8602929579025727)
[9] (3, [0.7305529521507728], 1.243565083939927)
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • Thanks! What happens when I append another dataframe to a store that has a full index on that column? Will it automatically continue to be? Or is it required to do the store.create_table_index on the final file? Is it then happening on disk and therefore does not create memory issues when my store has grown to 80 Gigs? – K.-Michael Aye Jul 27 '13 at 18:19
  • I think when u append it will index to the new scheme. the only reason you actually need to sortby is to force it to reindex (you can actually do this via a pytables function call as well, which is what sortby is doing) – Jeff Jul 28 '13 at 00:20
  • what's with the `pandas_version := b'0.10.1'` in the ptdump output? – K.-Michael Aye Jul 30 '13 at 19:14
  • I added the missing ptrepack that you most likely had used to create test_sorted.h5. Please check that it's correct. – K.-Michael Aye Jul 30 '13 at 19:26
  • was the last time the version changed (the actual structure of the table meta data that pandas uses to reconstruct the data). thus hasn't changed in a while (there was a 0.10.0 for a brief time). Allows backward compat reads if we ever find a need to change the structure. – Jeff Jul 30 '13 at 19:26
  • @K.-MichaelAye I didn't actually do it that way, but created it directly (though your method is correct as well), and prob more useful to what you are actually doing – Jeff Jul 30 '13 at 19:29
  • Sorry, I don't understand. Your first test.h5 is clearly not sorted by column B? I understood to have a sorted HDF file I need to do both: Convert one index to full index and THEN sort it with ptrepack --sortby? – K.-Michael Aye Jul 30 '13 at 20:06
  • I think for your application, using ``ptrepack`` is correct, e.g. write in whatever order you have the data coming in, then ``sortby`` to create a CSI (as you are query by a different primary field). However, you can also set the index to be CSI when you create it, so it will create the index as you append (but I suspect that might be slower as it potentially may have to move data more than once) – Jeff Jul 30 '13 at 20:10
  • you might want to create using ``index=False``, which creates NO indexes, then sort (as there is *some* overhead to creating the indices in the first place, how much, I don't know) – Jeff Jul 30 '13 at 20:10
  • But what I'm saying is, I don't understand how *your* first solution can be a solution as the dump indicates that it is indeed not sorted? – K.-Michael Aye Jul 30 '13 at 20:21
  • it was just showing how to do it starting with a default setup – Jeff Jul 30 '13 at 20:27