Indexing and Data Columns in Pandas/PyTables

Question

http://pandas.pydata.org/pandas-docs/stable/io.html#indexing

I'm really confused about this concept of Data columns in Pandas HDF5 IO. Plus there's very little to no information about it to be found on googling it either. Since I'm diving into Pandas in a large project which involves HDF5 storage, I'd like to be clear about such concepts.

The docs say:

You can designate (and index) certain columns that you want to be able to perform queries (other than the indexable columns, which you can always query). For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. You can specify data_columns = True to force all columns to be data_columns

This is confusing:

other than the indexable columns, which you can always query: What are 'indexable' columns? aren't all columns 'indexable'? What does this term mean?
For instance say you want to perform this common operation, on-disk, and return just the frame that matches this query. How is that different from normal querying on a Pytable; with or without any indexes of data_columns?
What is the fundamental difference between a non-indexed, indexed, and data_column column?

I was having the same issue, with using `HDFStore.select_column` function. Found this only after figuring out that the column needed to be set in `data_columns`. This issue on github delves further into this: https://github.com/pandas-dev/pandas/issues/21188 — Nikhil VJ, May 24 '18 at 16:37

Jeff · Accepted Answer · 2014-09-07T22:21:34.903

You should just try it.

In [22]: df = DataFrame(np.random.randn(5,2),columns=['A','B'])

In [23]: store = pd.HDFStore('test.h5',mode='w')

In [24]: store.append('df_only_indexables',df)

In [25]: store.append('df_with_data_columns',df,data_columns=True)

In [26]: store.append('df_no_index',df,data_columns=True,index=False)

In [27]: store
Out[27]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index                     frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables              frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index])          
/df_with_data_columns            frame_table  (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])

In [28]: store.close()

you automatically get the index of the stored frame as a queryable column. By default NO other columns can be queried.
If you specify data_columns=True or data_columns=list_of_columns, then these are stored separately and can then be subsequently queried.
If you specify index=False then a PyTables index is not automatically created for the queryable column (eg. the index and/or data_columns).

To see the actual indexes being created (the PyTables indexes), see the output below. colindexes defines which columns have an actual PyTables index created. (I have truncated it somewhat).

/df_no_index/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  /df_no_index/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (2730,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
   [A_dtype := 'float64',
    A_kind := ['A'],
    B_dtype := 'float64',
    B_kind := ['B'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'A',
    FIELD_2_FILL := 0.0,
    FIELD_2_NAME := 'B',
    NROWS := 5,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

So if you want to query a column, make it a data_column. If you don't then they will be stored in blocks by dtype (faster / less space).

You normally always want to index a column for retrieval, BUT, if you are creating and then appending multiple files to a single store, you usually turn off the index creation and do it at the end (as this is pretty expensive to create as you go).

See the cookbook for a menagerie of questions.

What is the use of setting index=True? It seems to me that I can set data_columns=True, index=False, and I can still query the table using its columns — Michael, Oct 10 '15 at 12:33
sure you can but you won't actually get the benefit of having an index causing a linear scan for any query. index=False is useful when appending, eg multiply appending then construct the index is more efficient that appending with an index (for large amounts of data) — Jeff, Oct 10 '15 at 13:33
how do I construct an index after finish appending? I couldn't find an example on the pandas documentation? Thank you for your help. — Michael, Oct 10 '15 at 13:51
see this post: http://stackoverflow.com/questions/17893370/ptrepack-sortby-needs-full-index — Jeff, Oct 10 '15 at 15:09
if you would like to add to the cookbook would be gr8 (do a pull request) — Jeff, Oct 10 '15 at 15:09
Do you mean adding this question to the cookbook in http://pandas-docs.github.io/pandas-docs-travis/cookbook.html#hdfstore/ ? Sure — Michael, Oct 10 '15 at 16:06
I don't see the index option in docs of `append`, `put` or `to_hdf`. Perhaps there's been changes? — Nikhil VJ, May 24 '18 at 16:39

Indexing and Data Columns in Pandas/PyTables

1 Answers1

Linked