You should just try it.
In [22]: df = DataFrame(np.random.randn(5,2),columns=['A','B'])
In [23]: store = pd.HDFStore('test.h5',mode='w')
In [24]: store.append('df_only_indexables',df)
In [25]: store.append('df_with_data_columns',df,data_columns=True)
In [26]: store.append('df_no_index',df,data_columns=True,index=False)
In [27]: store
Out[27]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df_no_index frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
/df_only_indexables frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index])
/df_with_data_columns frame_table (typ->appendable,nrows->5,ncols->2,indexers->[index],dc->[A,B])
In [28]: store.close()
you automatically get the index of the stored frame as a queryable column. By default NO other columns can be queried.
If you specify data_columns=True
or data_columns=list_of_columns
, then these are stored separately and can then be subsequently queried.
If you specify index=False
then a PyTables
index is not automatically created for the queryable column (eg. the index
and/or data_columns
).
To see the actual indexes being created (the PyTables
indexes), see the output below. colindexes
defines which columns have an actual PyTables
index created. (I have truncated it somewhat).
/df_no_index/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
/df_no_index/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
/df_only_indexables/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_only_indexables/table._v_attrs (AttributeSet), 11 attributes:
[CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'values_block_0',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer',
values_block_0_dtype := 'float64',
values_block_0_kind := ['A', 'B']]
/df_with_data_columns/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"A": Float64Col(shape=(), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
autoindex := True
colindexes := {
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"B": Index(6, medium, shuffle, zlib(1)).is_csi=False}
/df_with_data_columns/table._v_attrs (AttributeSet), 15 attributes:
[A_dtype := 'float64',
A_kind := ['A'],
B_dtype := 'float64',
B_kind := ['B'],
CLASS := 'TABLE',
FIELD_0_FILL := 0,
FIELD_0_NAME := 'index',
FIELD_1_FILL := 0.0,
FIELD_1_NAME := 'A',
FIELD_2_FILL := 0.0,
FIELD_2_NAME := 'B',
NROWS := 5,
TITLE := '',
VERSION := '2.7',
index_kind := 'integer']
So if you want to query a column, make it a data_column
. If you don't then they will be stored in blocks by dtype (faster / less space).
You normally always want to index a column for retrieval, BUT, if you are creating and then appending multiple files to a single store, you usually turn off the index creation and do it at the end (as this is pretty expensive to create as you go).
See the cookbook for a menagerie of questions.