I am new to pandas. I created a DataFrame and saved it under the key "dirtree" using to_hdf:

df.to_hdf("d:/datatree full.h5", "dirtree")

I then repeated the steps above. When I checked the file size afterwards, it had doubled. I guessed that my second DataFrame had been appended to the old one, but listing the DataFrames in the store and counting rows shows no extra DataFrame or rows. How can that be?

My code to check the store:

store = pd.HDFStore('d:/datatree.h5')
print(store)
df = pd.read_hdf('d:/datatree.h5', 'dirtree')
df.text.count() # text is one of the columns in df
MaxU - stand with Ukraine
bzimor

1 Answer

I could reproduce this issue in the following way:

Original sample DF:

In [147]: df
Out[147]:
          a         b           c
0  0.163757 -1.727003    0.641793
1  1.084989 -0.958833    0.552059
2 -0.419273 -1.037440    0.544212
3 -0.197904 -1.106120   -1.117606
4  0.891187  1.094537  100.000000

Let's save it to an HDFStore:

In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')

file size: 6992 bytes

Let's do it one more time:

In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')

file size: 6992 bytes (NOTE: it did NOT change)

Now let's open the HDFStore:

In [150]: store = pd.HDFStore('c:/temp/test_dup.h5')

In [151]: store
Out[151]:
<class 'pandas.io.pytables.HDFStore'>
File path: c:/temp/test_dup.h5
/x            frame        (shape->[5,3])

file size: 6992 bytes (NOTE: it did NOT change)

Let's save the DF to the HDFStore one more time, but notice that this time the store is open:

In [156]: df.to_hdf('c:/temp/test_dup.h5', 'x')

In [157]: store.close()

file size: 12696 bytes # BOOM !!!

Root cause:

When we do store = pd.HDFStore('c:/temp/test_dup.h5'), the file is opened with the default mode 'a' (append), so the store is ready to be modified. When you then write to the same file without going through this store object, the new data is written as a copy instead of replacing the old one, in order to protect the open store.
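To check this on your own machine, here is a minimal sketch that measures how much the file grows around each rewrite. The helper size_change(), the function demo() and the file name test_dup.h5 are hypothetical names introduced only for this illustration, and the numbers you see will depend on your pandas and PyTables versions, so no expected output is given.

```python
import os
from typing import Callable


def size_change(path: str, action: Callable[[], None]) -> int:
    """Return how many bytes the file at `path` grew (or shrank) while `action` ran."""
    before = os.path.getsize(path) if os.path.exists(path) else 0
    action()
    return os.path.getsize(path) - before


def demo(path: str = "test_dup.h5") -> None:
    """Hypothetical reproduction of the steps above; needs pandas and PyTables."""
    import pandas as pd  # imported lazily so size_change() stays stdlib-only

    df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
    df.to_hdf(path, key="x")  # initial write

    # Rewrite the same key while no other handle is open.
    print("no open store:", size_change(path, lambda: df.to_hdf(path, key="x")))

    # Rewrite the same key while a second handle is open in the default mode 'a'.
    store = pd.HDFStore(path)

    def rewrite_then_close() -> None:
        df.to_hdf(path, key="x")
        store.close()

    print("store open in 'a':", size_change(path, rewrite_then_close))
```

Calling demo() prints the size difference for both cases, which lets you compare them with the byte counts shown in the session above.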

How to avoid it:

Use mode='r' when you open a store that you only want to read:

In [158]: df.to_hdf('c:/temp/test_dup2.h5', 'x')

In [159]: store2 = pd.HDFStore('c:/temp/test_dup2.h5', mode='r')

In [160]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
...
skipped
...
ValueError: The file 'c:/temp/test_dup2.h5' is already opened, but in read-only mode.  Please close it before reopening in append mode.

Or, as a better way to manage your HDF files, do all reads and writes through a single store object:

store = pd.HDFStore(filename)  # opened in the default mode 'a' (append)
store.append('key_name', df, data_columns=True)  # append() always writes in the 'table' format
...
store.close()  # don't forget to flush changes to disk !!!
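One way to make sure the store is always closed is to open it in a with block, since HDFStore can be used as a context manager. The sketch below assumes pandas and PyTables are installed; append_frame() and count_rows() are hypothetical helper names, not part of pandas.

```python
from typing import Any


def append_frame(path: str, key: str, df: Any) -> None:
    """Append the DataFrame `df` to `key`; the store is closed when the block exits."""
    import pandas as pd  # imported lazily; PyTables is only needed when this runs

    with pd.HDFStore(path, mode="a") as store:
        # append() writes in the 'table' format and adds rows to an existing key.
        store.append(key, df, data_columns=True)


def count_rows(path: str, key: str) -> int:
    """Open the file read-only and return the number of rows stored under `key`."""
    import pandas as pd

    with pd.HDFStore(path, mode="r") as store:
        return len(store.select(key))
```

Because each helper opens and closes its own handle, no store is left open in mode 'a' while another call writes to the same file.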
MaxU - stand with Ukraine