I could reproduce this issue in the following way:
Original sample DF:
In [147]: df
Out[147]:
          a         b          c
0  0.163757 -1.727003   0.641793
1  1.084989 -0.958833   0.552059
2 -0.419273 -1.037440   0.544212
3 -0.197904 -1.106120  -1.117606
4  0.891187  1.094537 100.000000
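for reference, a frame like this can be built as follows (a minimal sketch - the seed and the random values are my assumption; only the shape and the 100.0 entry come from the output above):
import numpy as np
import pandas as pd
np.random.seed(0)  # assumed seed, just for repeatability
df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'))
df.loc[4, 'c'] = 100.0  # mirrors the 100.000000 in the last row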
let's save it to HDFStore:
In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')
file size: 6992 bytes
let's do it one more time:
In [149]: df.to_hdf('c:/temp/test_dup.h5', 'x')
file size: 6992 bytes
NOTE: it did NOT change
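the sizes quoted here were read from the filesystem; a quick way to check them from Python (os.path.getsize is standard library, the path is the one from the session):
import os
print(os.path.getsize('c:/temp/test_dup.h5'))  # -> 6992 bytes on the original run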
now let's open the HDFStore:
In [150]: store = pd.HDFStore('c:/temp/test_dup.h5')
In [151]: store
Out[151]:
<class 'pandas.io.pytables.HDFStore'>
File path: c:/temp/test_dup.h5
/x            frame        (shape->[5,3])
file size: 6992 bytes
NOTE: it did NOT change
let's save the DF to the same file one more time, but note that the store is still open:
In [156]: df.to_hdf('c:/temp/test_dup.h5', 'x')
In [157]: store.close()
file size: 12696 bytes
# BOOM !!!
Root cause:
when we do store = pd.HDFStore('c:/temp/test_dup.h5'), the file is opened with the default mode 'a' (append), i.e. the store is ready to modify the file. When you then write to the same file, but not through this open store, it makes a copy of the existing data in order to protect the open store... and that's why the file grows.
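here is the whole failure mode as one small sketch (the scratch path and the toy frame are hypothetical; per the session above every such write grows the file, though the exact growth may depend on your pandas/PyTables versions):
import os
import pandas as pd
path = 'c:/temp/test_dup_demo.h5'  # hypothetical scratch file
df = pd.DataFrame({'a': range(5)})  # any small frame will do
df.to_hdf(path, 'x')
size_before = os.path.getsize(path)
store = pd.HDFStore(path)  # default mode='a' - the file is held open for writing
df.to_hdf(path, 'x')  # write through a second handle while the store is open
store.close()
print(os.path.getsize(path) > size_before)  # True - the file grew, as shown above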
How to avoid it: use mode='r' when you open a store:
In [158]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
In [159]: store2 = pd.HDFStore('c:/temp/test_dup2.h5', mode='r')
In [160]: df.to_hdf('c:/temp/test_dup2.h5', 'x')
...
skipped
...
ValueError: The file 'c:/temp/test_dup2.h5' is already opened, but in read-only mode. Please close it before reopening in append mode.
or, a better way to manage your HDF files is to work with an explicit store object:
store = pd.HDFStore(filename)  # opened with mode='a' (append) per default
store.append('key_name', df, data_columns=True)  # append() writes in the 'table' format
...
store.close()  # don't forget to close it - this flushes your changes to disk !!!
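or, since HDFStore is also a context manager, you can let Python close it for you - close() is then guaranteed even if an exception occurs (the filename and the toy frame are hypothetical):
import pandas as pd
df = pd.DataFrame({'a': range(5)})  # any frame to append
with pd.HDFStore('c:/temp/test_dup3.h5') as store:  # hypothetical file
    store.append('key_name', df, data_columns=True)
# the store is flushed and closed here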