I want to store data in an HDF5 file, but appending new data to that file makes the index repeat. How can I avoid this?

In [35]: hdf = pd.HDFStore('temp.h5')
In [36]: hdf.is_open
Out[36]: True

In [37]: hdf
Out[37]:
<class 'pandas.io.pytables.HDFStore'>
File path: temp.h5
Empty

Add values with index=None

In [38]: pd.DataFrame(np.random.random((3, 1)), columns=['values'], index=None).to_hdf(hdf, 'rand_values', append=True)

In [39]: hdf
Out[39]:
<class 'pandas.io.pytables.HDFStore'>
File path: temp.h5
/rand_values            frame_table  (typ->appendable,nrows->3,ncols->1,indexers->[index])

# So far so good...
In [40]: hdf['rand_values']
Out[40]:
     values
0  0.258981
1  0.743619
2  0.297104

In [41]: hdf.close()
In [42]: hdf.open()

# Add values again with INDEX=NONE
In [43]: pd.DataFrame(np.random.random((3, 1)), columns=['values'], index=None).to_hdf(hdf, 'rand_values', append=True)

Index now repeats...

In [44]: hdf['rand_values']
Out[44]:
     values
0  0.258981
1  0.743619
2  0.297104
0  0.532033
1  0.242023
2  0.431343

In [45]: hdf.close()
In [46]: hdf.open()

In [47]: hdf['rand_values']
Out[47]:
     values
0  0.258981
1  0.743619
2  0.297104
0  0.532033
1  0.242023
2  0.431343

# Print index
In [48]: hdf['rand_values'].index
Out[48]: Int64Index([0, 1, 2, 0, 1, 2], dtype='int64')

I am using Pandas 0.17.0, Python 3.4.3

Thanks.

Kevad
  • This looks to be related to http://stackoverflow.com/questions/16997048/how-does-one-append-large-amounts-of-data-to-a-pandas-hdfstore-and-get-a-natural – ctrl-alt-delete Nov 28 '15 at 22:00

1 Answer

The default pandas index is [0, 1, 2, ...]. When you say index=None you're really just saying "use the default please."

In [1]: import pandas as pd

In [2]: pd.DataFrame({'x': [10, 20, 30]}, index=None)
Out[2]: 
    x
0  10
1  20
2  30

You might want to keep track of the number of rows already stored and add that value to the index of each new frame before appending it:

In [5]: df = pd.DataFrame({'x': [10, 20, 30]}, index=None)

In [6]: df.index += 3

In [7]: df
Out[7]: 
    x
3  10
4  20
5  30
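
Putting that together, here is a minimal sketch of what an append that continues the numbering might look like. The helper names `shift_index` and `append_continuing` are hypothetical, not part of pandas, and the HDF5 part assumes the node was written in table (appendable) format, as in the question, so that `store.get_storer(key).nrows` reports the rows already on disk. The in-memory lines at the bottom only illustrate the index arithmetic.

```python
import pandas as pd


def shift_index(df, start):
    """Return a copy of df whose integer index runs from `start` upward."""
    out = df.copy()
    out.index = pd.RangeIndex(start, start + len(out))
    return out


def append_continuing(store, key, df):
    """Append df to store[key], continuing the stored row numbering.

    Hypothetical helper: assumes `store` is an open pd.HDFStore and that
    `key`, if present, is a table-format (appendable) node.
    """
    # Rows already stored under `key`, or 0 if the node does not exist yet
    start = store.get_storer(key).nrows if key in store else 0
    store.append(key, shift_index(df, start))


# In-memory illustration of the same index arithmetic (no HDF5 file needed)
first = pd.DataFrame({"values": [0.1, 0.2, 0.3]})
second = pd.DataFrame({"values": [0.4, 0.5, 0.6]})
combined = pd.concat([first, shift_index(second, len(first))])
print(combined.index.tolist())  # [0, 1, 2, 3, 4, 5]
```

With the store from the question, this would be called as `append_continuing(hdf, 'rand_values', new_df)` in place of `new_df.to_hdf(hdf, 'rand_values', append=True)`. If the index values carry no meaning, resetting the index after reading the table back is another option.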
MRocklin
  • The index is repeated even if I don't specify `index=None` while appending via `to_hdf`. So, how can I prevent it? – Kevad Nov 27 '15 at 21:40
  • With `index=False` repeated index still appears. Tested on pandas v1.2.5 on a Win10 x86_64 PC. – Bill Huang Jul 26 '21 at 14:27