I want to read and write data to an HDF5 file incrementally, because the data doesn't fit into memory.

The data consists of sets of integers. I only need to read and write the sets sequentially (set1, then set2, then set3, and so on); I don't need random access.

The problem is that I can't retrieve the sets by index.

import pandas as pd
x = pd.HDFStore('test.hf', 'w', append=True)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10,2])
x.append('dframe', b, index=True)
x.close()

x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y = x.select('dframe', start=0, stop=1)
print("selected:", y)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected: 0    1
dtype: int64

It doesn't select my 0th set, which is {1, 10}: I only get the first row back.

dot dot dot

1 Answer

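This works, but I don't know how fast it is. Instead of `start`/`stop`, which count row positions, it selects with a `where` condition on the index, so every row labelled 0 comes back.

Does this scan the whole file to find the rows with that index? That would be quite a waste of time.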

import pandas as pd

x = pd.HDFStore('test.hf', 'w', append=True, format="table", complevel=9)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10,2])
x.append('dframe', b, index=True)
x.close()

x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
# 'index == 0' is a where-clause: it matches every row whose index label is 0
y = x.select('dframe', 'index == 0')
print('selected:')
for i in y:
    print(i)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected:
1
10
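
Since the sets are only ever read in order, a hand-rolled sequential file is one alternative to querying the store. The following is a minimal sketch, not a tuned implementation. It assumes each set fits in memory and every value fits in a signed 64-bit integer. `write_set` and `read_sets` are hypothetical helper names (not part of pandas or PyTables), and the length-prefixed layout is just one possible format.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Set


def write_set(f: BinaryIO, values: Iterable[int]) -> None:
    """Append one set as a uint32 count followed by that many int64 values."""
    items = sorted(values)
    f.write(struct.pack('<I', len(items)))
    f.write(struct.pack(f'<{len(items)}q', *items))


def read_sets(f: BinaryIO) -> Iterator[Set[int]]:
    """Yield the sets one at a time, in the order they were written."""
    while True:
        header = f.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise EOFError('truncated count header')
        (count,) = struct.unpack('<I', header)
        payload = f.read(8 * count)
        if len(payload) != 8 * count:
            raise EOFError('truncated set payload')
        yield set(struct.unpack(f'<{count}q', payload))


# BytesIO stands in for a file opened with open(path, 'wb') / open(path, 'rb').
buf = io.BytesIO()
write_set(buf, {1, 10})
write_set(buf, {2})

buf.seek(0)
restored = list(read_sets(buf))
for s in restored:
    print(s)
```

Only one set is held in memory at a time while reading, and nothing is looked up by index, so there is no query step to pay for. The trade-off is that there is no compression and no random access.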
dot dot dot
  • using `data_columns=True` is a correct approach, but you should also create your HDF store with `table` format: `pd.HDFStore('test.hf', mode='w', format='table', append=True)` – MaxU - stand with Ukraine Mar 25 '17 at 14:22
  • you may want to check [this answer](http://stackoverflow.com/a/41555615/5741205) for some performance testing... – MaxU - stand with Ukraine Mar 25 '17 at 14:24
  • @MaxU 755ms per cycle is just so bad... had to do like 759997 cycles, and I only need to read the sets sequentially instead of random access. If I write my own code for saving/reading sequentially, it can be faster. – dot dot dot Mar 25 '17 at 14:32
  • I'd suggest you open a new question, provide a __reproducible__ sample data set (are you working with series in real life or with data frames?), and explain what you are trying to do. What `cycles` are you talking about - are you sure you need cycles at all? – MaxU - stand with Ukraine Mar 25 '17 at 14:47
  • yup. writing the binary io code now. probably will take half a day. oh, but then it will get moved to code review because it is "code that works as intended" – dot dot dot Mar 25 '17 at 14:48