Pandas HDFStore for out-of-core sequential read/write of sets with variable sizes

Question

I want to read and write data to an HDF5 file incrementally because the data doesn't fit into memory.

The data consists of sets of integers. I only need to read and write the sets sequentially, with no random access: I read set1, then set2, then set3, and so on.

The problem is that I can't retrieve the sets by index.

import pandas as pd
x = pd.HDFStore('test.hf', 'w', append=True)
a = pd.Series([1])
x.append('dframe', a, index=True)
b = pd.Series([10,2])
x.append('dframe', b, index=True)
x.close()

x = pd.HDFStore('test.hf', 'r')
print(x['dframe'])
y=x.select('dframe',start=0,stop=1)
print("selected:", y)
x.close()

Output:

0     1
0    10
1     2
dtype: int64
selected: 0    1
dtype: int64

This doesn't select my 0th set, which is {1, 10} (the rows whose index is 0).

`index=False` http://stackoverflow.com/questions/25714549/indexing-and-data-columns-in-pandas-pytables — dot dot dot, Mar 25 '17 at 14:15
you can simply do this: `y=x.select('dframe',start=0,stop=1+1)` — MaxU - stand with Ukraine, Mar 25 '17 at 14:19
@MaxU. But that means I know that the set has two elements before I read from the file, which is not the case. I don't know the size of the set when I read the file. — dot dot dot, Mar 25 '17 at 14:22
in this case you should use `store.select('dframe', where="...")` as you did in your answer — MaxU - stand with Ukraine, Mar 25 '17 at 14:23

dot dot dot · Answer 1 · 2017-03-25T14:29:25.220

This approach works, but I don't know how fast it is.

Does it scan the whole file to find the rows with a given index?

That would be quite a waste of time.

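import pandas as pd

# Open the store for writing; compression is configured on the store itself.
x = pd.HDFStore('test.hf', mode='w', complevel=9)
a = pd.Series([1])
# format is an argument of append(), not of the HDFStore constructor.
x.append('dframe', a, format='table', index=True)
b = pd.Series([10, 2])
x.append('dframe', b, format='table', index=True)
x.close()

x = pd.HDFStore('test.hf', mode='r')
print(x['dframe'])
# Select every row whose index label equals 0.
y = x.select('dframe', where='index == 0')
print('selected:')
for i in y:
    print(i)
x.close()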

Output:

0     1
0    10
1     2
dtype: int64
selected:
1
10
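
One way to make each set retrievable as a unit is to tag every element with an explicit set identifier and declare that column as a data column. The following is a minimal sketch, not the approach from the answer above. The file name `sets.h5`, the key `sets`, and the column names `set_id` and `value` are assumptions made for illustration. Like any use of `HDFStore`, it requires PyTables.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: two sets of different sizes.
sets = [[1], [10, 2]]
path = os.path.join(tempfile.mkdtemp(), "sets.h5")  # illustrative file name

with pd.HDFStore(path, mode="w") as store:
    for set_id, values in enumerate(sets):
        # One row per element, tagged with the set it belongs to.
        frame = pd.DataFrame({"set_id": set_id, "value": values})
        # data_columns makes set_id usable in `where` expressions.
        store.append("sets", frame, data_columns=["set_id"])

with pd.HDFStore(path, mode="r") as store:
    # Each query returns a whole set without knowing its size in advance.
    first = store.select("sets", where="set_id == 0")["value"].tolist()
    second = store.select("sets", where="set_id == 1")["value"].tolist()

print(first)
print(second)
```

Each `select` call is still a query against the table, so this addresses the "I don't know the size of the set" problem rather than the per-query cost raised in the comments.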

edited Mar 25 '17 at 14:29

answered Mar 25 '17 at 14:07

using `data_columns=True` - is a correct approach, but you should also create your HDF store with `table` format - `pd.HDFStore('test.hf', mode='w', format='table', append=True)` – MaxU - stand with Ukraine Mar 25 '17 at 14:22
you may want to check [this answer](http://stackoverflow.com/a/41555615/5741205) for some performance testing... – MaxU - stand with Ukraine Mar 25 '17 at 14:24
@MaxU 755ms per cycle is just so bad... had to do like 759997 cycles, and I only need to read the sets sequentially instead of random access. If I write my own code for saving/reading sequentially, it can be faster. – dot dot dot Mar 25 '17 at 14:32
I'd suggest you open a new question, provide a __reproducible__ sample data set (are you working with series in real life or with data frames?), and explain what you are trying to do. What `cycles` are you talking about? Are you sure you need cycles at all? – MaxU - stand with Ukraine Mar 25 '17 at 14:47
yup. writing the binary io code now. probably will take half a day. oh, but then it will get moved to code review because it is "code that works as intended" – dot dot dot Mar 25 '17 at 14:48
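
For reference, the hand-written sequential format mentioned in the last comment could look roughly like the sketch below. This is an assumption about one possible layout, not the asker's actual code: each set is stored as a length-prefixed record of little-endian 64-bit integers, and the helper names `write_set` and `read_sets` are hypothetical. It uses only the standard library.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, List


def write_set(fh: BinaryIO, values: Iterable[int]) -> None:
    """Append one set as: element count (uint64), then the elements (int64)."""
    items = list(values)
    fh.write(struct.pack("<Q", len(items)))
    fh.write(struct.pack(f"<{len(items)}q", *items))


def read_sets(fh: BinaryIO) -> Iterator[List[int]]:
    """Yield the stored sets one at a time, in the order they were written."""
    while True:
        header = fh.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise EOFError("truncated record header")
        (count,) = struct.unpack("<Q", header)
        payload = fh.read(8 * count)
        if len(payload) != 8 * count:
            raise EOFError("truncated record payload")
        yield list(struct.unpack(f"<{count}q", payload))


# Round trip through an in-memory buffer; a real file opened in binary
# mode ("wb" / "rb") would be used the same way.
buffer = io.BytesIO()
write_set(buffer, [1])
write_set(buffer, [10, 2])
buffer.seek(0)
recovered = list(read_sets(buffer))
print(recovered)
```

Because the reader is a generator, only one set is held in memory at a time, which matches the sequential, out-of-core access pattern described in the question.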
