1

I have a large number of data frames exported to a series of HDFStore files through Pandas. I need to be able to quickly pull in the most recent record, for each of these dataframes on demand.

The setup:

<class 'pandas.io.pytables.HDFStore'>
File path: /data/storage_X100.hdf
/X1                   frame_table  (typ->appendable,nrows->2652,ncols->1,indexers->[index])
/XX                   frame_table  (typ->appendable,nrows->2652,ncols->3,indexers->[index])
/Y1                   frame_table  (typ->appendable,nrows->2652,ncols->2,indexers->[index])
/YY                   frame_table  (typ->appendable,nrows->2652,ncols->3,indexers->[index])

I am storing roughly 100 data frames in each HDF file, and have around 5000 files to run through. Each of the data frames in the HDFStore are indexed with a DateTimeIndex.

For a single file, I'm currently looping through the HDFStore.keys(), and then querying the dataframe with a tail(1) like so:

store = pandas.HDFStore(filename)
lastrecs = {}
for key in store.keys():
   last = store[key].tail(1)
   lastrecs[key] = last

Is there a better way of doing this, perhaps with HDFStore.select_as_multiple? Even selecting the last record without pulling the entire data frame for a tail would probably speed things up tremendously. How can this be done?

bazel
  • 299
  • 7
  • 20

1 Answers1

4

use start and/or stop to specify a range of rows. You still need to iterate over the keys, but this will just select the last row of a table, so should be very fast.

In [1]: df = DataFrame(np.random.randn(10,5))

In [2]: df.to_hdf('test.h5','df',mode='w',format='table')

In [3]: store = pd.HDFStore('test.h5')

In [4]: store
Out[4]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->10,ncols->5,indexers->[index])

In [5]: nrows = store.get_storer('df').nrows

In [6]: nrows
Out[6]: 10

In [7]: store.select('df',start=nrows-1,stop=nrows)
Out[7]: 
          0        1         2         3         4
9  0.221869 -0.47866  1.456073  0.093266 -0.456778

In [8]: store.close()

Here's a question using nrows (for a different purpose) here

Community
  • 1
  • 1
Jeff
  • 125,376
  • 21
  • 220
  • 187