
I have a daily time series of ~1.5 million rows per day, a 4-dimensional index, and 2 columns. So far I've put all of this into one DataFrame and shoved it into a single group in an HDFStore. The problem is that continuously appending to this very large frame has become uber slow, and I'm wondering whether I should just create one group per day, and whether that would speed up appends as well as reads. Many thanks for the help!
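For context, here's a minimal sketch of the setup described above (the file name, index levels, and column names are made up for illustration):

    import pandas as pd

    # stand-in for one day's batch: ~1.5M rows, a 4-level index, 2 columns
    day = pd.DataFrame(
        {"price": [1.0, 2.0], "size": [100, 200]},
        index=pd.MultiIndex.from_tuples(
            [("2015-07-17", "NYSE", "AAPL", 1), ("2015-07-17", "NYSE", "MSFT", 2)],
            names=["date", "exchange", "symbol", "seq"],
        ),
    )

    # current approach: every day appended to one big table in a single group
    with pd.HDFStore("store.h5") as store:
        store.append("all_days", day)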

user2734178
  • this post should help http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas – Paul H Jul 17 '15 at 21:01
  • Why not place each day's data in a separate (HDF) file? – Ami Tavory Jul 17 '15 at 21:31
  • There's a lot of overhead in writing an HDF file, so many HDF files end up larger than one HDF file comprising the same data. But I don't know what the speed-up would look like, so this could still be a good way to go – user2734178 Jul 17 '15 at 21:48
  • Thanks @PaulH, but that post doesn't really justify the choices so much as it provides a recipe, unless I missed something. – user2734178 Jul 17 '15 at 21:50
  • I don't think you missed anything. I wasn't trying to answer your question, but instead just expose you to some work flows people are successfully using. – Paul H Jul 17 '15 at 21:52
  • Basic question: is there a limit to the number of groups I can have in an HDFStore class? – user2734178 Jul 17 '15 at 23:55

1 Answer


The docs say that you can have 16,384 children in one group. With one group per day, that gives you more than 44 years of data (16384 / 365 ≈ 44.9). You could even increase this number if necessary, although the docs warn that a larger number can have unwanted performance and storage impacts.
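If you ever approach that limit, PyTables exposes it as a tunable global parameter. A sketch, assuming you set it before opening the file; as far as I know, exceeding the default only triggers a PerformanceWarning rather than a hard error:

    import tables

    # default limit on the number of children per group (16384 at the time of writing)
    print(tables.parameters.MAX_GROUP_WIDTH)

    # raising it trades the documented limit for the possible
    # performance and storage impacts mentioned in the warning
    tables.parameters.MAX_GROUP_WIDTH = 65536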

I have worked with a file with 15,000+ groups in the root and it worked out nicely. I think the one-group-per-day approach is better when you later need to access one day at a time. Searching for something across all days could be much slower, though. You need to try this out.
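As a rough illustration, a minimal sketch of the one-group-per-day layout (the file name, key scheme, and stand-in data are all hypothetical):

    import pandas as pd

    day = pd.DataFrame({"price": [1.0], "size": [100]})  # stand-in for one day's data

    def day_key(date):
        # one group per day, e.g. "/d20150717"; the "d" prefix keeps the
        # node name a valid identifier (names starting with a digit trigger
        # a NaturalNameWarning in PyTables)
        return "d" + pd.Timestamp(date).strftime("%Y%m%d")

    with pd.HDFStore("store.h5") as store:
        # writing a day only touches that day's (comparatively small) table
        store.append(day_key("2015-07-17"), day)

        # reading one day back is a direct lookup
        one_day = store[day_key("2015-07-17")]

        # a search across all days has to walk every group, hence slower
        all_days = pd.concat(store[key] for key in store.keys())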

Depending on your use case, you could also create one group per year, one subgroup per month, and one table per day of the month. This can be helpful if somebody wants to have a look at the data in a graphical tool such as ViTables. On the other hand, it might complicate some of your processing steps later on.
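A sketch of that hierarchical variant, again with hypothetical naming; HDFStore treats slash-separated keys as nested groups:

    import pandas as pd

    day = pd.DataFrame({"price": [1.0], "size": [100]})  # stand-in for one day's data

    def hierarchical_key(date):
        ts = pd.Timestamp(date)
        # e.g. "/y2015/m07/d17": group per year, subgroup per month, table per day
        return "y{:04d}/m{:02d}/d{:02d}".format(ts.year, ts.month, ts.day)

    with pd.HDFStore("store.h5") as store:
        store.append(hierarchical_key("2015-07-17"), day)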

Mike Müller