
I have many tables (about 200K of them), each small (typically fewer than 1K rows and 10 columns), that I need to read as fast as possible in pandas. The use case is fairly typical: a function loads these tables one at a time, computes something on them and stores the final result (without keeping the content of the table in memory).

This is done many times over and I can choose the storage format for these tables for best (speed) performance. What natively supported storage format would be the quickest?
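Roughly, the access pattern looks like the sketch below (the CSV directory and the per-table computation are just placeholders; the real tables could be stored in whatever format turns out to be fastest):

```python
import glob
import time

import pandas as pd

start = time.perf_counter()
results = []
for path in glob.glob('tables/*.csv'):      # ~200K small files
    df = pd.read_csv(path)                  # the call I want to make as fast as possible
    # stand-in for the real computation on the table
    results.append(df.select_dtypes(include='number').sum().sum())
print('processed %d tables in %.1fs' % (len(results), time.perf_counter() - start))
```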

user189035
    I think [hdf5](http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables) is best. – jezrael Oct 12 '16 at 12:27
  • Intuitively, I would agree with @jezrael. In the end it might depend on your data, so I would just try it out and benchmark it properly. – cel Oct 12 '16 at 12:29

1 Answer


IMO there are a few options in this case:

  1. use HDF Store (AKA PyTables, H5) as @jezrael has already suggested. You can decide whether you want to group some/all of your tables and store them in the same .h5 file using different identifiers (or keys, in Pandas terminology) – see the sketch after this list.

  2. use the new and extremely fast Feather format (part of the Apache Arrow project). NOTE: it's still a fairly new format, so it might change in the future, which could lead to incompatibilities between different versions of the feather-format module. You also can't put multiple DFs in one feather file, so you can't group them.

  3. use a database for storing/reading tables. P.S. it might be slower for your use case.
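A minimal sketch of what options 1 and 2 (plus a SQLite variant of 3) could look like; the file names, keys and the sample frame are placeholders, and Feather support needs the pyarrow package (on older pandas, the feather-format module's `feather.write_dataframe`/`feather.read_dataframe` is the equivalent):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'a': range(5), 'b': range(5)})   # stand-in for one small table

# 1. HDF store: many small tables can live in one .h5 file under different keys
df.to_hdf('tables.h5', key='loc_001/day_2016_10_12', mode='a')
back = pd.read_hdf('tables.h5', key='loc_001/day_2016_10_12')

# 2. Feather: one DataFrame per file, very fast to read back
df.to_feather('table_0001.feather')
back = pd.read_feather('table_0001.feather')

# 3. SQLite via pandas (likely slower for this use case, shown for completeness)
con = sqlite3.connect('tables.db')
df.to_sql('table_0001', con, index=False, if_exists='replace')
back = pd.read_sql('SELECT * FROM table_0001', con)
con.close()
```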

P.S. You may also want to check this comparison, especially if you want to store your data in a compressed format.

MaxU - stand with Ukraine
  • Thanks! Is there an advantage in combining the tables in the h5 format, given that they are not going to be used together (= in the same function scope)? – user189035 Oct 12 '16 at 19:28
  • @user189035, one clear advantage is fewer files. But if you can't group them into __small__ groups (for example: by date/month/customer-id/etc.), then you shouldn't use it for bigger groups, as such H5 files grow very fast and it's very difficult to reduce their size ... – MaxU - stand with Ukraine Oct 12 '16 at 19:44
  • There is one file for each location/day. I load each file to obtain a set of numbers (using a linear program). I can access them by either location or day (the computations are done independently on each file). So your advice is to group all these small files by either location (200 of them) or day (5000), and that doing so will reduce loading times? – user189035 Oct 12 '16 at 19:57
  • I would try to keep smaller groups (10-50), but the best option would be to test it for some time (1-2 months) and observe the H5 files and their sizes... – MaxU - stand with Ukraine Oct 12 '16 at 20:01
  • Yes, I have no direct access to those files currently: checking the available tools ;) Thanks for your views – user189035 Oct 12 '16 at 20:02