I'm in the process of redesigning the general workflow of my lab, and am coming up against a conceptual roadblock that is largely due to my lack of knowledge in this area. Our data is currently organized in a typical file system structure along the lines of:
Date\Cell #\Sweep #
where for a given date there are generally multiple Cell folders, and within each Cell folder there are multiple Sweep files (relatively simple .csv files, with the recording parameters saved separately in .xml files). So any Date folder may contain anywhere from a few tens to several hundred files for that day's recordings, organized into multiple Cell subdirectories.
Our workflow typically involves opening multiple sweep files within a Cell folder, averaging them, and then averaging those results with data from other Cell folders, often across multiple days.
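For context, a stripped-down version of that workflow looks roughly like this (the paths, folder names, and file layout here are placeholders):

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: <root>/<Date>/<Cell #>/<Sweep #>.csv
root = Path(r"\\lab-server\data")            # placeholder server path
cell_dir = root / "2023-01-15" / "Cell_01"   # one Cell folder for one day

# Load every sweep in the Cell folder and average them sample-by-sample
# (assumes the sweeps share a common sample index / length)
sweeps = [pd.read_csv(f) for f in sorted(cell_dir.glob("*.csv"))]
cell_mean = pd.concat(sweeps).groupby(level=0).mean()

# The per-cell means from other Cell folders (and other days) are then
# averaged together in the same way.
```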
This is relatively straightforward to do with Pandas and NumPy, although there is a certain “manual” feel to it when remotely accessing folders on the lab server. We also occasionally run into problems because we often have to pull in data from many of these files at once. While this usually isn’t an issue, the files can range from a few MB to several GB in size. In the latter case we have to take steps to avoid loading an entire file into memory (or at the very least avoid loading multiple files at once) to prevent memory issues.
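As an example of the kind of step we take for the very large files, a chunked read with Pandas lets us compute an average without ever holding the whole file in memory (a minimal sketch; the filename and chunk size are placeholders):

```python
import pandas as pd

# Running per-column mean of one large sweep file, reading it in chunks so
# only one chunk is ever held in memory at a time
totals = None
n_rows = 0
for chunk in pd.read_csv("big_sweep.csv", chunksize=1_000_000):
    totals = chunk.sum() if totals is None else totals + chunk.sum()
    n_rows += len(chunk)

column_means = totals / n_rows
```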
As part of this redesign I have been reading about PyTables, both for data organization and for accessing data sets that may be too large to fit in memory. So I guess my two main questions are:
- If the out-of-memory issues aren’t significant (i.e., that capability wouldn’t be used often), are there any significant advantages to using something like PyTables for data organization over simply maintaining a file system on a server (or locally)?
- Is there any reason NOT to go the PyTables database route? We are redesigning our data collection as well as our storage, and one option is to collect the data directly into Pandas DataFrames and save the files in HDF5 format (sketched below). I’m currently weighing the cost/benefit of doing this versus the current system, where the data are stored as .csv files and then loaded into Pandas for analysis later on.
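To make that second option concrete, here is a minimal sketch of what collecting into a DataFrame and saving to HDF5 might look like (the file name, key layout, and column names are all hypothetical):

```python
import pandas as pd

# Hypothetical sweep collected straight into a DataFrame
sweep_df = pd.DataFrame({"time_s": [0.000, 0.001, 0.002],
                         "voltage_mV": [-70.1, -69.8, -70.3]})

# Write it to an HDF5 store instead of a .csv. One node per sweep mirrors the
# current folder tree; format="table" keeps it queryable, and complib/complevel
# turn on the on-disk compression mentioned below.
sweep_df.to_hdf(
    "recordings.h5",
    key="d2023_01_15/cell_01/sweep_001",   # key layout is hypothetical
    format="table",
    complib="blosc",
    complevel=9,
)

# Reading it back later is a single call
recovered = pd.read_hdf("recordings.h5", key="d2023_01_15/cell_01/sweep_001")
```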
My thinking is that, compared with the filesystem we currently have, a database might 1. reduce file size on disk (somewhat, anyway) through the compression that HDF5 offers, and 2. make accessing data easier overall because of the ability to query based on different parameters. My concern with 2 is that, since we’re usually just opening an entire file, we wouldn’t actually use that functionality very much; we’d essentially be performing the same steps we would need to perform to open a file (or a series of files) in a file system. Which makes me wonder whether the upfront effort this would require is worth it for our overall workflow.
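For reference, the kind of querying I have in mind for point 2 would look something like this (again just a sketch, with hypothetical names):

```python
import pandas as pd

# If sweeps were instead appended into a single table-format store with
# metadata columns (date, cell, sweep; names hypothetical, and they would need
# to be declared as data_columns when written), subsets could be pulled out
# with a query rather than by walking folders:
with pd.HDFStore("recordings.h5") as store:
    subset = store.select(
        "sweeps",
        where="date='2023-01-15' & cell='Cell_01'",
        columns=["time_s", "voltage_mV", "sweep"],
    )
```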