
I'm currently in the process of trying to redesign the general workflow of my lab, and am coming up against a conceptual roadblock that is largely due to my general lack of knowledge in this subject. Our data is currently organized in a typical file system structure along the lines of:

Date\Cell #\Sweep #

where for a specific date there are generally multiple Cell folders, and within those Cell folders there are multiple Sweep files (these are relatively simple .csv files, with the recording parameters saved separately in .xml files). So within any Date folder there may be a few tens to several hundred files for that day's recordings, organized across multiple Cell subdirectories.

Our workflow typically involves opening multiple sweep files within a Cell folder, averaging them, and then averaging those with data points from other Cell folders, often across multiple days.
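
For concreteness, our current approach looks roughly like the sketch below. The server path, folder glob patterns, and the assumption that every sweep .csv shares the same row index (sample number) are all hypothetical:

```python
from pathlib import Path
import pandas as pd

def average_cell(cell_dir):
    """Average all sweep files within one Cell folder."""
    sweeps = [pd.read_csv(f) for f in sorted(cell_dir.glob("*.csv"))]
    # Stack the sweeps and average rows that share the same sample index
    return pd.concat(sweeps).groupby(level=0).mean()

date_dir = Path(r"\\lab-server\data\2014-08-04")  # hypothetical server path
cell_means = [average_cell(d) for d in sorted(date_dir.glob("cell_*"))]

# Grand average across cells (and, with more dates, across days)
grand_mean = pd.concat(cell_means).groupby(level=0).mean()
```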

This is relatively straightforward to do with Pandas and NumPy, although there is a certain "manual" feel to it when remotely accessing folders saved to the lab server. We also occasionally run into issues because we often have to pull in data from many of these files at once. While this usually isn't a problem, the files can range from a few MB to thousands of MB in size; in the latter case we have to take steps to avoid loading the entire file into memory (or, at the very least, to avoid loading multiple files at once) to prevent memory issues.
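
For the really large files we fall back on chunked reads so that only part of the file is in memory at a time; something like this (the file and column names are made up):

```python
import pandas as pd

# Stream a multi-GB sweep file in pieces instead of loading it whole.
# "current_pA" is a placeholder column name.
total, count = 0.0, 0
for chunk in pd.read_csv("big_sweep.csv", chunksize=1000000):
    total += chunk["current_pA"].sum()
    count += len(chunk)
mean_current = total / count
```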

As part of this redesign I have been reading about Pytables for data organization and for accessing data sets that may be too large to fit in memory. So I guess my two main questions are:

  1. If the out-of-memory issues aren't significant (i.e. that capability wouldn't be used often), are there any significant advantages to using something like Pytables for data organization over simply maintaining a file system on a server (or locally)?
  2. Is there any reason NOT to go the Pytables database route? We are redesigning our data collection as well as our storage, and one option is to collect the data directly into Pandas dataframes and save the files in the HDF5 format (see the sketch just below this list). I'm currently weighing the cost/benefit of doing this over the current system of storing the data in .csv files and then loading it into Pandas for analysis later on.
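
To make option 2 concrete, what I have in mind is something like the following; the file name, key names, and columns are all placeholders:

```python
import numpy as np
import pandas as pd

# One sweep collected straight into a DataFrame
df = pd.DataFrame({"time_s": np.arange(0.0, 1.0, 1e-3),
                   "current_pA": np.random.randn(1000)})

# Save into a compressed HDF5 container instead of a .csv.
# format="table" plus data_columns=True keeps the columns queryable later.
df.to_hdf("2014-08-04.h5", "/cell_01/sweep_003",
          format="table", data_columns=True,
          complevel=9, complib="blosc")

# Loading it back into Pandas later is a one-liner
sweep = pd.read_hdf("2014-08-04.h5", "/cell_01/sweep_003")
```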

My thinking is that by creating a database instead of the filesystem we currently have, we may 1. be able to reduce file size on disk (somewhat, anyway) through the compression that HDF5 offers, and 2. find that accessing data becomes easier overall because of the ability to query based on different parameters. But my concern for 2 is that, since we're usually just opening an entire file, we won't be utilizing that functionality all that much: we'd basically be performing the same steps that we would need to perform to open a file (or a series of files) within a file system. Which makes me wonder whether the upfront effort this would require is worth it in terms of our overall workflow.
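
The kind of querying I mean in point 2 would be something like this (again with placeholder names), which pulls back only the matching rows rather than reading the whole file:

```python
import pandas as pd

# With a "table"-format store written with data_columns=True, a where=
# expression selects rows inside the file, without loading it all into memory.
subset = pd.read_hdf("2014-08-04.h5", "/cell_01/sweep_003",
                     where="time_s > 0.2 & time_s < 0.4")
```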

– dan_g

1 Answer


First of all, I am a big fan of Pytables, because it helped me manage huge data files (20 GB or more per file), which I think is where Pytables plays to its strong points (fast access, built-in querying, etc.). If the system is also used for archiving, the compression capabilities of HDF5 will reduce space requirements and reduce network load for transfers.

I do not think that 'reproducing' your file system inside an HDF5 file has advantages (happy to be told I'm wrong on this). I would suggest a hybrid approach: keep the normal filesystem structure and put the experimental data in HDF5 containers with all the metadata. This way you keep the flexibility of your normal filesystem (access rights, copying, etc.) and can still harness the power of Pytables for the bigger files where memory is an issue. Pulling the data from HDF5 into normal pandas or numpy is very cheap, so your 'normal' workflow shouldn't suffer.
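
A minimal sketch of what I mean by the hybrid approach, assuming one HDF5 container per cell sitting in your existing folder hierarchy, with parameters parsed out of your .xml files; all names here are made up:

```python
import numpy as np
import pandas as pd

# Placeholder sweep data
df = pd.DataFrame({"time_s": np.arange(0.0, 1.0, 1e-3),
                   "current_pA": np.random.randn(1000)})

# One .h5 container per cell, kept in the normal filesystem structure.
with pd.HDFStore("cell_01.h5", complevel=9, complib="blosc") as store:
    store.put("sweep_003", df, format="table")
    # Attach the recording parameters (from the .xml) as metadata
    store.get_storer("sweep_003").attrs.recording_params = {
        "sampling_rate_hz": 20000, "gain": 10}

# Later: pulling it back into plain pandas/numpy is cheap
with pd.HDFStore("cell_01.h5") as store:
    sweep = store["sweep_003"]
    params = store.get_storer("sweep_003").attrs.recording_params
```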

– Ben K.
  • I think that is the basis of my conundrum here – is there or is there not any real benefit to reproducing my current filesystem, as it exists on our server, as a Pytables database? I do think that switching over to HDF5 and combining the metadata and raw numbers is the best route, regardless of whether we pursue some type of database or not. – dan_g Aug 04 '14 at 16:03
  • I would say no to that. Maybe this question [here](http://stackoverflow.com/questions/22125778/how-is-hdf5-different-from-a-folder-with-files) helps you decide. – Ben K. Aug 04 '14 at 17:51