I know there have been questions about this in the past, such as here and here, but I haven't really seen anything addressing very large datasets.
So I have a large amount of structured data (information about streams across the United States; each stream has time series data, and every stream is identified by a unique ID). The data is currently stored in NetCDF files, which are split into smaller segments to avoid one huge file. If we want access to the data for just one stream (because not many people want to see all 2.7 million streams simultaneously), we have to loop through every NetCDF file and extract the data for that one stream, roughly as in the sketch below. We have also built a REST API (django-rest-framework) that does the same thing for anyone calling the endpoints.
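For context, this is a minimal sketch of the current per-stream extraction, assuming the netCDF4 library and placeholder variable names ("station_id", "streamflow", "time") and file layout, not the exact names in our files:

```python
import glob
import numpy as np
from netCDF4 import Dataset  # pip install netCDF4

def extract_stream(stream_id, pattern="data/segment_*.nc"):
    """Pull one stream's time series by scanning every NetCDF segment.

    Variable names and the (station, time) dimension order are placeholders
    for whatever the real files use.
    """
    times, values = [], []
    for path in sorted(glob.glob(pattern)):
        with Dataset(path) as nc:
            ids = nc.variables["station_id"][:]
            idx = np.nonzero(ids == stream_id)[0]
            if idx.size == 0:
                continue  # this segment holds no data for the stream
            times.append(nc.variables["time"][:])
            values.append(nc.variables["streamflow"][idx[0], :])
    return np.concatenate(times), np.concatenate(values)
```

The point is that every request pays the cost of opening and scanning every segment, even though only one stream's slice is returned.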
I feel there must be a more efficient way to do this. I have considered loading all of the data into a database, but my concern is that this might actually be slower than looping through the files, since putting everything in one place would take multiple terabytes of disk space. I was reading this article about MongoDB, and it seems their products could help solve this problem. My questions are: will storing all of this data in a database actually save time when retrieving data, and how difficult would it be to implement?
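To make the comparison concrete, here is a hedged sketch of what an indexed lookup might look like if the data lived in MongoDB. The database, collection, and field names ("streams_db", "timeseries", "stream_id", "time", "value") are assumptions for illustration, not an existing schema:

```python
from pymongo import MongoClient, ASCENDING  # pip install pymongo

# Hypothetical layout: one document per stream per time step (or chunk),
# keyed by the stream's unique ID.
client = MongoClient("mongodb://localhost:27017")
coll = client["streams_db"]["timeseries"]

# A one-time compound index lets the server jump straight to the matching
# documents instead of scanning every record on disk.
coll.create_index([("stream_id", ASCENDING), ("time", ASCENDING)])

def get_stream(stream_id):
    """Return all (time, value) pairs for a single stream via the index."""
    cursor = coll.find({"stream_id": stream_id},
                       {"_id": 0, "time": 1, "value": 1})
    return [(doc["time"], doc["value"]) for doc in cursor]
```

This is what I imagine the REST endpoint could call instead of looping over files; what I don't know is whether the index lookup would beat the file scan in practice given the total data volume.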