
I have some large flat files of data (200 GB is typical) that I would like to store in some kind of database so that they can be accessed quickly and in a way that matches how the data is logically organized. Think of them as large sets of very long audio recordings, where each recording is the same length (in samples) and can be thought of as a row. One of these files normally holds about 100,000 recordings, each 2,000,000 samples long.

It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?

Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.

EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.

EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
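
For concreteness, here is a rough sketch of the column-wise read I'd like to do efficiently (Python just to illustrate the access pattern; the file name and layout constants are mine, assuming the raw files really are row-major with one byte per sample):

```python
import numpy as np

ROWS = 100_000        # recordings per file
COLS = 2_000_000      # samples per recording, 1 byte each

def read_sample_range(path, first, last):
    """Naively read samples [first, last) from every recording in one flat file."""
    width = last - first
    out = np.empty((ROWS, width), dtype=np.uint8)
    with open(path, "rb") as f:
        for row in range(ROWS):
            f.seek(row * COLS + first)   # byte offset of this row's slice
            out[row] = np.frombuffer(f.read(width), dtype=np.uint8)
    return out

# Samples 1,000-2,000 of every recording: 100,000 x 1,000 bytes = ~100 MB in RAM,
# but also 100,000 separate seeks/reads -- hence the question.
# chunk = read_sample_range("set_0001.dat", 1_000, 2_000)
```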

Eric
  • Not sure whether this should be a comment or an answer, but the related questions http://stackoverflow.com/questions/7963656/best-practices-for-mixed-usage-of-rdbms-and-files-on-filesystem and http://stackoverflow.com/questions/8952/storing-a-file-in-a-database-as-opposed-to-the-file-system may offer some enlightenment. – Mark Bannister Dec 29 '11 at 16:41
  • How would the data be accessed? Are you writing a desktop app or a web site, or planning to use Excel? What metadata exists about the files and samples? Is it consistent in schema, i.e. does each file and sample have similar fields? – Neville Kuyt Dec 29 '11 at 17:07
  • @MarkBannister - The answers to those questions are somewhat helpful for the general case of storing BLOB data in databases; however, my wish to access the data column-wise might make my case different. Maybe not, though. – Eric Dec 29 '11 at 20:59

4 Answers


DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.

My recommendation would be to store the files on disk, but create a DB-based index. Most filesystems get cranky or slow when you have > 10k files in a single folder/directory. Your application can generate the filenames and store the metadata in the DB, then organize the files by the generated names on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins or a sophisticated partitioning/incremental backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.), and filesystems generally support these operations better than a DB does.
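
For the "seek within the file" part, something like this is all it takes once the raw file is on disk (a minimal sketch in Python/numpy, assuming the one-byte-per-sample, row-major layout described in the question; the file name is made up):

```python
import numpy as np

ROWS, COLS = 100_000, 2_000_000   # recordings x samples, 1 byte per sample

# Map the flat file without loading it; slicing reads only the pages
# that actually cover the requested byte ranges.
data = np.memmap("set_0001.dat", dtype=np.uint8, mode="r", shape=(ROWS, COLS))

# Samples 1,000-2,000 of every recording (~100 MB), copied into RAM:
chunk = np.array(data[:, 1_000:2_000])
```

The DB then only has to map recording IDs and their metadata to the generated file names.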

saarp

I wonder what makes you think that an RDBMS would be limited to a mere few thousand rows; there is no reason that would be the case.

Also, at least some databases (Oracle, for example) allow direct access to parts of LOB data without loading the full LOB, as long as you know the offset and length you want. So you could have a table with some searchable metadata plus the LOB column, and, if needed, an additional metadata table describing the LOB contents, so that you'd have some kind of keyword -> (offset, length) relation available for partial loading of LOBs.
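
As a rough illustration of a partial LOB read from client code (Python with the cx_Oracle driver; the connection string, table and column names are invented, and it assumes one BLOB per recording with one byte per sample):

```python
import cx_Oracle

conn = cx_Oracle.connect("user/password@dbhost/service")
cur = conn.cursor()

first, width = 1_000, 1_000   # samples 1,000-2,000, one byte each
cur.execute("SELECT samples FROM recordings ORDER BY recording_id")
for (lob,) in cur:
    piece = lob.read(first + 1, width)   # LOB offsets are 1-based
    # ... process the 1,000-byte slice of this recording ...
```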

Somewhat echoing another answer here: incremental backups (which you might well want with data like this) are not really practical with databases; they can be done, but at least in my experience they tend to have a nasty price tag attached.

Juha Laiho

How big is each sample, and how big is each recording? Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)

If it is 2 million samples making up 200 GB, then each sample is ~100 KB and each recording is ~2 MB (with 100,000 recordings per file, that is 20 samples per recording)?

That seems like a very reasonable size to put in a row in a DB rather than a file on disk.

As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
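
Something along these lines, just to illustrate (sqlite3 here purely for brevity; the schema is made up and assumes a modest number of largish samples per recording, per the reading above):

```python
import sqlite3

conn = sqlite3.connect("recordings.db")
conn.execute("""CREATE TABLE IF NOT EXISTS samples (
                    recording_id INTEGER,
                    sample_id    INTEGER,
                    data         BLOB)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_sample_id ON samples (sample_id)")

# Load only the wanted range of samples; the index keeps this fast.
rows = conn.execute(
    "SELECT recording_id, sample_id, data FROM samples "
    "WHERE sample_id BETWEEN 1000 AND 2000").fetchall()
```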

Andrew Kuklewicz
  • Sorry, I probably put too many words in there. The last sentence of the first paragraph should be read as: "One file has 100,000 recordings of 2,000,000 samples each." – Eric Dec 29 '11 at 21:25

I think Microsoft SQL Server does what you need with the varbinary(MAX) column type when used in conjunction with FILESTREAM storage.

Have a read on TechNet for more depth: http://technet.microsoft.com/en-us/library/bb933993.aspx

Basically, you store any descriptive fields normally in your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
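
A sketch of what the table side might look like (T-SQL issued from Python via pyodbc purely for illustration; the connection string, table and column names are invented, and it assumes FILESTREAM is already enabled on the instance with a FILESTREAM filegroup in the database):

```python
import pyodbc

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=Recordings;Trusted_Connection=yes")
cur = conn.cursor()

# Ordinary metadata columns live in the row; the BLOB bytes live in NTFS.
cur.execute("""
    CREATE TABLE dbo.Recording (
        RecordingId UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        CollectedOn DATETIME2,
        SampleRate  INT,
        Samples     VARBINARY(MAX) FILESTREAM
    )""")
conn.commit()
```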

Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)

frozenjim