19

Background:

We have an in house document storage system that was implemented long ago. For whatever reason, using the database as the storage mechanism for the documents was chosen.

My question is this:

What is the best practice for storing documents? What are the alternatives? What are the pros and cons? Answers do not have to be technology or platform specific, it is more of a general best practice question.

My Thoughts:

Databases are not meant for document storage. File Systems or 3rd party Document Management systems may be of better use. Document Storage in Databases is expensive. Operations are slow. Are these logic assumptions? Perhaps this is best, but in my mind, we have better alternatives. Could oracle BFILE's (links to document on NAS or SAN) be better than BLOB / CLOB?

Details:

  • Documents are various types (pdf, word, xml)
  • The Middle Tier code is written in .net 2.0 / c#
  • Documents are stored in a Oracle 10g database in BLOB with compression (NAS Storage)
  • File sizes rage
  • The number of document is growing drastically and has no signs of slowing down
  • Inserts is typically is in the hunderds per hour during peak
  • Retreival is typically in the thousands per hour during peak
  • NAS storage and SAN storage is available

UPDATE (from questions below):

  • my background is development
  • there is associated meta-data about the files stored next to file in the database
Mike Ohlsen
  • 1,900
  • 12
  • 21
  • Do you require versioning, auditing, or complicated security structures? Do you need to associate meta-data with each file? – Bravax Feb 04 '09 at 17:08
  • You might want to check out http://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay, that question pertains to images in a database, but some answers may be applicable. – James McMahon May 13 '09 at 22:45

13 Answers13

14

Based on my experience I'd say keep them in the database. We've moved two of our systems to doing this.

Putting it in the database means:

  • It's easy to access, even from multiple servers
  • It's backed up automatically (instead of having to have a separate job to do that)
  • You don't have to worry about space (since people keep the DB from overfilling the disk, but may forget to monitor where the documents are stored)
  • You don't have to have a complicated directory scheme

We had documents out of the database. It becomes a problem with lots of documents. A normal directory in Linux is one block, which is usually 4K. We had a directory that was 58MB because it had so many files in it (it was just a flat directory, no hierarchy). It had that many indirect blocks. It took over an hour to delete. It took minutes to get a count of the number of files in the directory. It was abysmal. This is on ext3.

With the filesystem you need:

  • Separate backup mechanism (from the DB backup)
  • To keep things in sync (so the record doesn't exist in the DB without the file being there)
  • A hierarchy for storage (to prevent the problem listed above, so no directory ends up with 10,000s of files)
  • Some way to view them from other servers if you need a cluster (so probably NFS or some such)

It's really a pain. For any non-trivial number of documents, I'd recommend against the file system based on what I've seen.

MBCook
  • 14,424
  • 7
  • 37
  • 41
  • 1
    +1 good arguments for DB storage. Now we just need a similar quality answer for the filesystem approach. :-) – Darron Feb 04 '09 at 17:17
  • Thanks. Like I said it's been a bit of a nightmare for us (we can't delete the directory without downtime!) Most people seem to like the FS approach, and if it was designed well it would work (we wouldn't run into the problems we did). But ours wasn't designed for so many documents. – MBCook Feb 04 '09 at 21:49
  • I don't have any problem with using a DB for file storage. But I could only consider doing this if I had total commitment from the team to ONLY store documents in the database, and to remove the documents from wherever else they happened to be. But you're actually creating a document management system. Aren't there any DMS's already out there? – Alan McBee Feb 04 '12 at 17:52
11

I prefer to store the document in the file system and then store a link to the file and associated file meta-data in the database.

It has proven more convenient, easier to maintain, and less expensive than the alternative.

Galwegian
  • 41,475
  • 16
  • 112
  • 158
  • Agreed. As long as the backup is similar/same to the db backup. Robust and friendly. Also, a good folder structure makes it really easy for techs to look through. – Stu Andrews Feb 04 '09 at 22:11
  • This answer is unsupported. Why is it so high rated? It's not terrible but nothing special either. – Joe Soul-bringer Feb 05 '09 at 04:05
  • How do you handle the situation with tens or thousands of documents on a file system, specifically in a flat structure? – RyanW Feb 01 '11 at 03:18
  • I prefer this answer. I am not sure about cost, but the reason I upvoted is that I'm introducing a centralized catalog into a moving team that already has a large number of documents in a variety of places. There's no practical way we can move (delete from original location) all those documents into any new repository. Plus, there are lots of great document management systems out there already to manage access and workflow; why would you want to roll your own? All you really need is centralized discovery, not centralized storage. – Alan McBee Feb 04 '12 at 17:49
8

Most enterprise-class document management systems do NOT store the object file in the database. Just because you can doesn't mean you should. If scalability and performance are important to you and you have a large document set you need to be very careful about storing the objects in the db. Consider the following:

In the case case of document imaging, 200 million TIFF files can be considered a relatively large, but not massive, system. Larger-scale systems can have over 1 billion object files. At, say, 20KB per bitonal TIFF you could have 4TB of object file storage. How long are your DB backups going to take? How long are your queries going to take? What is the frequency of access for these objects? If these objects have a high access frequency, do you want your high-end DB server spending all its time serving up files? If you have millions of objects then you need to be pretty darn careful about how you architect a solution where the objects are stored in the db.

Suppose that you are now tasked with converting those 200M TIFF files to PDF files. Be prepared to bring your solution to its knees as your database server wastes its time serving up each and every object file to the conversion process and then re-saving the results.

Just as an example, Sharepoint is famous for storing objects in the db. Sharepoint is also famous for scalability issues.

My answer:
For small systems (< 1M files) storing files in the DB can be considered. For large systems (> 1M files) storing files in the DB is a mistake.

Brian
  • 361
  • 2
  • 7
  • What are the best practices for storing >1 M files at filesystem level? Are there production-hardened solutions one can use without reinventing the wheel and avoiding common pitfalls? – yagooar Jul 01 '14 at 20:16
6

My biggest concern with storing the files in the database itself is managing the size and complexity of backups and other db maintenance operations.

One strategy to mitigate this difficulty (at least in MS SQL) is to create separate database partitions, potentially stored on different drives.

Then separate your data schema so that your metadata about the files are located on one partition, and the actual BLOB files are located in a separate partition.

These partitions can be backed up on different schedules, or even recovered separately.

BradC
  • 39,306
  • 13
  • 73
  • 89
  • +1 on creating a separate file group for image / BLOB data types – DJ. Feb 04 '09 at 18:09
  • Yes, I've seen exactly this issue. How does the back up / recovery solution for the separated partition differ and how in practical terms has it made the problem easier? – Simon Gibbs Aug 01 '09 at 12:38
  • Dividing the partitions the way I've outlined above would allow you to do a restore of the *metadata* (if a problem occurs), without having to do a restore of all the huge files. You'd still have a problem trying to recover individual files, though, because you can't restore just a single *row* of a table; you'd have to restore an entire partition (without 3rd party tools like Quest Lightspeed). – BradC Aug 03 '09 at 13:42
5

The only limit to storing documents in the database is technological.

A relation database is meant to be the persistent store of the mission critical data of an enterprise. How well it can perform that function varies from database to database and system to system, of course. But ideally the ACID properties of a relational database are intended to make it the store of all enterprise data. The file system, revision controller systems and other local store storage systems might have specific advantages but they are not designed for enterprise data storage as such.

If the documents you are storing qualify as enterprise data - if they are used persistently through-out the enterprise - then it is logical to keep them in the database. If you are having problems with storing in the database, perhaps a DBA can find a better solution. You might even have to move them out of the database for performance reasons but I don't think you should move them out of the database for best-practices reasons.

Of course, if the documents aren't enterprise data, if they're only used for one application, say, then moving them out of the database would also make sense.

Joe Soul-bringer
  • 3,294
  • 5
  • 31
  • 37
3

I've stored images as BLOBs in the database once and regretted it the first time I had to perform a batch operation on those images. It would've been much easier to do it in the file system. Also, as you mentioned, it is much faster to retrieve the documents if they live on a file system.

My simple view: the file system should store files, and a relational database should store relational data.

ern
  • 1,522
  • 13
  • 19
1

Store the binary files in the file system. Create a ASP.NET application for the storage and retrieval operations. You can be fancy with the web app (doc versioning, multi-tier security, etc). I think this is the consensus in the doc management industry.

Since your "number of document is growing drastically", looks like this is becoming large scale. You may want to start looking at third-party, out-of-the-box solutions (such as http://kofax.com/capture/ - I have an extensive experience with this!) to do the "dirty job" for you. Or better yet, consider looking at SaaS offering such as these guys http://www.edocumentsolutionsllc.com/

:-)

MarlonRibunal
  • 4,009
  • 3
  • 31
  • 37
0

Store your documents as files such as .doc if you want to be able to access the files and edit and resave them.

Store your documents as files such as .pdf or .tiff if you want actual historical copies which can be pulled back up and reproduced.

Store all the information concerning your files (such as dates, authors, location) in your database.

TheTXI
  • 37,429
  • 10
  • 86
  • 110
0

I always store core info and file path for documents in the database, but never the document itself. Rarely does the entire document need to be in the database.

This allows much more flexibility in using those documents. For example, want to used tiered backup storage and deduping mechanisms? Try that in Oracle BLOBs.

alphadogg
  • 12,762
  • 9
  • 54
  • 88
0

The only advantage I can see to storing documents in the database is the ease of moving those documents to another environment. Apart from that, I wouldn't do it for all the reasons already mentioned.

Tundey
  • 2,926
  • 1
  • 23
  • 27
0

On the contrary I would go for storage in the database for a couple of reasons:

  1. Simpler backup strategy
  2. Documents stored in the database can be indexed and searched
  3. You don't have to worry about files being moved/security tampered with
  4. Easy to port to another server in the event of a crash
  5. If the government mandates you must store data going back x years, managing this using a database is much easier

Databases are made to store data. Files are just data.

Although having said that there are benefits to storing files on the filesystem, chief one being database performance is better and the size is kept down. SQL Server 2008 allows you to have the best of both worlds using the FileStream. Read this whitepaper for more information

Adam R. Grey
  • 1,861
  • 17
  • 30
Rad
  • 8,336
  • 4
  • 46
  • 45
0

Personal Expertise: Are you a db admin or a programmer?

Security: one setting for the database vs 2 for the database and file system. Is it a concern of someone accidentally moving/deleting the files? In a complex setting an admin may choose to move files to another server and just change the Share or mapping. I know, this would never happen.

New databases are improving in this area.

JeffO
  • 7,957
  • 3
  • 44
  • 53
0

Consider storing your documents in subversion, or other version control system. You'll have a good backup, the ability to look at old versions of documents and splendid network access. See "My life on subversion".

Adam Matan
  • 128,757
  • 147
  • 397
  • 562