First off, have a look at this: Storing a million images in the filesystem. While it isn't about backups, it is a worthwhile discussion of the topic at hand.
And yes, large numbers of small files are pesky; they take up inodes, require space for filenames, etc., and it takes time to back up all of that metadata. Basically, it sounds like you have the serving of the files figured out: if you run it on nginx, with varnish in front or the like, you can hardly make it any faster. Adding a database under that will only make things more complicated, also when it comes to backing up. So I would suggest working harder on an in-place FS backup strategy.
First, have you tried rsync with the -az switches (archive and compression, respectively)? It tends to be highly effective, since rsync doesn't transfer the same files again and again.
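As a minimal sketch (the paths and the host name are placeholders, so adjust them to your setup):

    # mirror the image tree to the backup host; only new/changed files are sent
    rsync -az /var/www/images/ backup.example.tld:/backups/images/

Run it from cron and you have a basic incremental backup without any extra tooling.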
Alternatively, my suggestion would be to tar + gzip the files into a number of archives. In shell (and assuming you have them sorted into different sub-folders):
for prefix in $(ls -1); do
    # archive one sub-folder, compress it, and stream it straight to the backup host
    tar -c "$prefix" | gzip -9 | ssh destination.example.tld "cat > backup_$(date -I)_$prefix.tar.gz"
done
This will create a number of .tar.gz files that are easily transferred without too much overhead.
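For completeness, restoring one of those archives on the destination host is then a plain extraction (the file name below is just a placeholder matching the pattern the loop produces):

    # unpack a single archive back into its original sub-folder
    tar -xzf backup_2024-01-01_img.tar.gz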