57

I'm currently designing the architecture for a web-based application that should also provide some kind of image storage. Users will be able to upload photos as one of the key features of the service, and viewing these images (via the web) will be one of the primary usages.

However, I'm not sure how to realize such a scalable image storage component in my application. I have already thought about different solutions, but due to my lack of experience, I look forward to hearing your suggestions. Aside from the images, metadata must also be saved. Here are my initial thoughts:

  1. Use a (distributed) filesystem like HDFS and prepare dedicated webservers as "filesystem clients" in order to save uploaded images and serve requests. Image metadata is saved in an additional database, including the filepath information for each image.

  2. Use a BigTable-oriented system like HBase on top of HDFS and save images and metadata together. Again, webservers bridge image uploads and requests.

  3. Use a completely schemaless database like CouchDB for storing both images and metadata. Additionally, use the database itself for upload and delivery via its HTTP-based RESTful API. (Additional question: CouchDB saves blobs via Base64. Can it, however, return data with a content type like image/jpeg?)

b_erb

11 Answers

47

We have been using CouchDB for that, saving images as an "Attachment". But after a year, the multi-dozen-GB CouchDB database files turned out to be a headache. For example, CouchDB replication still has issues if you use it with very large documents.

So we just rewrote our software to use CouchDB for image information and Amazon S3 for the actual image storage. The code is available at http://github.com/hudora/huImages

You might want to set up an Amazon S3-compatible storage service on-site for your project. This keeps you flexible and leaves the Amazon option open without requiring external services for now. Walrus seems to be becoming the most popular and scalable S3 clone.
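For illustration, a minimal sketch of how code can stay portable between an on-site clone and Amazon itself (assuming Python with boto3; the endpoint, bucket, key names and credentials are placeholders):

import boto3

# Hypothetical on-site S3-compatible endpoint; dropping endpoint_url later
# points the very same code at real Amazon S3.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3.internal.example.com:8080",
    aws_access_key_id="LOCAL_KEY",
    aws_secret_access_key="LOCAL_SECRET",
)

# Upload an image with its content type so it can be served back as image/jpeg.
s3.upload_file("photo.jpg", "images", "user42/photo.jpg",
               ExtraArgs={"ContentType": "image/jpeg"})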

I also urge you to look into the design of LiveJournal with their excellent open-source MogileFS and Perlbal offerings. This combination is probably the most famous image-serving setup.

The Flickr architecture can also be an inspiration, although they don't offer open-source software to the public like LiveJournal does.

max
  • Could you please elaborate in more detail on how you implemented the image storage? In particular, it's interesting how you did authorization. – Eugeniu Torica Oct 20 '11 at 12:27
  • Authorization was only by non-guessable URLs. – max Oct 21 '11 at 16:22
  • I mean that on one side you have to add images to the image storage, and this function should only be available to certain authenticated users. On the other side, reads should be available to everyone so that images can actually be displayed to users. – Eugeniu Torica Oct 24 '11 at 11:16
  • 1
    Ah, I understand. The CouchDB was only accessible to our internal servers. They all had full r/w permission. Further permissions regarding who was able to upload were handled by the web app. https://bitbucket.org/petrilli/django-storages/src/5cac7fceb0f8/backends/couchdb.py is one part of the gears we have been using. – max Oct 25 '11 at 21:20
  • For those looking for alternatives to this problem, RiakCS is now available in open source and offers an S3-compatible API: http://basho.com/riak-cloud-storage/ – vdaubry Jul 26 '14 at 17:22
15

"Additional question: CouchDB does save blobs via Base64."

CouchDB does not save blobs as Base64; they are stored as straight binary. When retrieving a JSON document with ?attachments=true, we do convert the on-disk binary to Base64 in order to add it safely to JSON, but that's just a presentation-level thing.

See Standalone Attachments.

CouchDB serves attachments with the content type they are stored with; it's possible, in fact common, to serve HTML, CSS and GIF/PNG/JPEG attachments directly to browsers.

Attachments can be streamed and, in CouchDB 1.1, even support the Range header (for media streaming and/or resumption of an interrupted download).
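For example, a rough sketch of the standalone attachment API with Python's requests library (a local CouchDB and the database/document names are assumptions):

import requests

COUCH = "http://localhost:5984/images"  # hypothetical database

# Create a document to hang the attachment on.
rev = requests.put(f"{COUCH}/photo-doc", json={}).json()["rev"]

# Standalone attachment upload: raw bytes, no Base64, no JSON wrapping.
with open("photo.jpg", "rb") as f:
    rev = requests.put(
        f"{COUCH}/photo-doc/original.jpg",
        params={"rev": rev},
        data=f,
        headers={"Content-Type": "image/jpeg"},
    ).json()["rev"]

# Served back with the stored content type, so a browser renders it directly.
img = requests.get(f"{COUCH}/photo-doc/original.jpg")
assert img.headers["Content-Type"] == "image/jpeg"

# CouchDB 1.1 honours Range requests on attachments.
part = requests.get(f"{COUCH}/photo-doc/original.jpg",
                    headers={"Range": "bytes=0-1023"})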

Robert Newson
  • 1
    At the time of writing the question, they were indeed stored as Base64. – b_erb Jun 07 '11 at 09:13
  • 7
    CouchDB has never stored attachments as Base64. What may have misled you is the ability to ask CouchDB to return attachments with the JSON of your document. To do that, it's necessary to wrap them in Base64. On disk, it's always been the real bytes. – Robert Newson Aug 21 '11 at 18:14
  • Yes, my comment was misleading. I was not referring to the underlying storage mechanism, but the way attachments could be accessed via the API. – b_erb Aug 21 '11 at 18:20
10

Use Seaweed-FS (used to be called Weed-FS), an implementation of Facebook's Haystack paper.

Seaweed-FS is very flexible and pared down to the basics. It was created to store billions of images and serve them fast.
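A minimal sketch of the write/read path (assuming a Seaweed-FS master on its default port and Python's requests library):

import requests

MASTER = "http://localhost:9333"  # assumed default Seaweed-FS master

# Step 1: ask the master to assign a file id and a volume server.
assign = requests.get(f"{MASTER}/dir/assign").json()
fid, volume = assign["fid"], assign["url"]

# Step 2: write the image directly to the assigned volume server.
with open("photo.jpg", "rb") as f:
    requests.post(f"http://{volume}/{fid}", files={"file": f})

# Reads hit the volume server directly: one lookup, one fetch, which is
# the Haystack idea of avoiding per-file filesystem metadata.
image = requests.get(f"http://{volume}/{fid}").content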

chrislusf
  • 1
    Hello. We've got 1 server with `~3m` thumbnails. At peak time it processes `12k` requests per second. Everything is OK, so it's a good idea to try weed-fs – fedor.belov Feb 24 '15 at 11:09
3

Have you considered Amazon Web Services? S3 is web-based file storage, and SimpleDB is a key->attribute store. Both are performant and highly scalable. It's more expensive than maintaining your own servers and setups (assuming you are going to do it yourself and not hire people), but you get up and running much more quickly.

Edit: I take that back - it's more expensive in the long run at high volumes, but for low volume it beats the initial cost of buying hardware.

S3: http://aws.amazon.com/s3/ (you could store your image files here, and for performance maybe have an image cache on your server, or maybe not)

SimpleDB: http://aws.amazon.com/simpledb/ (metadata could go here: image id mapping to whatever data you want to store)

Edit 2: I didn't even know about this, but there is a new web service called Amazon CloudFront (http://aws.amazon.com/cloudfront/). It is for fast web content delivery, and it integrates well with S3. Kind of like Akamai for your images. You could use this instead of the image cache.
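As a sketch of that split (assuming Python with boto3, which exposes SimpleDB as the "sdb" client; the bucket, domain and ids are hypothetical):

import boto3

s3 = boto3.client("s3")
sdb = boto3.client("sdb")

image_id = "img-00042"

# Image bytes go to S3 under a key derived from the image id...
s3.upload_file("photo.jpg", "my-image-bucket", f"{image_id}.jpg",
               ExtraArgs={"ContentType": "image/jpeg"})

# ...and the metadata goes to SimpleDB as one item per image.
sdb.put_attributes(
    DomainName="image-metadata",
    ItemName=image_id,
    Attributes=[
        {"Name": "owner", "Value": "user42", "Replace": True},
        {"Name": "s3_key", "Value": f"{image_id}.jpg", "Replace": True},
    ],
)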

danben
  • Thanks for that idea, I've already considered it. However, this is an educational project and we cannot use external services; in particular, we cannot spend money on them. Unfortunately, neither S3 nor SimpleDB is an option for us. – b_erb Dec 25 '09 at 14:04
  • Oh. Maybe put that in the question, then. – danben Dec 25 '09 at 14:06
  • Since you can't spend money, what are your hardware limitations? – danben Dec 25 '09 at 14:07
  • We can get the necessary hardware as a bunch of virtualized servers in-house. It is also rather a proof-of-concept project, and at least at the beginning the application won't be used from outside. However, scalability is one of the primary project concerns, so it should be taken into account with foresight. – b_erb Dec 25 '09 at 14:11
3

We use MogileFS. We're small-scale users with less than 8TB and some 50 million files. We switched from storing in Amazon S3 some years ago to get better control of file names and better performance.

It's not the prettiest software, but it's very "field tested" and basically all users are using it the same way you will be.

Ask Bjørn Hansen
  • 2
    To my understanding, MogileFS is better suited for this task than distributed databases (storing files there is not a very natural thing) and better suited than e.g. HDFS (which is good for large files; slices can be stored on different nodes, which is advantageous for MapReduce data locality). Images are small files that don't need slicing, and MogileFS handles this efficiently because it was written to fit this purpose (for LiveJournal.com). – Alexey Tigarev Dec 29 '11 at 17:12
2

Maybe have a look at the description of Facebook's Haystack:

Needle in a haystack: efficient storage of billions of photos

Leen Toelen
  • It would be useful if your answer contained some of the information you linked to, especially because you have linked to a document that seems to require a Facebook login, which for me equates to inaccessible. – Samuel Harmer Apr 03 '20 at 10:04
2

As part of Cloudant, I don't want to push product... but BigCouch solves this problem in my science application stack (physics -- nothing to do with Cloudant, and certainly nothing to do with profit!). It marries the simplicity of the CouchDB design with the auto-sharding and scalability that is missing in single-server CouchDB. I generally use it to store a smaller number of big files (multi-GB) and a large number of small files (100MB or less). I was using S3, but the GET costs actually start to add up for small files that are repeatedly accessed.

Mike Miller
  • Had you considered using an HTTP cache on top of CouchDB for caching the images, such as Akamai or Varnish? – onejigtwojig Aug 09 '11 at 04:50
  • 1
    `I was using S3 but the get costs actually start to add up for small files that are repeatedly accessed.` By default, Amazon S3 doesn't set cache-expiry headers for images, and this alone can add to the bill to some extent. You should consider setting them yourself. –  Nov 28 '11 at 02:06
1

Ok, if all that AWS stuff isn't going to work, here are a couple of thoughts.

As far as (3) goes: if you put binary data into a database, the same data is going to come out. What makes it a JPEG is the format of the data, not what the database thinks it is. What makes the client (a web browser) think it's a JPEG is the Content-Type header being set to image/jpeg. You could also set it to something else (not recommended), like text, and that's how the browser would try to interpret it.
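A minimal sketch of that point with Python's standard http.server (the file read stands in for whatever store the bytes actually come from):

from http.server import BaseHTTPRequestHandler, HTTPServer

class ImageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The bytes could come from CouchDB, HDFS or a plain file; the browser
        # only cares about the Content-Type header below.
        with open("photo.jpg", "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")  # this makes it "a JPEG"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), ImageHandler).serve_forever()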

For on-disk storage, I like CouchDB for its simplicity, but HDFS would certainly work. Here's a link to a post about serving image content from CouchDB: http://japhr.blogspot.com/2009/04/render-couchdb-images-via-sinatra.html

Edit: here's a link to a useful discussion about caching images in memcached vs serving them from disk under linux/apache.

danben
  • 4
    you said `here's a link to a useful discussion...` is the link missing? –  Nov 28 '11 at 02:01
1

I've been experimenting with some of the _update functionality available to CouchDB view servers in my Python view server.

One really cool thing I did was an update function for image uploads, so that I could use PIL to create thumbnails and other related images and attach them to the document when it gets pushed to CouchDB.

This might be useful if you need image manipulation and want to cut down on the amount of code and infrastructure you need to keep up.
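The update handler itself isn't shown here; as a rough standalone approximation of the thumbnail step (assuming Pillow/PIL and the requests library; the database name and document id are hypothetical):

import io
import requests
from PIL import Image

COUCH = "http://localhost:5984/images"  # hypothetical database

def attach_thumbnail(doc_id, rev, original_bytes, size=(128, 128)):
    """Create a thumbnail with PIL and attach it next to the original image."""
    img = Image.open(io.BytesIO(original_bytes)).convert("RGB")
    img.thumbnail(size)  # in-place, preserves aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    resp = requests.put(
        f"{COUCH}/{doc_id}/thumbnail.jpg",
        params={"rev": rev},
        data=buf.getvalue(),
        headers={"Content-Type": "image/jpeg"},
    )
    return resp.json()["rev"]  # new revision after attaching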

mikeal
1

I've written an image store on top of Cassandra. We have a lot of writes and random reads, and the read/write ratio is low. For a high read/write ratio, I suggest MongoDB (GridFS).
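For the GridFS route, a minimal sketch with pymongo (a local MongoDB and the names are assumptions):

import gridfs
from pymongo import MongoClient

db = MongoClient()["photos"]  # hypothetical database
fs = gridfs.GridFS(db)

# Writes: GridFS chunks the file and stores its metadata alongside.
with open("photo.jpg", "rb") as f:
    file_id = fs.put(f, filename="photo.jpg", contentType="image/jpeg")

# Reads: fetch by id and stream the bytes back out.
data = fs.get(file_id).read()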

baklarz2048
  • It's very interesting! I'm writing the same thing now. But I can't judge how good this method of storing will be. Are you still using this method? How much content do you store? – Dmitry Belaventsev Jun 09 '12 at 08:08
  • 3
    4 PB now; I'm moving to Hadoop now. – baklarz2048 Jun 10 '12 at 21:14
  • How much data is stored per node? Did you have issues with compaction (you said your case is write-heavy)? How about repair efficiency? – odiszapc Nov 03 '13 at 05:39
  • @odiszapc I don't use Cassandra anymore. I had 500G to 2T per node. Cassandra satisfies the availability and "auto" scaling requirements, but there were lots of problems with consistency and capacity planning. I had no problem with compaction: writes only, hardly any updates, very rare reads. – baklarz2048 Nov 03 '13 at 10:24
  • You said you moved to Hadoop. Hadoop is a MapReduce framework. Did you mean moving to HDFS? – odiszapc Nov 05 '13 at 01:06
  • Did you ever face a failover situation? How is failover handled by HDFS? – odiszapc Nov 06 '13 at 00:02
  • @odiszapc A normal node crash is transparent to the system. I'm not involved in the project anymore, but I heard that new releases of HDFS don't have a SPOF. – baklarz2048 Dec 15 '13 at 20:01
0

Here is an example of storing blob images in CouchDB using PHP Laravel. In this example, I am storing three images based on user requirements.

Establish the connection to CouchDB:

$connection = DB::connection('your database name');

/* Fetch the user's uploaded images. */

$FirstImage  = base64_encode(file_get_contents(Input::file('FirstImageInput')));
$SecondImage = base64_encode(file_get_contents(Input::file('SecondImageInput')));
$ThirdImage  = base64_encode(file_get_contents(Input::file('ThirdImageInput')));

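// Inline attachments under "_attachments" must be Base64-encoded; CouchDB
// decodes them and stores the raw bytes on disk. $id and $rev presumably
// refer to an existing document that is being updated.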
list($id, $rev) = $connection->putDocument(array(
    'name' => $name,
    'location' => $location,
    'phone' => $phone,
    'website' => $website,
    "_attachments" =>[
        'FirstImage.png' => [
            'content_type' => "image/png",
            'data' => $FirstImage
        ],
        'SecondImage.png' => [
            'content_type' => "image/png",
            'data' => $SecondImage
        ],
        'ThirdImage.png' => [
            'content_type' => "image/png",
            'data' => $ThirdImage
        ]
    ],
), $id, $rev);

...

In the same way, you can store a single image.

Pang