My job is to design a distributed system for storing and serving static image/video files. The total data size is on the order of tens of terabytes. Access is almost entirely plain HTTP reads, so there is no heavy processing of the data; at most simple processing such as resizing, and even that is not a concern here because it can be done directly in the application.
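Just to illustrate what I mean by "done directly in the application" (a rough sketch, assuming Pillow; the function name is mine):

```python
# Rough sketch only: the kind of "simple processing" I have in mind, done in
# the application with Pillow rather than in the storage layer.
from io import BytesIO
from PIL import Image

def make_thumbnail(image_bytes: bytes, max_size=(256, 256)) -> bytes:
    """Resize an image in memory; the storage layer only ever sees opaque bytes."""
    img = Image.open(BytesIO(image_bytes))
    img.thumbnail(max_size)                 # in place, keeps aspect ratio
    out = BytesIO()
    img.save(out, format=img.format or "JPEG")
    return out.getvalue()
```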
To be a little more clear, it's a system that:
- Must be distributed (horizontally scalable), because the total size of the data is far too big for one machine.
- Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
- Generally has no requirement to process the data (so MapReduce is not needed).
- Should make it easy to expose the data over HTTP.
- (Should have) good throughput.
I am considering:
Native network file system: it doesn't seem feasible, because the data cannot fit on a single machine.
HDFS (the Hadoop filesystem): I have worked with Hadoop MapReduce before, but I have no experience using HDFS as a repository of static files served over HTTP, so I don't know whether it's possible or whether it's a recommended approach.
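For what it's worth, my understanding is that HDFS ships a REST interface (WebHDFS), so HTTP access might look roughly like the sketch below. The hostname, port, and path are placeholders, and I don't know whether this is suitable for serving lots of small files at high throughput:

```python
# Sketch of fetching a file over HTTP via WebHDFS. Hostname/port/path are made
# up; 50070 is the classic NameNode HTTP port, newer releases use 9870.
import requests

resp = requests.get(
    "http://namenode.example.com:50070/webhdfs/v1/images/12345/thumb.jpg",
    params={"op": "OPEN"},
    # WebHDFS answers OPEN with a redirect to a DataNode, which streams the
    # bytes; requests follows the redirect by default.
)
resp.raise_for_status()
image_bytes = resp.content
```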
MogileFS: it seems promising, but I feel that using MySQL to manage local files (on a single machine) would add too much overhead.
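My mental model of how serving from MogileFS would work is below. The `lookup_paths` helper is hypothetical, standing in for whatever the tracker client library actually provides; I haven't used MogileFS yet, so this is only how I picture the flow:

```python
# Rough sketch of serving a file from MogileFS: ask a tracker for the HTTP
# paths of a key, then redirect the browser to a storage node.
from flask import Flask, abort, redirect

app = Flask(__name__)

def lookup_paths(key: str) -> list[str]:
    """Hypothetical: query a MogileFS tracker for the storage-node URLs of `key`."""
    raise NotImplementedError

@app.route("/files/<path:key>")
def serve(key):
    paths = lookup_paths(key)
    if not paths:
        abort(404)
    return redirect(paths[0], code=302)  # let the storage node serve the bytes
```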
Any suggestions, please?