7

As part of my work, we receive approximately 25 TB of log files annually, currently stored on an NFS-based filesystem. Some are archived as zipped/tar.gz, while others reside in plain text format.

I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB; the fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as-is.

As for usage, we intend to put a small REST API on top and allow people to get a file listing, the latest files, and the ability to fetch a file.
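
For illustration, roughly the kind of API we have in mind (a minimal sketch only; Flask, the endpoint names, and the storage root are placeholders, not decisions):

```python
# Sketch of the intended REST API: list files, list latest files, fetch a file.
# LOG_ROOT is a placeholder for whatever storage backend ends up behind it.
import os
from flask import Flask, jsonify, send_from_directory

app = Flask(__name__)
LOG_ROOT = "/mnt/logs"  # placeholder mount point

@app.route("/logs")
def list_logs():
    # Full file listing
    return jsonify(sorted(os.listdir(LOG_ROOT)))

@app.route("/logs/latest")
def latest_logs():
    # Most recently modified files first
    names = sorted(os.listdir(LOG_ROOT),
                   key=lambda n: os.path.getmtime(os.path.join(LOG_ROOT, n)),
                   reverse=True)
    return jsonify(names[:20])

@app.route("/logs/<path:name>")
def get_log(name):
    # Stream the file back exactly as stored
    return send_from_directory(LOG_ROOT, name)

if __name__ == "__main__":
    app.run()
```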

The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally effectively by adding more machines.

Ankur

Ankur Gupta
  • Just to do the math: that's 500GB/week or 100GB each business day. – egrunin Oct 09 '10 at 05:39
  • @egrunin Thanks for the math. We already have a year's worth of data. @chaos These log files come from storage arrays installed globally. – Ankur Gupta Oct 09 '10 at 05:51
  • @Ankur, would a JSON format work for you if it had one object per log message, with one of the object's properties being the original log message and the others being queryable fields extracted from that log message? It increases the data storage requirements, but would allow MongoDB and CouchDB to be considered. – Jim Ferrans Oct 09 '10 at 06:29
  • @Jim, what an idea! I didn't think of that. Thanks. I think it does make CouchDB and MongoDB contenders. I don't want to query the log files, only store them and provide a REST API on top. – Ankur Gupta Oct 09 '10 at 06:35
  • Take a look at Vertica too, it seems to be quite good at this sort of thing. – Jim Ferrans Oct 09 '10 at 06:41
  • So, all you need to do is store files and retrieve them by file name? How is a **file system** not suited to that task? – JoshD Oct 09 '10 at 06:47
  • @JoshD It's currently on top of NFS, as I mentioned. I am looking for something better: faster seek time, automatic compression. Yes, I can always write code to do these things, but is there an off-the-shelf product for this? Like Jim mentioned above, I could also put MongoDB etc. to use. So I'm just learning what my options are. – Ankur Gupta Oct 09 '10 at 06:56
  • @Ankur Gupta: My thought was that if you're just storing and retrieving files (and printing the file list), a database is not the best solution. File systems are exactly what you need, so that's what I'd suggest looking into. If listing the files takes too long, break them into several folders (perhaps one per week or per month). – JoshD Oct 09 '10 at 07:05
  • It seems to me that all that's needed is a smart folder structure with automatically generated sub-folders to prevent too many files in one folder, plus a little bit of code for compression and decompression (see the sketch after these comments). AFAIK MongoDB and CouchDB don't support compression and decompression. – TTT Oct 10 '10 at 04:21
  • mongodb works with memory-mapped files. You can't store more data than the virtual address space you have available. Keep in mind that most 64-bit machines only support 48 bits of virtual address space, so you'll run out when you hit about 281 TB :-) – nos Oct 13 '10 at 19:58
  • Have you considered Logstash? It's an open source log collector, which can store logs in a distributed ElasticSearch cluster, which should be able to scale horizontally. – Jon Skarpeteig Mar 21 '13 at 10:30
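
As a concrete illustration of the folder-structure-plus-compression idea from the comments above (the root path, naming scheme, and choice of gzip are assumptions for illustration, not anything specified in the question):

```python
# Sketch: date-based sub-folders plus gzip compression for incoming log files,
# on top of whatever filesystem sits underneath. Paths are placeholders.
import gzip
import os
import shutil
from datetime import date

LOG_ROOT = "/mnt/logs"  # placeholder root directory

def archive(src_path):
    """Copy a log file into YYYY/MM/DD sub-folders, gzip-compressed."""
    today = date.today()
    dest_dir = os.path.join(LOG_ROOT, f"{today:%Y}", f"{today:%m}", f"{today:%d}")
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, os.path.basename(src_path) + ".gz")
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```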

5 Answers

4

Since you don't want querying features, you can use Apache Hadoop.

I believe HDFS and HBase will be a nice fit for this.

You can see a lot of huge-storage stories on the Hadoop PoweredBy page.

RameshVel
3

Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.

Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.

Jim Ferrans
3

Have you tried looking at Gluster? It is scalable, provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.

http://www.gluster.org/
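
Since a mounted Gluster volume behaves like any local filesystem, serving the files needs nothing beyond ordinary file I/O. A tiny sketch (the mount point is an assumption):

```python
# With a Gluster volume mounted at an ordinary path, standard file operations
# work as-is; no extra storage API layer is needed. Mount point is assumed.
import os

GLUSTER_MOUNT = "/mnt/glustervol/logs"  # assumed mount point of the volume

def list_logs():
    return sorted(os.listdir(GLUSTER_MOUNT))

def read_log(name):
    with open(os.path.join(GLUSTER_MOUNT, name), "rb") as f:
        return f.read()
```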

Nauman
3

I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be linear scan. One problem that you will run into is retention: most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.

Put your data in HDFS with 2-3 way replication in 64 MB chunks in the same format that it's in now.
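
A rough sketch of that, assuming the WebHDFS-based HdfsCLI Python client (`hdfs` on PyPI); the namenode URL, user, and paths are placeholders, and the `replication`/`blocksize` keywords correspond to the WebHDFS CREATE parameters:

```python
# Sketch: push a log file into HDFS unchanged, asking for 3x replication and a
# 64 MB block size. Namenode URL, user, and paths are placeholders.
from hdfs import InsecureClient  # HdfsCLI package (pip install hdfs)

client = InsecureClient("http://namenode:9870", user="loguser")

def store_log(local_path, hdfs_path):
    with open(local_path, "rb") as f:
        client.write(hdfs_path, f, replication=3, blocksize=64 * 1024 * 1024)

store_log("array42.log.gz", "/logs/2010/10/array42.log.gz")
print(client.list("/logs/2010/10"))
```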

Spike Gronim
0

If you are to choose a document database:

On CouchDB you can use the _attachments API to attach the file as-is to a document; the document itself could contain only metadata (like timestamp, locality, etc.) for indexing. Then you will have a REST API for the documents and the attachments.
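
A minimal sketch of that flow against CouchDB's HTTP API using `requests` (the database URL, document id, and metadata fields are placeholders):

```python
# Sketch: store a metadata-only document in CouchDB, then attach the raw log
# file unchanged via the attachment API. Names and fields are placeholders.
import requests

COUCH = "http://localhost:5984/logs"  # assumed database URL
doc_id = "array42-2010-10-09"

# 1. Create the document holding only metadata for indexing.
meta = {"timestamp": "2010-10-09T05:39:00Z", "locality": "eu-west", "array": "array42"}
rev = requests.put(f"{COUCH}/{doc_id}", json=meta).json()["rev"]

# 2. Attach the original log file as-is.
with open("array42.log", "rb") as f:
    requests.put(
        f"{COUCH}/{doc_id}/array42.log",
        params={"rev": rev},
        data=f,
        headers={"Content-Type": "text/plain"},
    )

# 3. The attachment is then available over plain HTTP:
#    GET http://localhost:5984/logs/array42-2010-10-09/array42.log
```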

A similar approach is possible with Mongo's GridFS, but you would build the API yourself.
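
For the GridFS route, a minimal sketch with `pymongo` (connection string, database, and metadata fields are placeholders):

```python
# Sketch: store and retrieve a log file unchanged using MongoDB GridFS.
# Connection string, database name, and metadata fields are placeholders.
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["logstore"]
fs = gridfs.GridFS(db)

# Store the raw file; extra keyword arguments are kept as queryable metadata.
with open("array42.log", "rb") as f:
    file_id = fs.put(f, filename="array42.log", locality="eu-west")

# Retrieve the latest version by filename, byte-for-byte as stored.
data = fs.get_last_version("array42.log").read()
```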

Also HDFS is a very nice choice.

diogok