
I'm working on a system that will generate and store large amounts of data to disk. A previously developed system at the company used ordinary files to store its data, but for several reasons it became very hard to manage.

I believe NoSQL databases are a good solution for us. What we are going to store is generally documents (usually around 100KB, but occasionally much larger or smaller) annotated with some metadata. Query performance is not the top priority. The priority is writing the data in a way that makes I/O as small a hassle as possible. The rate of data generation is about 1Gbps, but we might be moving to 10Gbps (or even more) in the future.

My other requirement is the availability of a (preferably well-documented) C API. I'm currently testing MongoDB. Is this a good choice? If not, what other database system can I use?
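
For reference, the kind of test I'm running stores each document plus its metadata through the MongoDB C driver (libmongoc) roughly like the sketch below. The connection URI, database/collection names, and metadata fields are just placeholders, not our real schema.

```c
/* Rough sketch of what I'm testing with the MongoDB C driver (libmongoc).
 * The URI, database/collection names and metadata fields are placeholders. */
#include <stdio.h>
#include <string.h>
#include <mongoc/mongoc.h>

int main(void) {
    mongoc_init();

    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_collection_t *coll =
        mongoc_client_get_collection(client, "archive", "documents");

    /* The ~100KB payload plus a few metadata fields in one BSON document. */
    const char *payload = "...";  /* would be the generated document body */
    bson_t *doc = BCON_NEW(
        "source",    BCON_UTF8("sensor-42"),
        "timestamp", BCON_DATE_TIME(1333580400000),
        "body",      BCON_BIN(BSON_SUBTYPE_BINARY,
                              (const uint8_t *) payload,
                              (uint32_t) strlen(payload)));

    bson_error_t error;
    if (!mongoc_collection_insert_one(coll, doc, NULL, NULL, &error))
        fprintf(stderr, "insert failed: %s\n", error.message);

    bson_destroy(doc);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
```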

Elektito

2 Answers

4

The rate of data generation is about 1Gbps,... I'm currently testing MongoDB. Is this a good choice?

OK, so just to clarify, your data rate is ~1 gigaBYTE per 10 seconds. So you are filling a 1TB hard drive every 20 minutes or so?

MongoDB has pretty solid write rates, but it is ideally used in situations with a reasonably low data-to-RAM ratio. You want to keep at least the primary indexes in memory, along with some data.

In my experience, you want about 1GB of RAM for every 5-10GB of Data. Beyond that number, read performance drops off dramatically. Once you get to 1GB of RAM for 100GB of data, even adding new data can be slow as the index stops fitting in RAM.

The big key here is:

What queries are you planning to run and how does MongoDB make running these queries easier?

Your data is very quickly going to occupy enough space that basically every query will just be going to disk. Unless you have a very specific indexing and sharding strategy, you end up just doing disk scans.
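
To make that concrete, declaring an index on one of your metadata fields through the C driver could look something like the sketch below. The database, collection, and "source" field names are invented, and whether such an index actually helps depends entirely on the queries you run.

```c
/* Sketch: create an index on a hypothetical "source" metadata field by
 * sending the createIndexes command through libmongoc. Names are made up. */
#include <stdio.h>
#include <mongoc/mongoc.h>

int main(void) {
    mongoc_init();
    mongoc_client_t *client = mongoc_client_new("mongodb://localhost:27017");
    mongoc_database_t *db = mongoc_client_get_database(client, "archive");

    bson_t *cmd = BCON_NEW(
        "createIndexes", BCON_UTF8("documents"),
        "indexes", "[",
            "{", "key", "{", "source", BCON_INT32(1), "}",
                 "name", BCON_UTF8("source_1"), "}",
        "]");

    bson_t reply;
    bson_error_t error;
    if (!mongoc_database_command_simple(db, cmd, NULL, &reply, &error))
        fprintf(stderr, "createIndexes failed: %s\n", error.message);

    bson_destroy(&reply);
    bson_destroy(cmd);
    mongoc_database_destroy(db);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
```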

Additionally, MongoDB does not support compression. So you will be using lots of disk space.

If not, what other database system can I use?

Have you considered compressed flat files? Or possibly a big-data Map/Reduce system like Hadoop (I know Hadoop is written in Java)?
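
The compressed flat file approach can be as simple as appending length-prefixed, gzip-compressed records to a rolling file. Here is a minimal sketch with zlib; the file name and the record framing are just one possible choice, not a prescription.

```c
/* Minimal sketch of the compressed-flat-file idea using zlib's gzip file API.
 * The file name and the length-prefixed record framing are invented here. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <zlib.h>

int main(void) {
    const char *doc = "...the ~100KB document body...";
    uint32_t len = (uint32_t) strlen(doc);

    gzFile out = gzopen("archive-0001.gz", "ab");  /* append, gzip-compressed */
    if (!out) {
        fprintf(stderr, "gzopen failed\n");
        return 1;
    }

    /* Length prefix (native byte order) followed by the document,
     * so records can be read back sequentially later. */
    gzwrite(out, &len, sizeof len);
    gzwrite(out, doc, len);
    gzclose(out);
    return 0;
}
```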

If a C API is a key requirement, maybe you want to look at Tokyo Cabinet or Kyoto Cabinet?
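
In case Tokyo/Kyoto Cabinet is new to you: the C API is quite small. Storing one record in a Kyoto Cabinet hash database looks roughly like this (the file name and key scheme are made up):

```c
/* Rough sketch of storing one document in a Kyoto Cabinet hash database
 * through its C API (kclangc.h). File name and key scheme are made up. */
#include <stdio.h>
#include <string.h>
#include <kclangc.h>

int main(void) {
    const char *key = "doc:000000001";  /* e.g. a sequence number */
    const char *val = "...the document body plus metadata...";

    KCDB *db = kcdbnew();
    if (!kcdbopen(db, "archive.kch", KCOWRITER | KCOCREATE)) {
        fprintf(stderr, "open error: %s\n", kcdbemsg(db));
        kcdbdel(db);
        return 1;
    }

    if (!kcdbset(db, key, strlen(key), val, strlen(val)))
        fprintf(stderr, "set error: %s\n", kcdbemsg(db));

    kcdbclose(db);
    kcdbdel(db);
    return 0;
}
```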


EDIT: more details

MongoDB does not support full-text search. You will have to look to other tools (Sphinx/Solr) for such things.

Large indices defeat the purpose of using an index.

According to your numbers, you are writing 10M documents / 20 mins or about 30M / hour. Each document needs about 16+ bytes for an index entry. 12 bytes for ObjectID + 4 bytes for pointer into the 2GB file + 1 byte for pointer to file + some amount of padding.

Let's say that every index entry needs about 20 bytes, then your index is growing at 600MB / hour or 14.4GB / day. And that's just the default _id index.

After 4 days, your main index will no longer fit into RAM and your performance will start to drop off dramatically. (This is well-documented MongoDB behavior.)

So it's going to be really important to figure out which queries you want to run.

Gates VP
  • Disk space is not a big limitation but RAM is. Currently the system has 48GB of RAM. I might be able to get more RAM if we move to 10Gbps or more. The queries we will be running are either on the metadata or on some sort of full-text index (by the way, does MongoDB support full-text indexing?). As to the eventual size of the database, we might have to retain two months of data (or even more). I'm not familiar with Tokyo/Kyoto. I need to read more about them. – Elektito Apr 05 '12 at 09:17
  • Nitpick: 1Gbps usually means 1 gigabit, not one gigabyte (from http://en.wikipedia.org/wiki/Gigabyte, "The unit symbol for the gigabyte is GB or Gbyte, but not Gb (lower case b) which is typically used for the gigabit.") Still seems quite fast, but I've worked in situations where financial market data was delivered at this rate. – Chris Shain Apr 05 '12 at 14:51
  • By the way, 1GB of RAM for every 5-10GB of data sounds like an awful lot to me. Why such a high ratio? Indices that I have seen all have a much smaller ratio. Large indices defeat the purpose of using an index. – Elektito Apr 05 '12 at 16:06
  • @ChrisShain: He's saying 1 gigabyte _per 10 seconds_, so it's correct. – Elektito Apr 05 '12 at 16:07
  • @Homayoon ah yeah missed that! – Chris Shain Apr 05 '12 at 16:15
  • By my calculation, if each entry is about 100KB, that will be about 1200 entries per second which makes for about 2GB of index data per day not 14GB. I still see your point, though. Am I right in thinking that this is more or less true no matter what database I use? How will this affect _write_ performance? Also, I'm pretty sure there are other entities which handle such large amounts of data. How do they handle it? – Elektito Apr 05 '12 at 22:30
  • This is why I asked about the queries you intend to perform. People dealing with that much data generally use a tool like Hadoop. Twitter has a tool called Storm that they have built. However, the Twitter firehose seems to peak at around 250Mbps and you want 1000Mbps (http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html). So there are a limited number of entities handling this problem, and they're probably mostly custom jobs. – Gates VP Apr 06 '12 at 17:27
  • Sorry I'm replying this late to your comment. I just saw it (I swear stackoverflow didn't notify me!). Anyway, about the queries, the system has only a handful of users that can query the database for the metadata (very simple queries, really). If they want to search within the documents themselves, I think I can sort out a full-text index whether the database itself supports it or not. It's okay if the queries take some time to process, as long as they don't hinder data storage. Does that answer your question regarding the kind of queries we intend to perform? – Elektito Apr 08 '12 at 23:26
  • It really sounds like you need a few things. You need a Full-Text Search engine (like Solr), you need some form of Key / Value storage for finding very specific documents. You need something that can run across multiple nodes with relatively efficient key generation. Honestly, I don't know of any DB that really does this on its own. If you're planning to process this much data, you clearly have a big team. You should consider litmus testing a few different DBs in combination to get what you need. – Gates VP Apr 09 '12 at 00:47
2

Have a look at Cassandra. Its writes are much faster than its reads, which is probably what you're looking for.

Maksym Polshcha
  • Correct me if I'm wrong, but isn't Cassandra a BigTable-like solution suitable for when you have many columns? Will it work as a schemaless database? Also, there doesn't seem to be a C API available for Cassandra. – Elektito Apr 05 '12 at 18:06
  • @Homayoon Actually, Cassandra is schemaless. Read http://www.datastax.com/solutions/schema-less-database. At least in Thrift trunk, there is C glib support, which means it's possible to make a C client for Cassandra. It's probably not well tested yet. I also started a C++ client that supports Cassandra 0.7, but I'm not sure if it has already been finished: https://github.com/thobbs/Coroebus – Maksym Polshcha Apr 05 '12 at 18:17