
I'm developing a database that holds large scientific datasets. Typical usage scenario is that on the order of 5GB of new data will be written to the database every day; 5GB will also be deleted each day. The total database size will be around 50GB. The server I'm running on will not be able to store the entire dataset in memory.

I've structured the database such that the main data table is just a key/value store consisting of a unique ID and a Value.

Queries are typically for around 100 consecutive values, e.g. SELECT Value WHERE ID BETWEEN 7000000 AND 7000100;
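For concreteness, the access pattern can be sketched like this, using SQLite as a stand-in for MySQL (the table name "data" is assumed; per the comments below, the real table is just ID BIGINT as primary key plus Value FLOAT):

```python
import sqlite3

# Minimal sketch of the schema and query pattern described above.
# SQLite here is only a stand-in for MySQL; "data" is an assumed table name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ID INTEGER PRIMARY KEY, Value REAL)")
conn.executemany(
    "INSERT INTO data (ID, Value) VALUES (?, ?)",
    ((i, i * 0.5) for i in range(7_000_000, 7_001_000)),
)

# A typical query: ~100 consecutive IDs, answered by a single index range
# scan because the rows are located through the primary key.
rows = conn.execute(
    "SELECT Value FROM data WHERE ID BETWEEN 7000000 AND 7000100"
).fetchall()
print(len(rows))  # 101 -- BETWEEN is inclusive on both ends
```

Because the lookups are always contiguous ranges on the primary key, the workload is effectively a sorted key/value scan, which is exactly what the answers below compare engines on.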

I'm currently using MySQL / MyISAM, and these queries take on the order of 0.1 - 0.3 seconds, but recently I've come to realize that MySQL is probably not the optimal solution for what is basically a large key/value store.

Before I start doing lots of work installing the new software and rewriting the whole database I wanted to get a rough idea of whether I am likely to see a significant performance boost when using a NoSQL DB (e.g. Tokyo Tyrant, Cassandra, MongoDB) instead of MySQL for these types of retrievals.

Thanks

Pete W
  • Before abandoning RDBMSes, I'd want to profile against MySQL/InnoDB and PostgreSQL. I'd also want to be sure I had appropriate indices on the table. – tpdi Aug 06 '10 at 18:34
  • Re: indices, my table has two columns: ID BIGINT; Value FLOAT; and I have ID as my primary key, as my queries always use WHERE ID BETWEEN ... – Pete W Aug 06 '10 at 18:40
  • Wow! 50GB of data in a two-column table. I'd think 0.1 to 0.3 secs isn't to be sneezed at, in the circumstances. If it's any of our business, perhaps you could tell us what you're keeping in what must be a near-record-breaking table? – Brian Hooper Aug 06 '10 at 18:48
  • If you move to MongoDB you can shard your db over multiple machines so that the whole dataset fits in memory, which will make it really fast. Sharding in MongoDB can deal with range queries. Of course, those extra machines cost money; it is up to you. You can also try using an SSD. – TTT Aug 06 '10 at 20:01
  • Brian: I should have said that currently I'm not running with a full dataset, so my 0.1-0.3s retrievals only apply to a total table size of just over 5GB (but it will eventually be 50GB). N.B. my current server only has 512MB of RAM(!). The data is a set of geophysical satellite data. TTT: Good point about the sharding. That is definitely an option. – Pete W Aug 06 '10 at 20:12
  • Any of the key/value NoSQL DBs should work well for you, especially if you have some other machines lying around that you can make a cluster with. – Zanson Aug 07 '10 at 23:02
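To make the sharding point in the comments concrete, here is a toy model (not MongoDB's actual implementation; the chunk size and shard count are made up) of why range-based partitioning keeps a BETWEEN-style query on very few machines:

```python
# Toy model of range-based sharding: consecutive keys fall into the same
# chunk, so a query for ~100 consecutive IDs touches very few shards.
# CHUNK and SHARDS are illustrative values, not MongoDB defaults.
SHARDS = 4
CHUNK = 1_000_000

def shard_for(key: int) -> int:
    # Chunks of consecutive keys are assigned round-robin to shards.
    return (key // CHUNK) % SHARDS

# The typical query range from the question: 101 consecutive IDs.
touched = {shard_for(k) for k in range(7_000_000, 7_000_101)}
print(touched)  # a single shard serves the whole range
```

With hash-based partitioning, by contrast, those 101 keys would scatter across all shards, so a range-aware partitioner matters for this workload.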

3 Answers


Please also consider OrientDB. It uses indexes based on an RB+Tree algorithm. In my tests against a 100GB database, reads of 100 items took 0.001-0.015 seconds on my laptop, but it depends on how the keys/values are distributed inside the index.

Making your own test with it should take less than an hour.

One piece of bad news is that OrientDB does not yet support a clustered configuration (planned for September 2010).

Lvca

I would expect Cassandra to do better than a b-tree-based system like Tokyo Cabinet, MySQL, or MongoDB when the dataset does not fit in memory. Of course, Cassandra is also designed so that if you need more performance, it's trivial to add more machines to support your workload.

bakkal
jbellis

I use MongoDB in production for a write-intensive workload where I sustain well over the rates you are referring to, for both WRITE and READ operations. The database is around 90GB, and a single instance (Amazon m1.xlarge) does 100 QPS. I can tell you that a typical key->value query takes about 1-15 ms on a database with 150M entries, with query times reaching 30-50 ms under heavy load. At any rate, 200 ms is way too much for a key/value store.

If you only use a single commodity server, I would suggest MongoDB, as it is quite efficient and easy to learn. If you are looking for a distributed solution, you can try any Dynamo clone: Cassandra (Facebook) and Project Voldemort (LinkedIn) are the most popular. Keep in mind that requiring strong consistency slows these systems down quite a bit.

Asaf
  • Thanks - am running some benchmarks now with MongoDB, Tokyo Tyrant and Cassandra. I am definitely seeing vast improvements in query times. However, FYI, bulk inserts are proving not quite so fast (compared to MySQL's LOAD DATA INFILE). – Pete W Aug 26 '10 at 22:10