763

I am evaluating what might be the best migration option.

Currently, I am on a sharded MySQL (horizontal partition), with most of my data stored in JSON blobs. I do not have any complex SQL queries (I already migrated away from them when I partitioned my db).

Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:

  • Lots of reads in every query, less regular writes
  • Not worried about "massive" scalability
  • More concerned about simple setup, maintenance and code
  • Minimize hardware/server cost
Community
meow
  • 5
    An official performance benchmark is available: [Cassandra vs MongoDB vs HBase](http://planetcassandra.org/nosql-performance-benchmarks/) – Ravi Nov 11 '14 at 19:21
  • 1
    >Lots of reads in every query, less regular writes => Look for CQRS (separate your reads from your writes probably without event sourcing but check whether you can update your read model async .. sync may work too .. it depends on your use-cases) – bodrin Oct 14 '15 at 14:46
  • 4
    This is a great question actually. I wonder if there is an updated version of it? This one is very old now – slashdottir Jul 31 '18 at 22:54

6 Answers

602

Lots of reads in every query, fewer regular writes

Both databases perform well on reads where the hot data set fits in memory. Both also emphasize join-less data models (and encourage denormalization instead), and both provide indexes on documents or rows, although MongoDB's indexes are currently more flexible.

Cassandra's storage engine provides constant-time writes no matter how big your data set grows. Writes are more problematic in MongoDB, partly because of the b-tree based storage engine, but more because of the multi-granularity locking it does.
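To make the write-path claim concrete, here is a toy Python sketch of a log-structured store of the kind Cassandra's engine is based on (commit log, in-memory memtable, immutable sorted runs). The class name, `memtable_limit`, and every method here are illustrative assumptions, not Cassandra's API; compaction, durability, and deletes are left out.

```python
import bisect


class ToyLogStructuredStore:
    """Toy model of a log-structured write path (not Cassandra's real code)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # flushed runs, each a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write appends and updates memory; it never searches or
        # rewrites data that has already been flushed.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest run first so later writes shadow older ones.
        for run in reversed(self.sstables):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

In this sketch the cost of `put` depends on the memtable limit, not on how much data has already been flushed. The price is paid on reads, which may have to consult several runs.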

For analytics, MongoDB provides a custom map/reduce implementation; Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL). Cassandra also supports use of Spark.

Not worried about "massive" scalability

If you're looking at a single server, MongoDB is probably a better fit. For those more concerned about scaling, Cassandra's no-single-point-of-failure architecture will be easier to set up and more reliable. (MongoDB's global write lock tends to become more painful, too.) Cassandra also gives a lot more control over how your replication works, including support for multiple data centers.

More concerned about simple setup, maintenance and code

Both are trivial to set up, with reasonable out-of-the-box defaults for a single server. Cassandra is simpler to set up in a multi-server configuration since there are no special-role nodes to worry about.

If you're presently using JSON blobs, MongoDB is an insanely good match for your use case, given that it uses BSON to store the data. You'll be able to have richer and more queryable data than you would in your present database. This would be the most significant win for Mongo.
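As a rough illustration of why the move from blobs to documents is cheap, the sketch below decodes rows of `(id, JSON blob)` into dictionaries whose fields are individually addressable. The row contents and the `row_to_document` helper are hypothetical; the only MongoDB-specific detail assumed is that `_id` is the document's primary key field.

```python
import json

# Hypothetical rows as they might come out of a sharded MySQL table:
# an integer key plus an opaque JSON blob.
rows = [
    (1, '{"LastName": "Smith", "Groups": ["Admin", "User"]}'),
    (2, '{"LastName": "Jones", "Groups": ["User"]}'),
]


def row_to_document(row_id, blob):
    """Decode the blob and keep the old primary key as the document's _id."""
    document = json.loads(blob)
    document["_id"] = row_id
    return document


documents = [row_to_document(row_id, blob) for row_id, blob in rows]
```

Once the data is in this shape, each field (`LastName`, `Groups`) can be indexed and queried on its own instead of being fetched and parsed in application code.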

Esteban Verbel
Michael
  • What do you mean by "respective domains" - would you consider them as separate types? Thanks for the great replies! – meow May 24 '10 at 17:16
  • 92
    Totally different, a comment isn't big enough, but ... Cassandra is a linearly scalable (amortized constant time reads & writes) dynamo/google bigtable hybrid that features fast writes regardless of data size. Its feature set is minimalistic, little beyond that of an ordered key value store. MongoDB is a heavily featured (and fast) document store at the cost of durability and guarantees about writes persisting (since they're not immediately written to disk). They're different beasts with different philosophies, MongoDB's closer to an RDBMS replacement ... – Michael May 24 '10 at 23:56
  • 30
    while Cassandra is lower level but allows for uber scaling (see Twitter/Digg/Facebook), but you're going to have to be deliberate in how you lay your data out, build secondary indexes etc, since no flexible querying is allowed. – Michael May 24 '10 at 23:59
  • With Cassandra you will get similar read performance if the setup does not spread data across multiple nodes in the cluster: just having 3 nodes with a replication factor of 3 will give you similar performance, since every node has all the data. So performance can't be compared with MongoDB apples to apples. – mamu Sep 11 '10 at 23:36
  • 11
    Because everyone mentioned Twitter here in relation to Cassandra: they are not using Cassandra for persisting tweets, they still use MySQL there (http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html). OK, but I can imagine that they still store lots of data for other purposes in Cassandra. – disco crazy Jan 13 '12 at 08:05
  • 1
    For those looking to store JSON in blobs, but also want the scale of Cassandra, the usergrid project (https://github.com/usergrid/stack) is a JSON store layered on Cassandra. Every field is implicitly indexed, just as in MongoDB. All open-source. You can host it yourself; alternatively, Apigee provides a freemium hosted service. Recently, Apigee published a demonstration project that allows MongoDB clients to store data into usergrid, without change (wire-level protocol emulation). – Cheeso Sep 16 '12 at 15:55
  • 7
    It looks like the global write lock may have been removed in Mongo 2.2... – Matt Farmer Oct 18 '12 at 04:16
  • It is worth mentioning that you can't use more than one index when making a query in MongoDB. Not sure about Cassandra yet. – Vladimir Prudnikov Oct 28 '12 at 17:14
  • @MattF but lock at database level is not much better in my opinion. I can't understand it.. lot of emotions only. – OZ_ Jan 07 '13 at 14:15
  • What is your view on using MongoDB for an RSS reader application? –  Apr 11 '13 at 18:38
  • MongoDB 2.2.x has Database level locking. But in 2.6.x they have changed the architecture and Collection level locking will be supported in coming releases. – minhas23 Jun 05 '14 at 17:27
  • 20
    Even before my project went live, I am feeling the pain points of MongoDB. Hot backup is a basic requirement. To do a hot backup on a Linux server, you have to first set up an LVM partition (not so common) and take a snapshot before every backup session. Another easy way is to use MongoDB's paid backup service. But that service is expensive (2.3$/GB/month). Soon you will need a replica set for fault tolerance. With the open source version, the nodes can exchange data only as clear text. For SSL you have to go with the Enterprise edition. And that is 10,000$. Goodbye MongoDB. Refactoring my code to Cassandra. – Karthik Sankar Oct 02 '14 at 15:16
  • Now there is no global-write lock in Wired Tiger engine of MongoDB. – Evgeni Nabokov Nov 04 '15 at 16:58
  • 2
    Since MongoDB 3.2, the WiredTiger storage engine is the default, which uses document-level concurrency for writes (MMAPv1 used collection-level concurrency) – thomas legrand Jan 06 '16 at 17:09
148

I've used MongoDB extensively (for the past 6 months), building a hierarchical data management system, and I can vouch for both the ease of setup (install it, run it, use it!) and the speed. As long as you think about indexes carefully, it can absolutely scream along, speed-wise.

I gather that Cassandra, due to its use with large-scale projects like Twitter, has better scaling functionality, although the MongoDB team is working on parity there. I should point out that I've not used Cassandra beyond the trial-run stage, so I can't speak for the detail.

The real swinger for me, when we were assessing NoSQL databases, was the querying - Cassandra is basically just a giant key/value store, and querying is a bit fiddly (at least compared to MongoDB), so for performance you'd have to duplicate quite a lot of data as a sort of manual index. MongoDB, on the other hand, uses a "query by example" model.

For example, say you've got a Collection (MongoDB parlance for the equivalent of an RDBMS table) containing Users. MongoDB stores records as Documents, which are basically binary JSON objects, e.g.:

{
   FirstName: "John",
   LastName: "Smith",
   Email: "john@smith.com",
   Groups: ["Admin", "User", "SuperUser"]
}

If you wanted to find all of the users called Smith who have Admin rights, you'd just create a new document (at the admin console using JavaScript, or in production using the language of your choice):

{
   LastName: "Smith",
   Groups: "Admin"
}

...and then run the query. That's it. There are added operators for comparisons, RegEx filtering etc, but it's all pretty simple, and the Wiki-based documentation is pretty good.
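The matching rule behind the example can be imitated in a few lines of plain Python. This is only a sketch of the semantics, not MongoDB's implementation or driver API: `matches` and `find` are hypothetical helpers, the second user is made-up data, and only two rules are modelled (a scalar field must equal the example value, and an array field matches if it contains the value).

```python
def matches(document, example):
    """Return True if `document` satisfies every field of `example`."""
    for field, wanted in example.items():
        if field not in document:
            return False
        actual = document[field]
        if isinstance(actual, list) and not isinstance(wanted, list):
            # Array field: a scalar example value matches any element.
            if wanted not in actual:
                return False
        elif actual != wanted:
            return False
    return True


def find(collection, example):
    """Return every document in `collection` that matches `example`."""
    return [doc for doc in collection if matches(doc, example)]


users = [
    {"FirstName": "John", "LastName": "Smith",
     "Email": "john@smith.com", "Groups": ["Admin", "User", "SuperUser"]},
    # Hypothetical second user, added only to show a non-match.
    {"FirstName": "Jane", "LastName": "Smith",
     "Email": "jane@smith.com", "Groups": ["User"]},
]

admins_named_smith = find(users, {"LastName": "Smith", "Groups": "Admin"})
```

Here `admins_named_smith` contains only John: both users are called Smith, but only his `Groups` array contains `"Admin"`.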

Richard K.
  • 55
    Update (8th August 2011): Amazon's Ireland EC2 data centre had a lightning-related incident last night, and in sorting out our server recovery, I discovered one pretty crucial point: if you've got a replication set of two servers (and they're easy to set up), make sure you have an Arbiter node, so if one goes down, the other one doesn't panic and stall in Secondary mode! Trust me, that's a pain in the behind to sort out with a big database. – Richard K. Aug 08 '11 at 20:11
  • 8
    To add to what @Richard K said, you should have an arbiter node when you have an even number of nodes (primary+secondary) in a replica set. – Amareswar Feb 03 '13 at 21:04
  • In addition, consider MongoDB when more aggregation needs to be done for data analytics. – user1503117 Oct 01 '15 at 14:04
  • `As long as you think about indexes carefully, it can absolutely scream along, speed-wise.` Wait until your physical memory gets full and the OS starts page faulting lol – Jazzwave06 Jul 21 '19 at 12:47
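The arbiter advice in the comments above comes down to majority arithmetic: a replica set can only elect a primary while a strict majority of its voting members can reach each other. A minimal sketch, with a hypothetical helper name and the node counts as the only inputs:

```python
def can_elect_primary(voting_members, reachable_members):
    """A primary needs a strict majority of the voting members."""
    return reachable_members > voting_members // 2


# Two data nodes, one fails: 1 of 2 is not a majority,
# so the survivor stays secondary.
two_nodes_one_down = can_elect_primary(voting_members=2, reachable_members=1)

# Two data nodes plus an arbiter, one data node fails: 2 of 3 is a majority.
with_arbiter_one_down = can_elect_primary(voting_members=3, reachable_members=2)
```

The arbiter holds no data; in this model it only adds a vote, so that an even number of data nodes becomes an odd number of voters.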
122

Why choose between a traditional database and a NoSQL data store? Use both! The problem with NoSQL solutions (beyond the initial learning curve) is the lack of transactions, so do all updates in MySQL and have MySQL populate a NoSQL data store for reads -- you then benefit from each technology's strengths. This does add more complexity, but you already have the MySQL side -- just add MongoDB, Cassandra, etc to the mix.
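One way to picture the dual-store idea is the sketch below. It is deliberately self-contained, so it uses stand-ins: an in-memory SQLite database plays the role of MySQL and a plain dictionary plays the role of the NoSQL read store. The table layout, the key format, and the `save_user`/`load_user` helpers are all assumptions made for illustration.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the MySQL side
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")

read_store = {}  # stand-in for MongoDB, Cassandra, etc.


def save_user(user_id, name, email):
    # 1. The transactional write goes to the relational database first.
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            # INSERT OR REPLACE is SQLite syntax; MySQL spells this differently.
            "INSERT OR REPLACE INTO users (id, name, email) VALUES (?, ?, ?)",
            (user_id, name, email),
        )
    # 2. Only after the commit is the denormalized copy published for reads.
    read_store[f"user:{user_id}"] = json.dumps(
        {"id": user_id, "name": name, "email": email}
    )


def load_user(user_id):
    # Reads never touch the relational database.
    raw = read_store.get(f"user:{user_id}")
    return json.loads(raw) if raw is not None else None
```

A real setup would also have to decide what happens when step 2 fails after step 1 has committed, for example by retrying or by rebuilding the read store from the relational data.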

NoSQL datastores generally scale way better than a traditional DB on otherwise identical specs -- there is a reason why Facebook, Twitter, Google, and most start-ups are using NoSQL solutions. It's not just geeks getting high on new tech.

Jason Grant Taylor
  • 8
    I totally agree. I am using MongoDB + MySQL in an upcoming product that I am architecting. It is an upcoming financial product cloud. MySQL is used where we absolutely need transactional capabilities. MongoDB is used to store non-computing complex data structures that just need to be pulled up when required. Working well so far. :) – Ram on Rails Jul 19 '13 at 17:05
  • I also used such a dual approach in most of my projects, and in some others the NFS mounted file system was used together with PostgreSQL for seismic blobs nearing 1 Gb in some cases. A path is a kind of query to the key value database. – Audrius Meškauskas Aug 28 '14 at 11:51
  • 1
    Here is a link to a question I asked about how to architect both sql and nosql databases: http://dba.stackexchange.com/questions/102053/how-to-design-databases-with-sql-and-nosql-databases?noredirect=1#comment184656_102053 I could use some insight you may have – j will May 20 '15 at 15:45
  • He already has escaped from transactions for good => now infinite scalability might be possible .. otherwise -> not :) – bodrin Oct 14 '15 at 14:44
  • If you add MySQL, it's cumbersome to scale linearly like cassandra. You might end up with a single point of failure and a clumsy way to restore your data after a server failure. – Rafael Sanches Mar 10 '16 at 22:40
  • "you do all updates to MySQL and have MySQL populate a NoSQL data store for reads". Isn't NoSQL optimized for writes, not reads? From [datastax](http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling), on Cassandra: "Cassandra is optimized for high write throughput, and almost all writes are equally efficient. If you can perform extra writes to improve the efficiency of your read queries, it’s almost always a good tradeoff. Reads tend to be more expensive and are much more difficult to tune." – socom1880 Sep 07 '16 at 03:58
  • CQRS does fit into this. – Viku Mar 13 '18 at 11:36
  • 1
    This is not a good solution if your data is distributed – Esteban Verbel Oct 25 '18 at 15:37
  • As of version 4.0, mongodb does support multi-document transactions with all ACID properties. – Grigori Melnik Jan 08 '19 at 03:05
60

I'm probably going to be an odd man out, but I think you need to stay with MySQL. You haven't described a real problem you need to solve, and MySQL/InnoDB is an excellent storage back-end even for blob/json data.

There is a common tendency among Web engineers to reach for NoSQL as soon as they realize that not all features of an RDBMS are being used. This alone is not a good reason, since most often NoSQL databases have rather poor data engines (what MySQL calls a storage engine).

Now, if you're not of that kind, then please specify what is missing in MySQL that you're looking for in a different database (like auto-sharding, automatic failover, multi-master replication, a weaker data consistency guarantee in a cluster paying off in higher write throughput, etc).

Kostja
  • 13
    He is using sharding, which means his data is partitioned manually across servers. Mongodb can automate sharding, which may be a benefit. – fabspro Feb 14 '13 at 11:17
  • 18
    He is also storing mostly JSON blobs in RDBMS -- rendering relational design (features) useless. – Damir Sudarevic Mar 22 '13 at 11:54
  • 4
    The data model and automatic sharding are indeed different, but when choosing a database, you need to look at the storage engine *first*, and the rest of bells and whistles second. How is the storage engine going to perform under a load spike? How is autosharding feature going to perform under a data inflow spike? Before you relinquish control to the database for these important aspects, you'd better make sure it's going to be capable of the task. – Kostja Apr 30 '13 at 09:23
  • 7
    Relational model is one of the most well thought-out, efficient to implement and frugal data models out there. "Rendering relational design features useless" may relate to constraints, triggers, or referential integrity - but these all are pay per use. – Kostja Jul 12 '13 at 17:58
20

I haven't used Cassandra, but I have used MongoDB and think it's awesome.

If you're after simple setup, this is it: You simply untar MongoDB and run the mongod daemon and that's it ... it's running.

Obviously that's only a starter, but to get you started it's easy.

user2066657
dalton
  • 26
    AFAIK, the same applies to Cassandra as well. Untar, run the daemon. The test cluster is set up and ready for production! – asgs Jun 04 '15 at 18:55
13

I saw a presentation on mongodb yesterday. I can definitely say that setup was "simple", as simple as unpacking it and firing it up. Done.

I believe that both mongodb and cassandra will run on virtually any regular linux hardware, so you should not find too much of a barrier in that area.

I think in this case, at the end of the day, it will come down to which you personally feel more comfortable with and which has a toolset that you prefer. As far as the presentation on mongodb, the presenter indicated that the toolset for mongodb was pretty light and that there weren't many (they said any, really) tools similar to what's available for MySQL. This was of course their experience, so YMMV. One thing that I did like about mongodb was that there seemed to be lots of language support for it (Python and .NET being the two that I primarily use).

The list of sites using mongodb is pretty impressive, and I know that twitter just switched to using cassandra.

GrayWizardx
  • 4
    At the end of the day it is an apples vs oranges comparison. Both databases have their own strengths. Here are some things to consider: object model, secondary indexes, write scalability, high availability etc. I have a blog post that explains the high level strategic differences between mongodb and cassandra here - https://scalegrid.io/blog/cassandra-vs-mongodb/ – Dharshan Aug 14 '16 at 02:19