1

Having studied about relational databases, document-stores, graph databases, and column-oriented databases, I concluded that something like Cassandra best fits my needs. In particular, the ability to add columns on the fly and no requirement to have a strict schema seals the deal for me. This seems to nicely bridge the gap between a rather novel graph db and a time-tested rdbms.

But I am concerned about how running Cassandra on a single node. Like many others, I can start only with a small amount of data, so more than one node to start with is just not practical. Based on another excellent SO question: Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL? I concluded that Cassandra can indeed be run just fine as a single node, as long as one is willing to give up benefits like availability which are derived from a multi-node setup.

There also seem to be ways of implementing dynamic adding of fields in an RDBMS for instance as discussed here on SO: How to design a database for User Defined Fields? This would, to some extent, mimic schemaless-ness.

So I would now like to understand how do Cassandra and MySQL compare - with regard to features and performance, on a single node setup? What would you advise someone in my situation - start with a simple RDBMS with the plan/intent to switch to Cassandra later on? Or start with Cassandra?

Community
  • 1
  • 1
ahron
  • 803
  • 6
  • 29
  • You will get opinionated responses here. There are no facts available to accurately conclude what's faster since there's no definite requirement from your side - what's fast enough? If Cassandra is appealing to you, why not just use it and see how it goes? MySQL is a relational database, you need non-relational scheme apparently with the ability to add definitions on the fly. While doable, it's not what relational databases are for - it becomes quite tricky to maintain such a schema. – N.B. Jul 29 '15 at 10:18
  • I was afraid of that - opinionated responses. In general, I'd say single millisecond read times is fast enough. But point taken, using a relational database for something which it's not designed for can indeed be a maintenance problem. Thanks for the feedback. – ahron Jul 29 '15 at 10:32
  • The read times depend mostly on where it's being read **from**. If the disk is slow - neither Cassandra or MySQL can implement some magic code that makes it faster. I'm pretty sure that all storage solutions prefer to read from cache, which is usually kept in RAM. If you know the record you want to retrieve, then both Cassandra and MySQL will operate nearly the same. The key here is knowing what you want to read. If you need to search for records first, then we get into the whole "what data structure and where is it stored" discussion. If I were you, I'd just use Cassandra. – N.B. Jul 29 '15 at 10:40
  • Yeah, I'm starting to lean that way too. But as an alternative we're also looking into possibly partitioning the data at the application level and using a traditional database instead... – ahron Jul 29 '15 at 11:34
  • Have you heard of Percona TokuDB? Maybe give that a shot? – N.B. Jul 29 '15 at 11:50
  • Actually, no. Just checked it out briefly and first look is very promising, gonna investigate this further! Many thanks :) – ahron Jul 29 '15 at 11:57
  • While you're at it, there's also MongoDB TokuMX, maybe you can combine the two into something awesome. – N.B. Jul 29 '15 at 12:25
  • MySQL has recently started supporting JSON natively. I am not entirely sure yet, but I believe with this MySQL (+Toku) might obviate the need for a document store like Mongo... http://mysqlserverteam.com/json-labs-release-native-json-data-type-and-binary-format/ – ahron Jul 29 '15 at 12:50
  • What is your use case? Both have totally different purposes and address different concerns. Do you need ACID and relations - so go for MySQL or any other RDBMS. Do you need scaling up to large data centers and terabytes to petabytes of data - so go for Cassandra. – Loic Jul 29 '15 at 16:42
  • The use case, sadly, is not as cut and dry. ACID would be good, but I can also live with eventual consistency - which Cassandra provides. Scaling up to large data centers might happen, but I don't foresee terabytes to petabytes anytime too soon. Same with relations - I will use them if available, but if not, the column families in Cassandra can be set up carefully enough. So it is really a question of performance of one over the other in the initial stages - down the line, Cassandra might become necessary anyway. – ahron Jul 29 '15 at 17:36
  • Consider this - building a monolithic storage (Cassandra scaled on X nodes) versus building a clustered app that's able to work with non-monolithic storage. For some reason, people try to stick everything into a black box and access it from everywhere instead of reversing the logic. If you consider that RDBMs have been around for a while and that Cassandra is built with Java (sorry Java fans) and is under Apache foundation - that would be a huge "I'm getting out of here" for me. Also MySQL is perfectly capable of scaling to mentioned numbers. – N.B. Jul 29 '15 at 19:12

1 Answers1

3

In a single node setup of Cassandra, many of the advantages of Cassandra are lost, so the main reason for doing that would be if you intended to expand to multiple nodes in the future. Performance would tend to favor RDBMS in most applications when using a single node since RDBMS is designed for that environment and can assume all data is local.

The strengths of Cassandra are scalability and availability. You can add nodes to increase capacity and having multiple nodes means you can deal with hardware failures and not have downtime. These strengths come at the cost of more difficult schema design since access is based primarily on consistent hashing. It also means you don't have full SQL available and often must rely on denormalization techniques to support fast access to data. Cassandra is also weak for ACID transactions since it is inherently difficult to coordinate atomic actions on multiple nodes.

RDBMS by contrast is a more mature technology. ACID transactions are no problem. Schema design is much simpler since you can add efficient indexes to any column to optimize queries, and you have joins available so that redundant data can be largely eliminated. By eliminating redundant data it is much easier to keep your data consistent, since there are not multiple copies of data that need to be updated when someone changes their address for example. But you run the risk of running out of space on a single machine to store all your data. And if you get a disk crash you will have downtime and need backups to restore the data, while Cassandra can often easily repair the data on a node that is out of sync. There is also no easy way to scale an RDBMS to handle higher transaction rates other than buying a faster machine.

There are a lot of other differences, but those are the major ones. Neither one is better than the other, but each one may be better suited to certain applications. So it really depends on the requirements of your use case which one will be a better fit.

Jim Meyer
  • 9,275
  • 1
  • 24
  • 49