227

There has been a lot of talk related to Cassandra lately.

Twitter, Digg, Facebook, etc. all use it.

When does it make sense to:

  • use Cassandra,
  • not use Cassandra, and
  • use an RDBMS instead of Cassandra.
Luke
JimJim
  • 8
    Probably should be CW? This is pretty much just NoSQL vs Relational databases, which is pretty subjective IMO. – Ed James Apr 14 '10 at 13:43
  • 3
    I would like to know if it is suitable for a messaging system. I assume that if Twitter uses it then it would be okay; however, they might not use it for all of Twitter? – Luke Apr 14 '10 at 13:45
  • http://techblog.bozho.net/?p=232 – Bozho Sep 14 '10 at 20:28

18 Answers

186

There is no silver bullet: everything is built to solve specific problems and has its own pros and cons. It is up to you to decide what your problem statement is and which solution fits it best.

I will try to answer your questions one by one, in the order you asked them. Since Cassandra belongs to the NoSQL family of databases, it's important to understand why you would use a NoSQL database before I answer your questions.

Why use NoSQL

In the case of RDBMSs, making a choice is quite easy because all the databases in this category (MySQL, Oracle, MS SQL, PostgreSQL) offer almost the same kind of solution, oriented toward ACID properties. When it comes to NoSQL, the decision becomes difficult because every NoSQL database offers a different solution, and you have to understand which one is best suited to your app/system requirements. For example, MongoDB fits use cases where your system demands a schema-less document store. HBase might fit search engines, log data analysis, or any place where scanning huge, two-dimensional, join-less tables is a requirement. Redis is built to provide in-memory search across a variety of data structures such as trees, queues, and linked lists, and can be a good fit for real-time leaderboards or pub-sub systems. Similarly, there are other databases in this category (including Cassandra) that fit different problem statements. Now let's move to the original questions and answer them one by one.

When to use Cassandra

Being part of the NoSQL family, Cassandra offers a solution for problems where one of your requirements is a very write-heavy system and you want a fairly responsive reporting system on top of the stored data. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits per hour, by browser, by IP, etc. in real time. You can refer to this blog post to understand more about the use cases where Cassandra fits in.
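To make the web analytics use case concrete, here is a minimal Python sketch of the write pattern it implies: every request increments a few counters keyed by hour, browser, and IP, so that reports become simple key lookups. It is a single-process toy and does not use Cassandra; the names `hour_bucket` and `HitCounters` are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to its UTC hour, e.g. '2010-04-14T13'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H")


class HitCounters:
    """Hypothetical in-memory stand-in for write-time counters."""

    def __init__(self):
        self.counts = Counter()

    def record(self, ts: float, browser: str, ip: str) -> None:
        # One incoming request turns into three cheap counter increments.
        bucket = hour_bucket(ts)
        self.counts[("hour", bucket)] += 1
        self.counts[("browser", bucket, browser)] += 1
        self.counts[("ip", bucket, ip)] += 1

    def hits(self, *key) -> int:
        # A report is a plain key lookup; unknown keys count as zero.
        return self.counts[key]
```

The design choice this illustrates is that the work is done at write time, which suits a store that is optimized for writes, at the cost of deciding up front which reports you need.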

When to use an RDBMS instead of Cassandra

Cassandra is a NoSQL database and does not provide ACID guarantees or relational features. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a fit. You can work around that, but you will end up writing a lot of application code to simulate ACID properties, and your time to market will suffer badly. Managing that kind of system with Cassandra would also be complex and tedious.

When not to use Cassandra

I don't think it needs to be answered if the above explanation makes sense.

LeCodex
Ajay Tiwari
  • 1
    The problem with the answer is that it lumps all NoSQL solutions together. See http://dataconomy.com/sql-vs-nosql-need-know/ for more info. In the NoSQL landscape the basic divisions are document, key-value, graph and big-table. They have different characteristics for different problems. A solution that is a good match for mongo may not be a good match for cassandra. – Yehosef Feb 08 '16 at 16:12
  • 19
    The only way this response "lumps all NoSQL solutions together" is by the category NoSQL; other than that the post does a great job of pointing out that each NoSQL database "offers a different solution" for different problems. I did not get the feeling that the author even slightly hinted that mongo, cassandra, or any other NoSQL database solve the same problems. – Nick Suwyn Mar 07 '16 at 20:35
  • 1
    `NoSQL database` is not a thing. `NoSQL` is just a term used for modern non-relational databases (see [wiki](https://en.wikipedia.org/wiki/NoSQL)). – eddyP23 Sep 08 '16 at 09:01
  • 2
    Also, note that not all NoSQL databases are not ACID. Graph DBs are usually ACID. – eddyP23 Sep 08 '16 at 09:05
  • Cassandra supports row level atomic operation and Atomic and Isolation per partition using Light Weight Transactions. If my requirement is to have ACID at row level can I not use Cassandra? Even for critical data? – TechEnthusiast Oct 11 '17 at 03:25
  • Note that the preference for heavy writes is mainly for full writes. It's less efficient and in particular more development effort to do heavy updating of your data stored. Which I would also add as a counter-indication for using Cassandra. – Frank Hopkins Feb 18 '19 at 10:51
57

When evaluating distributed data systems, you have to consider the CAP theorem - you can pick two of the following: consistency, availability, and partition tolerance.

Cassandra is an available, partition-tolerant system that supports eventual consistency. For more information see this blog post I wrote: Visual Guide to NoSQL Systems.
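As a rough illustration of what "eventual consistency" means in practice, the following Python sketch models three replicas that accept writes independently and reconcile later using last-write-wins timestamps. It is a toy model under that assumption, not Cassandra's actual replication code; `Replica` and `anti_entropy` are hypothetical names.

```python
class Replica:
    """Toy replica storing key -> (timestamp, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        # Last-write-wins: keep the entry with the highest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Bring every replica up to the newest known version of each key."""
    merged = {}
    for replica in replicas:
        for key, (ts, value) in replica.data.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    for replica in replicas:
        for key, (ts, value) in merged.items():
            replica.write(key, value, ts)
```

Between a write to one replica and the next reconciliation, a read from another replica can return stale data (or nothing); after reconciliation all replicas agree. That window is the price paid for staying available during a partition.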

Cerbrus
Nathan Hurst
  • When is the last time you saw a partition where both of the partitions were large? See my question http://stackoverflow.com/questions/7969874/is-the-cap-theorem-a-red-herring – Aaron Watters Nov 03 '11 at 12:26
  • 5
    Cassandra also apparently lets you specify your consistency requirement at query time, which may be a useful compromise for some use cases – Richard Marr Feb 11 '15 at 14:06
34

Cassandra is the answer to a particular problem: what do you do when you have so much data that it does not fit on one server? How do you store all your data on many servers without breaking your bank account or driving your developers insane? Facebook gets 4 terabytes of new compressed data EVERY DAY, and this number will most likely more than double within a year.

If you do not have this much data, or if you have millions to pay for an enterprise Oracle/DB2 cluster installation and the specialists required to set it up and maintain it, then you are fine with a SQL database.

However, Facebook no longer uses Cassandra and now uses MySQL almost exclusively, moving the partitioning up in the application stack for faster performance and better control.

Lucifer
Vagif Verdi
  • 2
    Would you know why FB stopped using Cassandra? Also what you do mean by "moving the partitioning up in the application stack"? Is it that FB uses multiple MySQL tables and decides which one to use to for a dataset using some application logic? – Manu Chadha Jul 15 '20 at 19:12
  • @Vargif Verdi MongoDB can also answer your particular problem, right? So in that case should we use MongoDB or Cassandra? – MrSham Jul 27 '20 at 02:27
29

The general idea of NoSQL is that you should use whichever data store is the best fit for your application. If you have a table of financial data, use SQL. If you have objects that would require complex/slow queries to map to a relational schema, use an object or key/value store.

Of course just about any real world problem you run into is somewhere in between those two extremes and neither solution will be perfect. You need to consider the capabilities of each store and the consequences of using one over the other, which will be very much specific to the problem you are trying to solve.

Tom Clarkson
  • What is the advantage of SQL when using financial data? – Paco Apr 26 '10 at 14:25
  • 3
    The schema is unlikely to change, it fits well in a table structure, and lost/inconsistent data could cause real problems. – Tom Clarkson Apr 27 '10 at 00:28
  • 4
    I don't understand why inconsistent data can cause real problems with banks. Scenario: You have one bank account, with $100 above the limit on it, and two bank cards. When you try to withdraw money with the two cards at the same time at 2 different ATMs, you will get 2 times $100, and a letter with an extra fee in your mail box. The bank earns money (the extra fee for being below the limit) by using inconsistent data. It's too hard to connect all ATMs in the world with each other through one large relational database. Can you give an example where inconsistent financial data can be a problem? – Paco Apr 27 '10 at 16:00
  • 5
    That stuff is all COBOL and batch processing, and not nearly as well designed/stable as you might think. ATMs do not connect to any sort of unified data store, so are hardly a suitable example. It's like saying SQL isn't suitable for web apps because you can't give everyone on the internet direct access to your database. Besides, I never said anything about banks - think things like orders on an ecommerce site where you don't have to deal with an organization so conservative that SQL is considered new and untrusted. – Tom Clarkson Apr 28 '10 at 02:26
  • 1
    So the only reason is conservatism, no technical reason? – Paco Apr 28 '10 at 08:50
  • 1
    You seem to be missing the point. Technically anything is possible, using any set of tools, but that doesn't make it a good idea. For tracking sales, the benefits of sql outweigh the disadvantages. If you think you can set up a banking system using new technology, good luck to you. – Tom Clarkson Apr 28 '10 at 23:34
  • 6
    @Paco: The first ATM reads your balance($100), and the second ATM does the same. Both ATMs deduct $100 from $100 and write the final balance of $0 back to your account. Result: the bank loses $100. – Seun Osewa May 01 '10 at 21:42
  • 1
    @Seun Osewa: That would be a stupid bank. A normal bank would ask you to pay back $100 and a ridiculous interest rate for being below the limit and earn some money instead of losing money. – Paco May 01 '10 at 23:54
  • @Tom Clarkson: When you cannot name a benefit, there is no benefit. – Paco May 01 '10 at 23:55
  • 9
    @Paco: The point is, without proper transaction isolation, the normal bank won't even know the account has been overdrawn. They won't even know. – Seun Osewa May 03 '10 at 21:40
  • 1
    @Seun Osewa: A bank does not use atomic transactions for withdrawing money from an ATM. It would cost too much hardware to connect all ATMs in the world to the same database with atomic transactions. – Paco May 04 '10 at 09:38
15

Besides the answers given above about when to use and when not to use Cassandra, if you do decide to use Cassandra you may want to consider not using Cassandra itself, but one of its many cousins out there.

Some answers above already pointed to various "NoSQL" systems which share many properties with Cassandra, with some small or large differences, and may be better than Cassandra itself for your specific needs.

Additionally, recently (several years after this question was originally asked), a Cassandra clone called Scylla (see https://en.wikipedia.org/wiki/Scylla_(database)) was released. Scylla is an open-source re-implementation of Cassandra in C++, which claims to have significantly higher throughput and lower latencies than the original Java Cassandra, while being mostly compatible with it (in features, APIs, and file formats). So if you're already considering Cassandra, you may want to consider Scylla as well.

Nadav Har'El
  • sorry but this is no answer to the original question asked – Gautam Jain Oct 31 '20 at 15:31
  • 1
    That's your opinion... 13 people thought otherwise. Let's face it - one way of *not* using Cassandra is using something which is similar to Cassandra, but not Cassandra. – Nadav Har'El Oct 31 '20 at 17:42
15

I will focus here on some of the important aspects that can help you decide whether you really need Cassandra. The list is not exhaustive, just some points that are at the top of my mind:

  • Don't consider Cassandra as the first choice when you have a strict requirement for relationships (across your dataset).

  • Cassandra is by default an AP system (in CAP terms). However, it supports tunable consistency, which means it can be configured to behave as a CP system as well. So don't ignore it just because you read somewhere that it's AP and you are looking for CP systems. Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.

  • Don't use Cassandra if your scale is small or if you can get by with a non-distributed DB.

  • Think harder if your team believes that all your problems will be solved by using a distributed DB like Cassandra. Getting started with these DBs is very simple because they come with many defaults, but optimizing and mastering one to solve a specific problem requires a good (if not large) amount of engineering effort.

  • Cassandra is column-oriented but at the same time each row also has a unique key. So, it might be helpful to think of it as an indexed, row-oriented store. You can even use it as a document store.

  • Cassandra doesn't force you to define the fields beforehand. So if you are in startup mode or your features are evolving (as in agile), Cassandra embraces it. Better still, think about your queries first and then about the data needed to answer them.

  • Cassandra is optimized for really high throughput on writes. If your use case is read-heavy (like cache) then Cassandra might not be an ideal choice.
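The tunable-consistency point in the list above is often summarized with a rule of thumb: if the number of replicas that must acknowledge a write plus the number that must answer a read exceeds the replication factor, every read overlaps with the latest write. The Python sketch below only encodes that arithmetic; the function names are illustrative and are not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas, e.g. 2 out of 3."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_acks: int,
                           read_acks: int) -> bool:
    """True when the read set and the write set must share a replica."""
    return write_acks + read_acks > replication_factor
```

With a replication factor of 3, writing and reading at quorum (2 + 2 > 3) gives overlapping sets, while writing and reading at a single replica (1 + 1) does not, which is the AP-leaning end of the trade-off.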

rai.skumar
13

Right. It makes sense to use Cassandra when you have a huge amount of data, a huge number of queries but very little variety of queries. Cassandra basically works by partitioning and replicating. If all your queries will be based on the same partition key, Cassandra is your best bet. If you get a query on an attribute that is not the partition key, Cassandra allows you to replicate the whole data with a new partition key. So now you have 2 replicas of the same data with 2 different partition keys.

Which brings me to your next question. When not to use Cassandra. As I mentioned, Cassandra scales by replicating the complete database for every new partitioning key. But you can't keep making new copies again and again. So when you have a high variety in queries i.e. each query has a different column in the where clause, Cassandra is not a good option.
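The "one copy of the data per query" idea can be sketched in a few lines of Python. Two dictionaries stand in for two tables that hold the same rows under different partition keys; this is a toy model of the pattern, and `DenormalizedOrders` with its field names is invented for illustration.

```python
from collections import defaultdict


class DenormalizedOrders:
    """Toy stand-in for two query tables holding the same rows."""

    def __init__(self):
        self.by_customer = defaultdict(list)  # partition key: customer
        self.by_product = defaultdict(list)   # partition key: product

    def insert(self, order: dict) -> None:
        # Every write goes to each copy, once per supported query.
        self.by_customer[order["customer"]].append(order)
        self.by_product[order["product"]].append(order)

    def for_customer(self, customer: str) -> list:
        return list(self.by_customer.get(customer, []))

    def for_product(self, product: str) -> list:
        return list(self.by_product.get(product, []))
```

Each new kind of query needs another dictionary and another write in `insert`, which is why a wide variety of queries quickly becomes expensive in this model.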

Now for the third question. The whole point of using RDBMS is when you want the ACID properties. If you are building something like a payment service and want each transaction to be isolated, each transaction to either complete or not happen at all, changes to be persistent despite system failure, and the money to be consistent across bank accounts before and after the transaction completes, an RDBMS is the only option that will help you achieve this.

This article actually explains the whole thing, especially when to use Cassandra or not (as opposed to some other NoSQL option) part of the question -> Choosing the best Database. Do check it out.

EDIT: To answer the question in the comments by proximab: when we think of banking systems, we immediately think "ACID is the best solution". But even banking systems are made up of several subsystems that might not be dealing with any transaction-related data at all, such as account holders' personal information, account statements, credit card details, credit histories, etc.

All of this information needs to be stored in some database or another. If you store account-related information like the account balance, that is something that needs to be consistent at all times. For example, if you try to send money from account A to account B, then the money that disappears from account A should instantaneously show up in account B, and it cannot be present in both accounts at the same time. This system cannot be inconsistent at any point. This is where ACID is of utmost importance.
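The A-to-B transfer described above is the classic case for a database transaction. As a minimal sketch, here it is with Python's built-in `sqlite3` module standing in for an RDBMS; the table layout, the `CHECK` constraint, and the helper names are assumptions made for this example.

```python
import sqlite3


def open_bank() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("A", 100), ("B", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # If this debit violates the CHECK, the credit above is undone too.
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

The credit is deliberately issued before the debit so that a failed transfer shows the rollback: the money never appears in B unless it also leaves A.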

On the other hand, if you are saving credit card details or credit histories, which should not get into the wrong hands, then you need something that allows access only to authorised users. That, I believe, is supported by Cassandra. That said, data like credit history and credit card transactions is, I think, ever-increasing data. Also, there is only so much you can query on this data, i.e. it has a very finite number of queries. These two conditions make Cassandra a perfect solution.

Deeksha Kaul
10

From talking with someone in the midst of deploying Cassandra: it doesn't handle many-to-many relationships well. They are resorting to a hack to get through their initial testing. I spoke with a Cassandra consultant about this, and he said he wouldn't recommend it if you had this problem set.

Warren
7

You should ask yourself the following questions:

  1. (Volume, Velocity) Will you be writing and reading TONS of information, so much information that no one computer could handle the writes?
  2. (Global) Will you need this writing and reading capability around the world, so that the writes in one part of the world are accessible in another part of the world?
  3. (Reliability) Do you need this database to be up and running all the time and never go down, regardless of which cloud, which country, and whether it's a VM, a container, or bare metal?
  4. (Scalability) Do you need this database to be able to continue to grow easily and scale linearly?
  5. (Consistency) Do you need TUNABLE consistency, where some writes can happen asynchronously whereas others need to be certified?
  6. (Skill) Are you willing to do what it takes to learn this technology and the data modeling that goes with creating a globally distributed database that can be fast for everyone, everywhere?

If for any of these questions you thought "maybe" or "no," you should use something else. If you had "hell yes" as an answer to all of them, then you should use Cassandra.

Use RDBMS when you can do everything on one box. It's probably easier than most and anyone can work with it.

Rahul Singh
4

Heavy single query vs. gazillion light query load is another point to consider, in addition to other answers here. It's inherently harder to automatically optimize a single query in a NoSql-style DB. I've used MongoDB and ran into performance issues when trying to calculate a complex query. I haven't used Cassandra but I expect it to have the same issue.

On the other hand, if your load is expected to be that of very many small queries, and you want to be able to easily scale out, you could take advantage of eventual consistency that is offered by most NoSql DBs. Note that eventual consistency is not really a feature of a non-relational data model, but it is much easier to implement and to set up in a NoSql-based system.

For a single, very heavy query, any modern RDBMS engine can do a decent job of parallelizing parts of the query and taking advantage of as much CPU and memory as you throw at it (on a single machine). NoSql databases don't have enough information about the structure of the data to make the assumptions that would allow truly intelligent parallelization of a big query. They do allow you to easily scale out to more servers (or cores), but once the query reaches a certain level of complexity you are basically forced to split it manually into parts that the NoSql engine knows how to deal with intelligently.

In my experience with MongoDB, in the end because of the complexity of the query there wasn't much Mongo could do to optimize it and run parts of it on multiple data. Mongo parallelizes multiple queries but isn't so good at optimizing a single one.

sinelaw
4

Let's read some real world cases:

http://planetcassandra.org/apache-cassandra-use-cases/

In this article: http://planetcassandra.org/blog/post/agentis-energy-stores-over-15-billion-records-of-time-series-usage-data-in-apache-cassandra

They explained that they didn't choose MySQL because DB synchronization was too slow (and also because of two-phase commit, FKs, and PKs).


Cassandra is based on the Amazon Dynamo paper.

Features:

  • Stability
  • High availability
  • Backup performs well
  • Reads and writes perform better than HBase (a BigTable clone in Java)

wiki http://en.wikipedia.org/wiki/Apache_Cassandra

Their conclusion is:

We looked at HBase, Dynamo, Mongo and Cassandra. 

Cassandra was simply the best storage solution for the majority of our data.

As of 2018, I would recommend using ScyllaDB to replace classic Cassandra, if you need back support.

A Postgres KV plugin is also quicker than Cassandra. However, it won't have multi-instance scalability.

CodeFarmer
  • You don't have to settle with only one database technology. You can actually have a combo and use whichever is appropriate for the specific issue. – Pepito Fernandez Oct 12 '17 at 14:16
3

Another situation that makes the choice easier is when you want to use aggregate functions like sum, min, max, etc., and complex queries (like in the financial system mentioned above). In that case a relational database is probably more convenient than a NoSQL database, since neither is possible on a NoSQL database unless you use a lot of inverted indexes. When you do use NoSQL, you have to compute the aggregates in code or store them separately in their own column family, but this makes everything quite complex and reduces the performance you gained by using NoSQL.
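To show what "doing the aggregate functions in code" and storing them separately can look like, here is a small Python sketch that keeps running sum, min, and max values next to the raw rows and updates them on every write. It is an in-memory illustration only; `RunningAggregate` and `AggregatingStore` are hypothetical names, and a second dictionary stands in for the separate column family.

```python
from collections import defaultdict


class RunningAggregate:
    """Sum, min, max and count maintained incrementally."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.minimum = None
        self.maximum = None

    def add(self, value):
        self.count += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


class AggregatingStore:
    """Raw rows plus a separately stored aggregate per key."""

    def __init__(self):
        self.rows = defaultdict(list)
        self.aggregates = defaultdict(RunningAggregate)

    def insert(self, account, amount):
        # The application, not the database, keeps the aggregate in sync.
        self.rows[account].append(amount)
        self.aggregates[account].add(amount)

    def summary(self, account):
        agg = self.aggregates.get(account)
        if agg is None:
            return None
        return {"count": agg.count, "sum": agg.total,
                "min": agg.minimum, "max": agg.maximum}
```

In a relational database the same summary would be a single `SELECT` with `SUM`, `MIN`, and `MAX`; here every write path has to remember to update the aggregate, which is the extra complexity the answer refers to.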

ronaldmathies
  • CouchDB, for one, allows computing aggregate functions very easily: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Reduce_Functions. Technically, this is "in code" but it's not nearly as "complex" to accomplish as it would be with Cassandra. – user359996 Dec 02 '10 at 19:32
  • 2
    Actually I agree that it may take you a day to write aggregate in code, but you can write it to run on a backend server which will use close to 0 cycles of the database. With an SQL database, you'll get the result writing one line which may take you 5 min. but it will slow down the whole database each time you run it. So there are pros and cons both ways. My bank, for example, closes all website accesses in the middle of the night for about 10 to 15 minutes. They most certainly are using COBOL, but that's a very similar problem. – Alexis Wilke Jan 04 '13 at 01:43
2

Cassandra is a good choice if:

  1. You don't require the ACID properties from your DB.

  2. There will be a massive number of writes to the DB.

  3. There is a requirement to integrate with Big Data, Hadoop, Hive and Spark.

  4. There is a need for real-time data analytics and report generation.

  5. There is a requirement for an impressive fault-tolerance mechanism.

  6. There is a requirement for a homogeneous system.

  7. There is a requirement for lots of customisation for tuning.

KayV
1

If you need a fully consistent database with SQL semantics, Cassandra is NOT the solution for you. Cassandra supports key-value lookups. It does not support SQL queries. Data in Cassandra is "eventually consistent". Concurrent lookups of data may be inconsistent, but eventually lookups are consistent.

If you need strict semantics and support for SQL queries, choose another solution such as MySQL or PostgreSQL, or combine Cassandra with Solr.

  • 1
    [Cassandra Query Language (CQL)](http://cassandra.apache.org/doc/latest/cql/) is _pretty similar_ to SQL, though. In fact, I'd say that CQL is an advantage of Cassandra over other NoSQL options for those looking for an SQL-like interface. – arussell84 Mar 09 '17 at 14:40
  • 2
    Cassandra is not technically eventually consistent. Cassandra lets you trade off consistency for availability. Cassandra is basically balancing CAP theorem. You can have eventually consistent write, and then read consistently, vice versa, or consistent on both, and this all depends on your replication factor combined with your read/write level. I get the answer did put "eventually consistent" in quotes likely for this reason, but I feel like some clarity is in order. – tsturzl Aug 11 '17 at 16:28
1

Apache Cassandra is a distributed database for managing large amounts of structured data across many commodity servers, while providing a highly available service with no single point of failure.

The architecture is based on the CAP theorem: it provides availability and partition tolerance and, interestingly, eventual consistency.

Don't use it if you're not storing volumes of data across racks of clusters, if you are not storing time-series data, if you are not partitioning your servers, or if you require strong consistency.

Remario
  • Strong consistency guarantees that a server always takes a write and every read provides the most recent value. – Remario Dec 07 '17 at 23:50
0

MongoDB has very powerful aggregate functions and an expressive aggregation framework. It has many of the features developers are accustomed to using from the relational database world. Its document data/storage structure allows for more complex data models than Cassandra, for example.

All this comes with trade-offs of course. So when you select your database (NoSQL, NewSQL, or RDBMS) look at what problem you are trying to solve and at your scalability needs. No one database does it all.

Sam Taha
0

According to DataStax, Cassandra is not the best fit when there is a need for:

  1. High-end hardware devices.
  2. ACID compliance with no rollback (bank transactions).

Mike
0
  • It does not support complete transaction management across the tables.
  • Secondary indexes are not supported.
  • You have to rely on Elasticsearch/Solr for secondary indexes, and a custom sync component has to be written.
  • It is not an ACID-compliant system.
  • Query support is limited.