
I understand that Cassandra is massively scalable, but it currently has a limit of 2 billion individual pieces of information that can be stored.

Now, say I want to store information in a table and I have 20 billion data points. An example might be storing multiple devices (desktop PC, mobile devices, etc.) per user, where there are over 7 billion individuals (possible users) on the planet. With multiple devices per person, it is conceivable that the data set could reach 20+ billion records.

  1. Can Cassandra handle this scenario? If possible, then how?
  2. If not, how can this scenario be handled?

1 Answer

Yes, Cassandra can store 20 billion or more individual pieces of data.

The maximum number of cells (rows x columns) in a single partition is 2 billion.

This is the limitation you alluded to, but it is more specific than your interpretation: the limit applies to a single partition, not to the database as a whole. Since each partition can hold at most 2 × 10⁹ cells, storing the hypothetical 20 × 10⁹ records would require a minimum of 10 separate partitions. Creating 10 partitions is easy.

This is the answer to the "how" in the original question: Cassandra scales beyond this limitation when you, the application developer, split the data across multiple partitions.
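The arithmetic behind this is worth spelling out. Here is a minimal sketch (assuming the hypothetical 20B-record, 7B-user scenario from the question) showing both the absolute minimum partition count and, more realistically, how small the average partition becomes when each user is a partition key:

```python
import math

total_records = 20_000_000_000           # hypothetical 20B data points
cells_per_partition_cap = 2_000_000_000  # Cassandra's per-partition cell limit

# Minimum partitions needed if every partition were filled to the cap:
min_partitions = math.ceil(total_records / cells_per_partition_cap)
print(min_partitions)  # 10

# With the user/device model (~7B users, each user a partition key),
# the average partition holds only a handful of cells -- nowhere near the cap:
users = 7_000_000_000
avg_cells_per_partition = total_records / users
print(round(avg_cells_per_partition, 1))  # 2.9
```

In practice you would never design toward 10 maximally-full partitions; the per-user model on the second line is closer to a healthy schema.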

In fact, a well-designed, healthy Cassandra cluster will consist of thousands or millions (or more) individual partitions. While each partition can theoretically contain a unique set of two billion data points, in practice you are unlikely to see partitions grow to be that large, and you should not design your schema with the intent to reach that limit. (After all, it is a limit and should be avoided.)

A single node (separate machine) in a Cassandra cluster can store multiple partitions, but the data for each partition must be able to reside completely within one node. That node must also perform sort operations on the partition when making changes to its data. You can probably imagine that sorting anywhere close to a billion data points will take measurable amounts of time. Instead, Cassandra intends you to scale "massively" by distributing the work by distributing the data across multiple nodes. Production clusters can easily consist of dozens, hundreds, or even thousands of individual nodes.
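To make the placement idea concrete, here is a toy sketch of how a partition key determines which node holds a partition. This is only an illustration: real Cassandra hashes the key with Murmur3 onto a token ring (with vnodes and replication), not with MD5 and a modulo, but the principle — same key always lands on the same node, different keys spread across the cluster — is the same:

```python
import hashlib

def node_for_partition(partition_key: str, num_nodes: int) -> int:
    """Toy stand-in for Cassandra's token-based placement.

    Hashes the partition key and maps the resulting token onto one of
    num_nodes nodes. Deterministic: the same key always maps to the
    same node.
    """
    token = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return token % num_nodes

# Each user's partition lives wholly on one node (plus replicas), but
# different users' partitions spread across the whole cluster:
cluster_size = 12
for user_id in ("user-1", "user-2", "user-3"):
    print(user_id, "-> node", node_for_partition(user_id, cluster_size))
```

Adding nodes grows `num_nodes`, so the same data spreads over more disks and more CPUs, which is exactly the "massive" scaling path described above.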

  1. Avoid getting anywhere close to the 2B/partition limit by splitting data across many partitions.
  2. Each node will be able to hold a finite number of partitions, based on the capacity of its disk.
  3. Avoid being limited by disk space by adding more nodes to your cluster, thus distributing the same data across more disks.
  • Thumb up for 'In fact, a well-designed, healthy Cassandra cluster will consist of thousands or millions (or more) individual partitions' – Carlo Bertuccini Nov 28 '14 at 20:26
  • Thanks William for the answer, but it still seems confusing, as there are other documents regarding the Cassandra limitation and its interpretation. So far what I understood: 1. Using a partition key, we can create any number of partitions. 2. Each partition (each row) can have 2B column values. – Avijoy Chakma Nov 30 '14 at 03:38
  • Partition != row. Columns are stored in rows. Rows are stored in partitions. Partitions are stored in nodes. One or more nodes make up a cluster. All of those terms are _different things_. The count of all columns of all rows in _one_ partition must be <= 2 billion. (Tip: make sure you're reading documents that pertain to _recent versions_ of Cassandra; some of the terminology changed meaning a little bit when they introduced CQL.) – William Price Nov 30 '14 at 03:42
  • @William would you check the following link? [link](http://stackoverflow.com/questions/20512710/cassandra-has-a-limit-of-2-billion-cells-per-partition-but-whats-a-partition/27150516#27150516). Actually, a straightforward example would be great to understand and interpret. – Avijoy Chakma Nov 30 '14 at 03:43
  • See slide #48 of [this presentation](http://www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure) – William Price Nov 30 '14 at 03:50
  • For creating a single partition, we will use a unique partition key. Then, using different clustering key values, we can have multiple rows under the same partition. Now, *the count of all columns of all rows in one partition must be <= 2 billion*. Am I correct now? – Avijoy Chakma Nov 30 '14 at 03:55
  • As that's what I said in a previous comment, yes. – William Price Nov 30 '14 at 03:57