How does Cassandra store column data on disk?

Question

Say I insert three rows into Cassandra, one by one, in the order below:

ID, firstname, lastname, websitename
1: fname1, lname1, site1
2: fname2, lname2, site2
3: fname3, lname3, site3

The column store stores columns together, like this:

1:fname1, 2:fname2, 3:fname3
1:lname1, 2:lname2, 3:lname3
1:site1, 2:site2, 3:site3

Does this mean that when I insert the first row (1: fname1, lname1, site1), Cassandra writes each of the three columns to a separate disk block, so that when a query needs to read only the firstname column, all the data for that column is in a single block?

Won't that make writes slow, since Cassandra has to write the data to three blocks instead of one to ensure that column data is stored together?

https://stackoverflow.com/questions/13010225/why-many-refer-to-cassandra-as-a-column-oriented-database — emilly, Feb 29 '20 at 15:52

score 3 · Answer 1 · answered Feb 29 '20 at 15:28

Cassandra is not a column-oriented database; it is a partitioned row store. This means that the data in your example will be stored like this:

 "YourTable" : {
   row1 : { "ID":1, "firstname":"fname1", "lastname":"lname1", "websitename":"site1", "timestamp":1582988571},
   row2 : { "ID":2, "firstname":"fname2", "lastname":"lname2", "websitename":"site2", "timestamp":1582989563},
   row3 : { "ID":3, "firstname":"fname3", "lastname":"lname3", "websitename":"site3", "timestamp":1582989572}
   ...
 }

The data is grouped and looked up by primary key, which consists of the partition key plus, optionally, one or more clustering keys.
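To make the grouping concrete, here is a minimal sketch in Python. It is only a toy in-memory model, not Cassandra's API or on-disk format; the names `table`, `insert` and `read_partition` are made up for illustration. It shows that a row's columns stay together and that rows are found through the partition key first, then the clustering key.

```python
# Hypothetical in-memory model of a partitioned row store (not Cassandra's API).
# table maps partition key -> {clustering key -> row}, and each row keeps
# all of its columns together.
table = {}

def insert(partition_key, clustering_key, columns):
    """Store one whole row inside the partition it belongs to."""
    partition = table.setdefault(partition_key, {})
    partition[clustering_key] = dict(columns)

def read_partition(partition_key):
    """Return the rows of one partition, ordered by clustering key."""
    partition = table.get(partition_key, {})
    return [(ck, partition[ck]) for ck in sorted(partition)]

# The table from the question: ID is the partition key and there are no
# clustering columns, so an empty tuple stands in for the clustering key.
insert(1, (), {"firstname": "fname1", "lastname": "lname1", "websitename": "site1"})
insert(2, (), {"firstname": "fname2", "lastname": "lname2", "websitename": "site2"})
insert(3, (), {"firstname": "fname3", "lastname": "lname3", "websitename": "site3"})

print(read_partition(2))
```

Each insert touches exactly one partition, which is why writing a row does not require writing to one block per column.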

Some things to consider:

Cassandra is an append-only store. When you update or delete a record, it does not modify the existing data in place: an update is written as a new record with the new value and a newer timestamp, and a delete writes a marker called a "tombstone" that identifies the records to be removed.
Adding or removing nodes in the cluster triggers a redistribution of tokens, so the node that holds a given record may change.
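The append-only behaviour from the first point can be sketched roughly as follows. This is a simplified illustration in Python, assuming a single list as the write log; `log`, `write`, `delete`, `read` and `TOMBSTONE` are hypothetical names, and real Cassandra resolves timestamps per cell across memtables and SSTables rather than by scanning a list.

```python
# Toy append-only store: nothing is ever modified in place.
TOMBSTONE = object()   # marker meaning "this key was deleted"
log = []               # append-only list of (key, timestamp, value)

def write(key, value, timestamp):
    """An insert or update just appends a new record."""
    log.append((key, timestamp, value))

def delete(key, timestamp):
    """A delete appends a tombstone instead of erasing old records."""
    log.append((key, timestamp, TOMBSTONE))

def read(key):
    """Return the value with the newest timestamp, or None if deleted/absent."""
    latest = None
    for k, ts, value in log:
        if k == key and (latest is None or ts >= latest[0]):
            latest = (ts, value)
    if latest is None or latest[1] is TOMBSTONE:
        return None
    return latest[1]

write(1, "site1", timestamp=100)
write(1, "site1-new", timestamp=200)   # update: a second record, not an overwrite
print(read(1))
delete(1, timestamp=300)               # delete: a third record (the tombstone)
print(read(1), len(log))
```

The old records stay in the log after the update and the delete; only the read path decides which one wins.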

score 1 · Accepted Answer · answered Feb 29 '20 at 14:57

Cassandra isn't a classical column store. It stores all inserted/updated data together, organized first by partition key and then, inside each partition, by clustering columns (the rest of the primary key). Data for the same row may end up in different SSTables if you update it at different points in time, but the compaction process will eventually merge it together.

If you're interested, you can run sstabledump against the data files to see how the data is stored. There is also a very good blog post from The Last Pickle about the storage engine in Cassandra 3.0 (it differs from previous versions).
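As a rough illustration of what merging SSTables means, here is a small Python sketch. It assumes each SSTable is simply a list of `(key, timestamp, value)` tuples sorted by key and then timestamp; the `compact` function is hypothetical and ignores tombstones, cell-level merging and compaction strategies, all of which the real process has to handle.

```python
import heapq

def compact(*sstables):
    """Merge sorted runs of (key, timestamp, value), keeping the newest value per key."""
    merged = []
    # heapq.merge yields records in sorted order, so all records for one key
    # arrive together, oldest timestamp first.
    for key, ts, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, ts, value)   # newer record replaces the older one
        else:
            merged.append((key, ts, value))
    return merged

# Two hypothetical SSTables written at different times.
older = [(1, 100, "site1"), (2, 110, "site2")]
newer = [(1, 200, "site1-new"), (3, 210, "site3")]

print(compact(older, newer))
```

After the merge, key 1 appears once, with the value from the newer SSTable.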

Alex Ott

`Cassandra isn't a classical column store` Doesn't it store the data column-wise, as stated by Bart at https://www.quora.com/What-are-the-main-differences-between-the-four-types-of-NoSql-databases-KeyValue-Store-Column-Oriented-Store-Document-Oriented-Graph-Database ? Then what's the difference between Cassandra and a document-based DB? Why is it called a column-based DB? – emilly Feb 29 '20 at 15:05
Got my answer from https://stackoverflow.com/questions/13010225/why-many-refer-to-cassandra-as-a-column-oriented-database. Thanks – emilly Feb 29 '20 at 15:52
The confusion usually comes from Cassandra being called a wide-column store, like Bigtable, HBase, etc. (https://db-engines.com/en/article/Wide+Column+Stores), but that was mostly before the CQL era – Alex Ott Feb 29 '20 at 18:44

score 1 · Answer 3 · answered Feb 29 '20 at 16:11

Cassandra is basically a column-family database, that is, a partitioned row store that keeps column information with each row; it is not a column-based/columnar/column-oriented database. When inserting or fetching data, we need to specify the partition key (a.k.a. row key, the first part of the primary key). We can add a column at any point in time.

Column-family stores like Cassandra are great if you have high-throughput writes and want to scale horizontally in a linear fashion.

The term "column-family" comes from the original storage engine that was a key/value store, where the value was a "family" of column/value tuples. There was no hard limit on the number of columns that each key could have.
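A small Python sketch of that original model, under the assumption that a plain dictionary is a fair stand-in for the key/value store. The names `column_family`, `put` and `columns_of` are invented for this example, and the `twitter` column is a made-up extra column.

```python
# Toy model of the original "column family": each row key maps to its own
# family of column/value pairs, and keys need not share the same columns.
column_family = {}

def put(row_key, column, value):
    """Add or overwrite one column for a row key."""
    column_family.setdefault(row_key, {})[column] = value

def columns_of(row_key):
    """Return the column/value tuples stored under a row key, sorted by column name."""
    return sorted(column_family.get(row_key, {}).items())

put(1, "firstname", "fname1")
put(1, "websitename", "site1")
put(2, "firstname", "fname2")
put(2, "twitter", "@fname2")   # hypothetical column that only key 2 has

print(columns_of(1))
print(columns_of(2))
```

The value stored under each key is still one unit holding all of that key's columns, which is the opposite of a columnar layout that groups one column's values across all keys.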

Nice explanation! Can you give an example of a column-based/columnar/column-oriented database, as compared to Cassandra (a wide-column/column-family database)? Thanks! – Stan Feb 24 '22 at 02:16
