In Cassandra, do the concepts of wide rows, partitions, clustering columns/keys, and partition keys exist at the querying language level? Or are they internal implementation issues that users of the querying language are not aware of?

Here is an example from "How to understand the concept of wide row and related concepts in Cassandra?". In the query-language commands these concepts do not seem to exist, but under the hood they do.

Consider a table created with a as partition key and b as clustering column:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));
INSERT INTO test (a, b, c) VALUES ('test', 2, 'test2');
INSERT INTO test (a, b, c) VALUES ('test', 1, 'test1');
INSERT INTO test (a, b, c) VALUES ('test-new', 1, 'test1');

If you run the above statements in this order, Cassandra will store the data in the following order (just check the order of column b):

test -> [b:1,c=test1] [b:2,c=test2]
test-new -> [b:1,c=test1]

To pick up the row with b = 1 for partition key test:

SELECT * FROM test WHERE a = 'test' AND b = 1;
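
As a rough mental model (not Cassandra's actual storage code), the layout described above can be sketched in Python: rows are grouped by the partition key a and kept sorted by the clustering column b inside each partition. The names ToyTable, insert and select are made up for this illustration.

```python
from bisect import insort
from collections import defaultdict


class ToyTable:
    """Toy model of PRIMARY KEY (a, b): a = partition key, b = clustering column."""

    def __init__(self):
        # partition key -> list of (b, c) tuples kept sorted by b
        self.partitions = defaultdict(list)

    def insert(self, a, b, c):
        rows = self.partitions[a]
        # Writing an existing (a, b) replaces the old row (upsert semantics).
        rows[:] = [row for row in rows if row[0] != b]
        insort(rows, (b, c))

    def select(self, a, b):
        # In spirit: SELECT * FROM test WHERE a = ? AND b = ?
        return [(a, rb, rc) for rb, rc in self.partitions.get(a, []) if rb == b]


tbl = ToyTable()
tbl.insert("test", 2, "test2")
tbl.insert("test", 1, "test1")
tbl.insert("test-new", 1, "test1")

print(dict(tbl.partitions))
print(tbl.select("test", 1))
```

Even though the row with b = 2 was inserted first, it ends up after b = 1 inside the "test" partition, which matches the order shown above.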

Thanks.

Tim
  • The partition key and clustering key concepts do exist in CQL. A wide row is nothing but the result of choosing a bad partition key. – undefined_variable Nov 29 '19 at 12:33
  • If a clustering key is not defined, the ORDER BY clause will not work in CQL; ORDER BY only works on clustering columns. Similarly, a WHERE clause is most efficient when it uses the partition key. – undefined_variable Nov 29 '19 at 12:34
  • Thanks. Could you be more specific? (Maybe write an answer?) – Tim Nov 29 '19 at 12:51
  • @undefined_variable Thanks. In your example, if two rows have different values of their partition keys, is it correct that they belong to different partitions, and different partitions mean different nodes or data stores? – Tim Nov 30 '19 at 14:05
  • Yes, a different partition key means the data belongs to a different partition. However, one node is responsible for many partitions, so a different partition doesn't necessarily mean a different node. – undefined_variable Dec 02 '19 at 07:13
  • Thanks. @undefined_variable What does a partition mean? Also what parts of what books address my questions? – Tim Dec 02 '19 at 12:36

2 Answers

CQL Schema

Based on your table schema as follows:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));

The primary key is made up of "a" and "b": "a" is the partition key and "b" is a clustering column. I think the following Stack Overflow post will address all your questions about what partition keys etc. are: Difference between partition key, composite key and clustering key in Cassandra?

Data files

Partitions, clustering columns, etc. are all present at the data file level (and therefore in the DB). Internally, this is understood by Cassandra's storage engine. Using your example, I created the table, flushed the keyspace and inspected the SSTable using sstabledump.

Note that you have to run the tool as the same user that Cassandra is running as (in my case, the cassandra user):

$ sudo -u cassandra sstabledump /var/lib/cassandra/data/mc/test-bedc4ba012cf11ea93f72f6848f9d70d/md-1-big-Data.db

[
  {
    "partition" : {
      "key" : [ "test" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:35.752796Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      },
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ 2 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:31.144961Z" },
        "cells" : [
          { "name" : "c", "value" : "test2" }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "test-new" ],
      "position" : 54
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 95,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-11-29T17:43:41.438779Z" },
        "cells" : [
          { "name" : "c", "value" : "test1" }
        ]
      }
    ]
  }
]

We can clearly see that the key "test" has two clustering rows of values "1" and "2" respectively.
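
If you want to check this programmatically rather than by eye, a small Python sketch like the following could summarise the dump. It assumes the JSON shape shown above; the sample is a trimmed copy of that output, and the helper name summarize is made up for this example.

```python
import json

# Trimmed copy of the sstabledump output above (only the fields used here).
dump = json.loads("""
[
  {"partition": {"key": ["test"], "position": 0},
   "rows": [
     {"type": "row", "clustering": [1], "cells": [{"name": "c", "value": "test1"}]},
     {"type": "row", "clustering": [2], "cells": [{"name": "c", "value": "test2"}]}
   ]},
  {"partition": {"key": ["test-new"], "position": 54},
   "rows": [
     {"type": "row", "clustering": [1], "cells": [{"name": "c", "value": "test1"}]}
   ]}
]
""")


def summarize(entries):
    """Map each partition key to the clustering values of its rows, in file order."""
    summary = {}
    for entry in entries:
        key = tuple(entry["partition"]["key"])
        summary[key] = [
            tuple(row["clustering"]) for row in entry["rows"] if row["type"] == "row"
        ]
    return summary


summary = summarize(dump)
for key, clusterings in summary.items():
    print(key, "->", clusterings)
```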

For a bit more background information on the Storage engine see: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlManageOndisk.html

Wide rows

This is not so much something you decide to use or implement; rather, it is a side effect of a bad data model. As an example, imagine you had a table like this:

CREATE TABLE mc.cars (
    owner_id int,
    car_reg text,
    owner_name text,
    price float,
    purchased date,
    PRIMARY KEY (owner_id, car_reg)
) WITH CLUSTERING ORDER BY (car_reg ASC);

While this model might be OK, imagine you then had a (lucky!) owner with over 1000 cars in their collection. Every car is a clustering row under the same owner_id partition, so aside from needing a large garage, that owner might also be the cause of a wide row. If, however, your table looked something like this:

CREATE TABLE mc.cars2 (
    owner_id int,
    car_reg text,
    owner_name text,
    price float,
    purchased date,
    PRIMARY KEY ((owner_id, car_reg))
);

you would be less likely to see a wide row, because the partition key is made up of the car reg number too.
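
To make the effect of the partition key choice concrete, here is a small Python sketch that counts rows per partition for the same data under two key designs. The data set (owner 42 with 1000 cars, plus ten owners with one car each) and the helper partition_sizes are invented for illustration; this models the grouping only, not Cassandra itself.

```python
from collections import Counter

# Hypothetical data: one owner (id 42) with 1000 cars, plus 10 owners with one car each.
cars = [(42, f"REG{i:04d}") for i in range(1000)]
cars += [(owner_id, f"ONE{owner_id:04d}") for owner_id in range(10)]


def partition_sizes(rows, partition_key):
    """Count how many rows land in each partition for a given partition-key function."""
    return Counter(partition_key(row) for row in rows)


# Partition key = owner_id only (car_reg is a clustering column).
by_owner = partition_sizes(cars, lambda row: row[0])
# Partition key = (owner_id, car_reg), i.e. PRIMARY KEY ((owner_id, car_reg)).
by_owner_and_reg = partition_sizes(cars, lambda row: (row[0], row[1]))

print(max(by_owner.values()))          # widest partition when keyed by owner
print(max(by_owner_and_reg.values()))  # widest partition when keyed by owner and reg
```

The trade-off is that with the car reg number in the partition key, an owner's cars are spread over many partitions, so they can no longer be read from a single partition.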

markc
  • Thanks. What I am asking is: when using cql to create table and insert rows, do users have to specify creation of partitions i.e. wide rows, clustering columns/keys, and partition keys? In the example of query statements, I don't see we have to. So I wonder if those concepts are just internal and not exposed to users of the query language. – Tim Nov 29 '19 at 20:19
  • @Tim You do specify the partition key and clustering columns etc.; this is part of the schema description. Wide rows are a side effect of how you design your schema and how the data populates the table. It's a side effect of an inefficient data model (i.e. how you model your data). Does that help? Would you like me to add a description of this to the answer? – markc Dec 02 '19 at 09:08
  • Thanks. I'd like to know more about "you specify the partition key and clustering columns etc" while I didn't see that in the example in my post, and "This is part of the schema description. Wide rows are a side effect of how you design your schema and how the data populates the table. Its a side effect of an inefficient data model". If you could add it, that would be great. Also do you know which parts of which books address these? – Tim Dec 02 '19 at 12:35
  • @Tim I expanded my answer above and added a ref to another well answered SO post – markc Dec 02 '19 at 17:30

Definitely - CQL syntax does have a notion of partition keys vs clustering keys. Just look at the example you provided:

CREATE TABLE test (a text, b int, c text, PRIMARY KEY (a, b));

The syntax (a,b) means, in CQL, that a is a partition key and b is a clustering key. As another example, if you were to write ((a,b,c),d,e,f) this would mean that a,b, and c are partition key columns, while d, e and f are clustering key columns. This is CQL syntax.
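
The rule can be written down as a tiny Python helper. This is only an illustrative sketch of the two shapes mentioned above, not a real CQL parser, and the function name split_primary_key is made up.

```python
def split_primary_key(spec):
    """Split a CQL PRIMARY KEY column list into (partition key columns, clustering columns).

    Handles the two shapes discussed above: "(a, b, ...)" and "((a, b), c, ...)".
    """
    inner = spec.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("expected a parenthesised column list")
    inner = inner[1:-1].strip()

    if inner.startswith("("):
        # Composite partition key: everything up to the matching ")".
        close = inner.index(")")
        partition = [col.strip() for col in inner[1:close].split(",")]
        rest = inner[close + 1:]
    else:
        # Simple partition key: the first column only.
        first, _, rest = inner.partition(",")
        partition = [first.strip()]

    clustering = [col.strip() for col in rest.split(",") if col.strip()]
    return partition, clustering


print(split_primary_key("(a,b)"))
print(split_primary_key("((a,b,c),d,e,f)"))
```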

What this means in practice, I assume you know. Among other things, you can ask for all the clustering rows belonging to a single partition in a known sort order. Partitions themselves, however, are not sorted by key: with the default partitioner, a full-table scan returns them in token order (the token being a hash of the partition key), which looks random.
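
A toy Python sketch of that behaviour follows. Cassandra places each partition by a token computed from its partition key (the default partitioner is Murmur3-based). Since Murmur3 is not in Python's standard library, the sketch uses MD5 purely as a stand-in hash; that substitution, the token helper and the sample rows are all assumptions made for illustration.

```python
import hashlib


def token(partition_key):
    # Stand-in token function (assumption): MD5 of the key, read as an integer.
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


# Rows as (partition key a, clustering key b), inserted in arbitrary order.
rows = [("test", 2), ("test-new", 1), ("apple", 3), ("test", 1), ("apple", 1)]

# Full scan: partitions come back in token order, rows inside a partition in clustering order.
scan = sorted(rows, key=lambda row: (token(row[0]), row[1]))

for a, b in scan:
    print(a, b)
```

The rows of each partition stay together and are sorted by b, while the order of the partitions themselves depends only on the hash, not on the alphabetical order of the keys.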

The term "wide row" is not used in CQL as a term, but the concept definitely exists, as I explained above: a "wide row" (actually, "wide partition" is more accurate) is what happens when a single partition has a lot of clustering rows, i.e., a lot of different clustering keys for the same partition key. Wide rows are supported decently in Cassandra, up to a limit (reading from really huge partitions can be slower, and various pieces of the code still handle them inefficiently). Some documents suggest that Cassandra partitions should ideally be up to 10MB in size.

Nadav Har'El