Howto avoid cassandra tombstones when inserting NULL values

Question

My problem is that cassandra creates tombstones when inserting NULL values.

From what I understand, cassandra doesn't support NULLs and when NULL is inserted it just deletes the respective column. On one hand this is very space effective, however on the other hand it creates tombstones which degrades read performance.

This goes agains NoSql phillosophy because cassandra is saving space but degrading read performance. In NoSql world the space is cheap, however performance matters. I beleive this is the phillosophy behind saving tables in denormalized form.

I would like cassandra to use the same technique for inserting NULL as for any other value - use timestamping and during compaction preserve the latest entry - even if the entry is NULL (or we can call it "unset"). Is there any tweak in cassandra config or any approach how I would be able to achieve upserts with nulls without having tombstones ?

I came across this issue however it only allows to ignore NULL values

My use case: I have stream of events, every event identified by causeID. I'm receiving many events with same causeId and I want to store only the latest event for the same causeID (using upsert). The properties of the event may change from NULL to specific value, but also from specific value to NULL. Unfortunatelly the later case generates tombstones and degrades read performance.

Update

It seems there is no way how I could avoid tombstones. Could you advice me on techniques how to minimize them (set gc_grace_seconds to very low value). What are the risks, what to do when a node goes down for a longer period than gc_grace_seconds ?

What does null means in the later case - when changing from specific value to null? — Alex Ott, Dec 27 '18 at 12:15
Possible duplicate of [Inserting null values into cassandra](https://stackoverflow.com/questions/40713693/inserting-null-values-into-cassandra) — Mehul Gupta, Dec 27 '18 at 13:08
Changing specific value to NULL simplified example: create table event (id text PRIMARY KEY, event text); insert into event ('1', 'specific value'); insert into event ('1', null); — Tomas Bartalos, Dec 27 '18 at 15:49

score 4 · Answer 1 · answered Dec 27 '18 at 14:02

4

You can't insert NULL into Cassandra - it has special meaning there, and lead to creation of tombstones that you observe. If you want to treat NULL as special value, why not to solve this problem on application side - when you get null status, just insert any special value that couldn't be used in your table, and when you read data back, check for that special value and output null to requester...

answered Dec 27 '18 at 14:02

Alex Ott

80,552
8
87
132

1

Thank you for your suggestion, I was thinking about this solution as well, however it seems to me like a workaround for something that should work more straightforward. I have 2 problems with this solution. 1) I'm storing a complicated nested data structure so checking all fields is a nightmare. 2) data stored in Cassandra is not read by an application, the data will be analysed by data scientists with analytic tools like tableau, zeppelin, notebook, ... – Tomas Bartalos Dec 27 '18 at 15:59
Can you show your table structure, and how you perform updates? – Alex Ott Dec 27 '18 at 16:04
Unfortunately I can't share the table structure, its a DDL with 200 lines. The table have many UDTs, it has 40 top level properties with type: text, boolean, UDT, timestamp, int, list. The data is streamed from Kafka queue, processed with Spark DStream, transformed to DataFrame and saved to C* – Tomas Bartalos Dec 27 '18 at 16:19
You can avoid tombstones inside nested structures if you’ll use frozen versions of them - if you stores data via Spark, then you are writing them completely, not updating parts of the table. Same for lists, maps, and sets – Alex Ott Dec 27 '18 at 16:21
I've tried to use frozen structures, when I set them to null they also create tombstones, and I don't know how to make update to empty UDT (I can do this for empty array). The interesting part is that when I set frozen UDT to null, the tracing info of cqlsh shows it didn't read any tombstone. But the sstabledump shows deleted marker... – Tomas Bartalos Dec 27 '18 at 16:32

Aman Kumar Gupta · Answer 2 · 2021-09-09T13:13:56.320

When we want to just insert or update rows using null for values that are not specified, and even though our intention is to leave the value empty, Cassandra represents it as a tombstone causing unnecessary overhead which degrades performance.

To avoid such tombstones for save operations, cassandra has the concept of unset for a parameter value.

So you can do the following to unset a field value while saving to avoid tombstone overhead for example related to different cases:

1). If you are using express-cassandra then :

const user = new models.instance.User({
    user_id: 1235,
    user_name: models.datatypes.unset // this will not create tombstone when we want empty user_name or null
});
user.save(function(err){
    // user_name value is not set and does not create any unnecessary tombstone overhead
});

2). If you are writing cassandra raw query then for empty or null field when you know say colC will be null, then don't use it in your query.

insert into my_table(id,colA,colB) values(idVal,valA,valB) // Avoid colC

3). If you are using Node.Js Driver, you can even pass undefined on insert or update which will avoid tombstone overhead. For example

const query = 'INSERT INTO my_table (id, colC) VALUES (?, ?)';
client.execute(query, [ id, undefined ]);

4). If you are using c# driver then

// Prepare once in your application lifetime
var ps = session.Prepare("INSERT INTO my_table (id, colC) VALUES (?, ?)");

// Bind the unset value in a prepared statement
session.Execute(ps.Bind(id, Unset.Value));

For more detail on express-cassandra read the sub topic Null and unset values of https://express-cassandra.readthedocs.io/en/latest/datatypes/#cassandra-to-javascript-datatypes

For more detail on Node.js driver unset feature refer datastax https://docs.datastax.com/en/developer/nodejs-driver/4.6/features/datatypes/nulls/

For more detail on Csharp driver unset feature refer datastax https://docs.datastax.com/en/developer/csharp-driver/3.16/features/datatypes/nulls-unset/

NOTE: I tested this on Node.js cassandra 4.0 But unset feature is introduced after cassandra 2.2

Hope this will help you or somebody else.

Thanks!

score 1 · Answer 3 · answered Dec 27 '18 at 11:12

1

You cannot avoid tombstones if you particularly mention NULL in your INSERT. C* does not do a lookup before insert or writing a data which makes the writes very faster. For this purpose, C* just inserts a tombstone to avoid that value later (taking the latest update comparing the timestamp). If you want to avoid tombstone (which is recommended), you've to prepare different combinations of queries to check each one for NULL before adding it to the INSERT. If you have very few fields to check then it'll be easy to just add some IF-ELSE statements. But if there are lots of them, the code will be bigger and less readable. Shortly, you cannot insert NULL which will impact read performance later.

Inserting null values into cassandra

answered Dec 27 '18 at 11:12

Chaity

1,348
13
20

1

You don't need to generate prepared statements with different combinations of columns - just use `unset` on the existing prepared statement as described here: https://docs.datastax.com/en/developer/java-driver/3.6/manual/statements/prepared/ – Alex Ott Dec 27 '18 at 14:04
1

The data is saved via Spark app, I can turn spark.cassandra.output.ignoreNulls=true, however this will not help me, because the nulls will be just ignored so the old column values will be preserved even when they should be deleted. – Tomas Bartalos Dec 27 '18 at 16:21
If you want to delete a row or a column (inserting NULL or using DELETE statement), tombstones will be generated. There is no other way. C* does not delete in place. If you want to use null values as NULL (unset or there is no current value for the column), you just use unset as @AlexOtt mentioned or cassandra.output.ignoreNulls=true as you mentioned earlier. As AlexOtt mentioned NULL has special meaning in C* and if you want to use it to delete the previous value then tombstone will be generated. If you want to delete a data willing to ignore tombstone, that's not possible in C*. – Chaity Dec 28 '18 at 06:47

score 0 · Answer 4 · answered Sep 27 '21 at 15:08

I don't think the other answers address the original question, which is how to overwrite a non-null value in Cassandra with null without creating a tombstone. The nearest is Alex Ott's suggestion to use some special value other than null.

However, with a little bit of trickery you can insert an explicit null into Cassandra by exploiting a FROZEN tuple or user-defined type. The FROZEN keyword effectively serialises the user defined type and stores the serialised representation in the column. Crucially, the serialised representation of a UDT containing null values is not itself null.

> CREATE TYPE test_type(value INT);
> CREATE TABLE test(pk INT, cl INT, data FROZEN<test_type>, PRIMARY KEY (pk, cl));
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: 15});
> INSERT INTO test (pk, cl, data) VALUES (0, 0, {value: null});
> INSERT INTO test (pk, cl) VALUES (0, 1);
> SELECT * FROM test;

 pk | cl | data
----+----+---------------
  0 |  0 | {value: null}
  0 |  1 |          null

(2 rows)

Here we wrote 15, then overwrote it with null, and finally added a second row to demonstrate that there is a difference between an unset cell and a cell containing a frozen UDT that itself contains null.

Of course the downside of this approach is that in your application you have to delve into the UDT for the actual value.

On the other hand, if you combine several columns into the UDT you do save a little overhead in Cassandra. (But you can't then read or write them individually. You also can't remove fields, though you can add new ones.)

Howto avoid cassandra tombstones when inserting NULL values

4 Answers4