
I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.

For example, suppose I have a many-to-many relationship between actors and fans: a fan can support many actors, and an actor can have many fans. I've made several tables to support my queries:

CREATE TABLE fans (
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY ((fan_id))
);

CREATE TABLE actors (
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY ((actor_id))
);

CREATE TABLE actors_by_fan (
    fan_id uuid,
    actor_id uuid,
    actor_attr_1 int,
    actor_attr_2 int,
    PRIMARY KEY (fan_id, actor_id)
);

CREATE TABLE fans_by_actor (
    actor_id uuid,
    fan_id uuid,
    fan_attr_1 int,
    fan_attr_2 int,
    PRIMARY KEY (actor_id, fan_id)
);

Let's say I'm a fan, I'm on my settings page, and I want to change my fan_attr_1 to a different value.

On the fans table I can update my attribute just fine since the application knows my fan_id and can key on that.

However, I cannot change my fan_attr_1 in fans_by_actor without first querying for the actor_ids tied to the fan.

This problem occurs any time you want to update any attribute of either fans or actors.
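
For concreteness, a minimal sketch of the two updates (the uuid and the value 42 are placeholders):

-- easy: fans is partitioned by fan_id, which the application knows
UPDATE fans SET fan_attr_1 = 42
WHERE fan_id = 123e4567-e89b-12d3-a456-426614174000;

-- hard: fans_by_actor is partitioned by actor_id, so every actor_id
-- tied to this fan must be known before its duplicated rows can be updated
-- (? is a bind marker, one statement per known actor_id)
UPDATE fans_by_actor SET fan_attr_1 = 42
WHERE actor_id = ? AND fan_id = 123e4567-e89b-12d3-a456-426614174000;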

I've tried looking online for people experiencing similar problems, but I couldn't find any. For example, DataStax's data modeling course uses an example with actors and videos in a many-to-many relationship, with tables actors_by_video and videos_by_actor. The course, like the other online resources I've consulted, discusses modeling tables after queries, but doesn't dig into how to maintain data integrity. In the actors_by_video table, what would happen if I wanted to change an actor's attribute? Wouldn't I have to go through every row of actors_by_video to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look up the video IDs beforehand, but I've read elsewhere that reads before writes are an antipattern in Cassandra.

What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?

EDIT: Fixed sentence stubs; added context and prior research.

James Maa
  • What queries are you going to make? You model your data around queries, not around their relations (it's not a relational DB). Might wanna check out https://academy.datastax.com/resources/ds220-data-modeling for some tutorials on data modeling in Cassandra – Chris Lohfink Dec 31 '16 at 04:27
  • Yeah, I actually took the course before asking here, but it doesn't seem to answer these questions. In their KillrVideo example, they use both a `videos_by_actor` and an `actors_by_videos` table, which is a paradigm similar to the one described above. However, they never discussed how those tables should be maintained should an actor attribute or a video attribute change – James Maa Dec 31 '16 at 07:18
  • This is really a good question. All data modeling tutorials on these NoSQL data stores seem to ignore the topic of how to maintain data integrity when you have duplicated data across different tables: http://cassandra.apache.org/doc/latest/cassandra/data_modeling/data_modeling_logical.html – Alan Nov 25 '21 at 17:34

3 Answers


Data Modeling

Cassandra is not a relational database, and there are certain basic rules that need to be followed in data modeling. At a high level, our data model should meet the following goals:

1) Spread data evenly around the cluster

2) Minimize the number of partitions read

Moreover, we should go for a single big table rather than breaking the data into multiple tables and adding relationships between them. With this approach, duplication of records will occur, but duplication is not a costly operation: it takes only a little more disk space, rather than CPU, memory, disk IOPS, or network.
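
As an illustration only (a hypothetical shape, not a prescription), the fan-side queries could be served by one wide, denormalized table that duplicates both entities' attributes; the actor-side lookup would still need its own table or view:

CREATE TABLE actors_with_attrs_by_fan (
    fan_id uuid,
    actor_id uuid,
    fan_attr_1 int,      -- fan attributes, duplicated per actor
    fan_attr_2 int,
    actor_attr_1 int,    -- actor attributes, duplicated per fan
    actor_attr_2 int,
    PRIMARY KEY ((fan_id), actor_id)
);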

Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64 KB. The maximum column value size is 2 GB, but because there is no streaming and the whole value is fetched into heap memory when requested, limit the size to only a few MB.

More Info:

http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/

https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html

CQL

Maintaining consistency across tables can be done using batches or materialized views. Materialized views are available from version 3.0.
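
For example, a logged batch applies both writes as a unit (a sketch; the bind markers would be filled by the application, which must already know the actor_id):

BEGIN BATCH
    UPDATE fans SET fan_attr_1 = 42 WHERE fan_id = ?;
    UPDATE fans_by_actor SET fan_attr_1 = 42 WHERE actor_id = ? AND fan_id = ?;
APPLY BATCH;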

Please see

How to ensure data consistency in Cassandra on different tables?

My preference would be to change the data model, designing it around our queries, and if possible make it a single big table.

Hope it Helps!

Clement Amarnath
  • I'm still not seeing the light here; maybe a concrete example will help. Let's say you're modeling the example described above. How would you structure the table to support the queries? – James Maa Dec 31 '16 at 07:20

Materialized views are probably the best choice. Note that a view can only select columns that exist in its base table, so this assumes base tables that carry the relationship columns, and every primary key column of the view must be restricted with IS NOT NULL:

CREATE MATERIALIZED VIEW actors_by_fan
AS SELECT fan_id, actor_id, actor_attr_1, actor_attr_2
FROM fans
WHERE fan_id IS NOT NULL AND actor_id IS NOT NULL
PRIMARY KEY (fan_id, actor_id);

CREATE MATERIALIZED VIEW fans_by_actor
AS SELECT actor_id, fan_id, fan_attr_1, fan_attr_2
FROM actors
WHERE actor_id IS NOT NULL AND fan_id IS NOT NULL
PRIMARY KEY (actor_id, fan_id);

In versions prior to 3.0, create secondary indexes and evaluate whether their performance is acceptable. Later, after upgrading to 3.x, just drop the secondary indexes and create materialized views.
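
A sketch of that pre-3.0 workaround (the index name is hypothetical): indexing the clustering column lets the application find a fan's rows without knowing the actors up front, at the cost of a scatter-gather query across the cluster:

-- secondary index on a clustering column (supported since Cassandra 2.1)
CREATE INDEX fans_by_actor_fan_idx ON fans_by_actor (fan_id);

-- the partitions to update can now be located by fan alone
SELECT actor_id FROM fans_by_actor WHERE fan_id = ?;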

Brad Schoening
  • Thanks for the interesting approach. However, we use an older version of Cassandra (2.2.7) that doesn't support materialized views :/ I can't imagine I'm the only person facing this issue; I wonder how other SWEs solve it. Secondary indices would be an option, but they're not suited for data with high cardinality. – James Maa Dec 31 '16 at 19:48
  • MVs have since been shown to be a leaky abstraction: https://www.mail-archive.com/dev@cassandra.apache.org/msg11511.html Please check before use – Alex Punnen Jul 23 '18 at 06:14

The way you solve these kinds of problems is to manually update all the changed records.

Since you can't use materialized views, in order to update fan_attr_1 on your data you need to:

  1. Update the fans table by issuing UPDATE fans ... WHERE fan_id = xxx.
  2. Select all the actor_ids from actors_by_fan by issuing SELECT actor_id ... WHERE fan_id = xxx.
  3. Update all the corresponding rows in the fans_by_actor table by issuing UPDATE fans_by_actor ... WHERE actor_id IN (...), or alternatively loop over the actor_ids and run each update async.

As long as you have a small number of actor_ids in step 2, say fewer than 20, you can group all the queries and maintain strong consistency between the tables by running them in a single BATCH, as in the sketch below. Otherwise, you need to guarantee the consistency between tables in some other way.
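
A minimal sketch of that flow (the value 42 is a placeholder, ? marks bind parameters, and the batch contains one fans_by_actor statement per actor_id returned by step 2):

-- step 2: one read to find the affected partitions
SELECT actor_id FROM actors_by_fan WHERE fan_id = ?;

-- steps 1 and 3: all writes applied together as a logged batch
BEGIN BATCH
    UPDATE fans SET fan_attr_1 = 42 WHERE fan_id = ?;
    UPDATE fans_by_actor SET fan_attr_1 = 42 WHERE actor_id = ? AND fan_id = ?;
    UPDATE fans_by_actor SET fan_attr_1 = 42 WHERE actor_id = ? AND fan_id = ?;
APPLY BATCH;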

This can be as inefficient as it sounds, but I don't think there are smarter solutions. By the way, you are issuing one read (step 2) and multiple writes (steps 1 and 3). This won't be the end of the world, especially if you don't change attributes very often (e.g. every 10 milliseconds).

xmas79