2

I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:

create table inst_samples (
    machine_id text,
    batch_id int,
    sample_time timestamp,
    var1 double,
    var2 double,
    .....
    varN double,
    PRIMARY KEY ((machine_id, batch_id), sample_time)
);

There are approximately 20 machines with 400 state variables each and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions here What are the maximum number of columns allowed in Cassandra and here Cassandra has a limit of 2 billion cells per partition, but what's a partition?

If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?

(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?

I am a total Cassandra newbie. Thanks.

steeljav
  • 23
  • 2

1 Answers1

2

This 2 billion limit is for partition, and if you have good data model, you should have many partitions. In practice, it's recommended to keep number of cells per partition under control - something like, not more 100,000 cells per partition, otherwise there could be some performance problems, etc. But the actual limit depends on the multiple factors, like Cassandra version, what queries are executed, etc.

In your case, we have partition key of machine_id + batch_id, and that gives us for batch size of 2 hours: 400x7200 = 2880000 - almost 3 million cells. It may still work (would be better if you set batch size to 1 hour), but will require testing on real hardware - this could be done for example, with NoSQLBench.

There are also other ways to optimize your data model - for example, instead of allocating a separate column for every variable, just use frozen<map<text, double>> - in this case, all measurements will be stored as a single cell. The drawback of it - you can't change the individual values without reading the map & inserting it with changed value. Another drawback is that you'll need to read all measurements at once - but this could be ok.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132