
We are running Cassandra 3.0.16 on a cluster of i3.2xlarge instances in AWS. The volumes that store data are encrypted with LUKS. We are running a job that needs to read 3 TB of data from two tables by issuing individual queries on single record keys. Watching the CloudWatch IO metrics for one of the Cassandra instances, it looks like Cassandra will have read thousands of terabytes before the job finishes. This is making the job take roughly 6x longer than expected.

We have fully compacted the two tables being read, which only improved performance by about 10%. We have also ruled out encryption as the cause: a cluster whose volumes are not encrypted shows the same slow performance.

Are there any Cassandra configuration settings that can be tuned to reduce excessive IO?
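To make the question concrete, the kind of per-table settings we are asking about would be adjusted roughly as shown below; the keyspace and table names are placeholders and the values are illustrative only, not something we have already tuned.

    -- Placeholder keyspace/table; values are illustrative, not recommendations.
    -- Smaller compression chunks reduce how much data each point read pulls from
    -- disk (Cassandra 3.0 defaults to 64 KB chunks).
    ALTER TABLE my_ks.table1
        WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};

    -- A lower bloom filter false-positive chance avoids touching SSTables that
    -- cannot contain the requested key, at the cost of more heap for the filters.
    ALTER TABLE my_ks.table1
        WITH bloom_filter_fp_chance = 0.01;

As far as we understand, a compression change only affects SSTables written afterwards; existing SSTables keep the old chunk size until they are rewritten (for example with nodetool upgradesstables -a).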

  • Is the IO high only on one Cassandra instance, or do all nodes have the same issue? – Laxmikant Mar 14 '19 at 05:24
  • Could you also share your table definitions and the queries you run? – Mandraenke Mar 14 '19 at 08:11
  • Laxmikant, IO is high on all of the Cassandra instances while the queries are running. – Glen Ireland Mar 14 '19 at 19:13
  • Mandraenke, we have two tables. The first table has a single varchar column as its PK and 7 other columns using set, timestamp, boolean, and varchar data types. We query the first table 24 billion times with a single PK value during our job. The second table has a seven-column composite PK and 7 other columns using decimal, timestamp, boolean, and varchar data types. We query the second table 300 million times by specifying a single bigint value for the first column of the PK during our job. Sorry, Stack Overflow would not let me paste the actual structures; a schematic version is sketched after these comments. – Glen Ireland Mar 14 '19 at 19:59
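
A schematic version of the two tables and query patterns described in the comments might look roughly like this; all names and the exact column layout are placeholders, since the real definitions are not shown in the question.

    -- Table 1: single varchar partition key plus 7 other columns
    -- (set, timestamp, boolean, varchar); queried ~24 billion times by PK.
    CREATE TABLE my_ks.table1 (
        record_key varchar PRIMARY KEY,
        tags       set<varchar>,
        created_at timestamp,
        active     boolean,
        field_a    varchar,
        field_b    varchar,
        field_c    varchar,
        field_d    varchar
    );

    SELECT * FROM my_ks.table1 WHERE record_key = 'example-key';

    -- Table 2: seven-column composite PK whose first column is a bigint,
    -- plus 7 other columns (decimal, timestamp, boolean, varchar);
    -- queried ~300 million times by the first PK column only.
    CREATE TABLE my_ks.table2 (
        id         bigint,
        k2         varchar,
        k3         varchar,
        k4         varchar,
        k5         varchar,
        k6         varchar,
        k7         varchar,
        amount     decimal,
        updated_at timestamp,
        flag       boolean,
        field_a    varchar,
        field_b    varchar,
        field_c    varchar,
        field_d    varchar,
        PRIMARY KEY (id, k2, k3, k4, k5, k6, k7)
    );

    SELECT * FROM my_ks.table2 WHERE id = 12345;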

0 Answers