Much has been written about how Cassandra's redundancy provides good performance for thousands of incoming requests from different locations, but I haven't found anything on the throughput of a single big request. That's what this question is about.

I am assessing Apache Cassandra's potential as a database solution to the following problem:

The client would be a single-server application with exclusive access to the Cassandra database, co-located in the same datacentre. The Cassandra cluster might be a few nodes, but likely not more than 5.

When a certain feature runs on the application (triggered occasionally by a human), it will populate Cassandra with up to 5M records representing short arrays of float data, as well as delete such records. The records will never be updated and we never need to access individual elements of an array. The arrays can be of different lengths, but will typically have around 100 elements, and each row might contain 0-20 arrays.

For example:

id   array1                  array2
123  [1.0, 2.5, ..., 10.8]   [0.0, 0.5, ..., 1.0]

Bonus question: Should I use a list<double> to represent each array, or should I serialize the arrays to JSON?
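
For illustration, a minimal sketch of the list<double> option with the DataStax Python driver (keyspace, table name, and contact point are made-up placeholders; the C# driver would be analogous):

    from cassandra.cluster import Cluster

    # Contact point and keyspace are placeholders.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")

    # Option 1: store each array as a native CQL collection. Since the
    # rows are never updated, the lists can be frozen.
    session.execute("""
        CREATE TABLE IF NOT EXISTS float_arrays (
            id     bigint PRIMARY KEY,
            array1 frozen<list<double>>,
            array2 frozen<list<double>>
        )
    """)

    # Option 2 would store each array as JSON text instead:
    #     array1 text, array2 text
    # trading native typing for client-side parsing.

    session.execute(
        "INSERT INTO float_arrays (id, array1, array2) VALUES (%s, %s, %s)",
        (123, [1.0, 2.5, 10.8], [0.0, 0.5, 1.0]),
    )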

At some point the user requests a report and the server should read all 5M records, interpret the arrays, do some aggregation, and plot some data on the screen. Might the read operation take <1s, <10s, <100s? How can I estimate the throughput in this case, assuming it is the bottleneck?
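
For scale, a rough envelope under illustrative assumptions: at ~2 arrays of ~100 doubles per row, the raw float payload is about 5M × 2 × 100 × 8 B ≈ 8 GB, so at an assumed sustained read throughput of 100 MB/s a single stream would need around 80 s for the transfer alone. These numbers are made up to show the kind of estimate I'm after, not measurements.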

Alejandro
  • It’ll perform terribly. This is not a good use case for Cassandra. – Aaron Feb 12 '22 at 11:14
  • As Aaron mentioned, this will not be a good use case for Cassandra; I would test it out on a document-oriented NoSQL like Mongo or Couchbase. For the reporting part, you can also consider Couchbase Analytics, which has an MPP engine. The other reason for choosing a document-oriented NoSQL is that you can index the arrays if required, although this will need careful data modelling. – Rajib Deb Feb 12 '22 at 20:58
  • @Aaron A while ago you answered a question about querying different partitions separately using an async "future". Could that apply here (if someone insisted on architecting a system around this)? https://stackoverflow.com/questions/36690811/what-is-the-best-way-to-read-data-from-cassandra-in-parallel – Alejandro Feb 14 '22 at 20:06
  • So I've thought about this question for a while now. You know, if you could find a good number of threads to process concurrently (not overwhelming Cassandra, but not taking forever), it might be ok. It's worth trying, for sure. – Aaron Feb 16 '22 at 16:44

1 Answer


Let me start with your second use case. Because your data is distributed across the nodes, if you run a broad range query without narrowing it down to a partition, Cassandra is going to perform slowly.

Cassandra is well suited to querying and searching when you know the partition key.

  • Even though you have only 5M records, assuming they are scattered across 5 different nodes, Cassandra has to go through all the nodes and aggregate the data for your reporting use case. Eventually the query times out.

  • This specific use case is not viable in Cassandra as a single broad query, but if you aggregate in your service and make multiple calls, one per partition or bucket, it is going to perform super fast (see the sketch below).

    Generally, the access pattern matters; reads win or lose on it. The data can be stored in almost any form, but reading it wisely is what matters to Cassandra. That answers your second part. Thank you.
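
A minimal sketch of that pattern with the DataStax Python driver, assuming the rows are spread across a synthetic bucket partition key (the bucket count, table, and column names are assumptions for illustration):

    from cassandra.cluster import Cluster

    NUM_BUCKETS = 64  # assumed number of synthetic buckets

    cluster = Cluster(["127.0.0.1"])  # placeholder contact point
    session = cluster.connect("my_keyspace")

    # Assumes a table partitioned by a synthetic bucket column, e.g.
    #   CREATE TABLE float_arrays_by_bucket (
    #       bucket int, id bigint, array1 frozen<list<double>>,
    #       PRIMARY KEY (bucket, id));
    select = session.prepare(
        "SELECT array1 FROM float_arrays_by_bucket WHERE bucket = ?"
    )

    # One async request per partition, so the reads run concurrently.
    # In production you would cap the number of in-flight requests.
    futures = [session.execute_async(select, (b,)) for b in range(NUM_BUCKETS)]

    # Aggregate in the service instead of asking Cassandra to do it.
    total, count = 0.0, 0
    for future in futures:
        for row in future.result():  # blocks until that bucket's rows arrive
            if row.array1:
                total += sum(row.array1)
                count += len(row.array1)

    print("mean element:", total / count if count else 0.0)

Whether this meets a <10s target depends on bucket sizing and on how much concurrency the cluster can absorb, which you would have to measure.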

Imran
  • I am not sure what you mean. Are you suggesting that, in order to query faster, the client/service should know which nodes the partition key (id in my example) maps to, so that I can pre-partition my query in the service? I can see that as a workaround for an existing system, but I won't be designing one around that. I don't even see how to do that in Datastax's C# API. – Alejandro Feb 14 '22 at 19:16