Much has been written about how Cassandra's redundancy provides good performance for thousands of incoming requests from different locations, but I haven't found anything on the throughput of a single big request. That's what this question is about.
I am assessing Apache Cassandra's potential as a database solution to the following problem:
The client would be a single-server application with exclusive access to the Cassandra database, co-located in the same datacentre. The Cassandra cluster might consist of a few nodes, but likely not more than 5.
When a certain feature of the application runs (triggered occasionally by a human), it will populate Cassandra with up to 5M records representing short arrays of float data, and will also delete such records. The records will never be updated and we never need to access individual elements of an array. The arrays can have different lengths, but will typically have around 100 elements, and each row might hold 0-20 arrays.
For example:
    id  | array1                | array2
    ----+-----------------------+---------------------
    123 | [1.0, 2.5, ..., 10.8] | [0.0, 0.5, ..., 1.0]
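In CQL terms I currently picture something like the sketch below, using the DataStax Python driver (the contact point, keyspace, table and column names are placeholders, and SimpleStrategy with replication factor 3 is just a guess for a small single-datacentre cluster):

    from cassandra.cluster import Cluster

    # Placeholder contact point; in practice the app and the cluster
    # share a datacentre.
    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # Replication settings are a guess for a <=5 node, single-DC cluster.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS reporting
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)

    # One list<double> column per array, mirroring the example above.
    session.execute("""
        CREATE TABLE IF NOT EXISTS reporting.float_arrays (
            id     bigint PRIMARY KEY,
            array1 list<double>,
            array2 list<double>
            -- ... up to ~20 such columns per row
        )
    """)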
Bonus question: Should I use a list<double> column to represent each array, or should I serialize the arrays to JSON?
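The JSON alternative I have in mind would look roughly like this (again only a sketch; the table name is a placeholder and the sample values are made up):

    import json
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("reporting")

    # Alternative representation: a single text column holding all of a
    # row's arrays serialized as JSON, instead of one list<double> per array.
    session.execute("""
        CREATE TABLE IF NOT EXISTS float_arrays_json (
            id     bigint PRIMARY KEY,
            arrays text
        )
    """)

    insert = session.prepare(
        "INSERT INTO float_arrays_json (id, arrays) VALUES (?, ?)")
    session.execute(insert, (123, json.dumps(
        {"array1": [1.0, 2.5, 10.8], "array2": [0.0, 0.5, 1.0]})))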
At some point the user requests a report and the server has to read all 5M records, interpret the arrays, do some aggregation, and plot some data on the screen. Might the read operation take <1 s, <10 s, or <100 s? How can I estimate the throughput in this case, assuming the read from Cassandra is the bottleneck?
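My own back-of-the-envelope attempt so far is below; every number in it (arrays per row, ignoring Cassandra's per-cell overhead, and the effective rate a single client can stream from the cluster) is a guess:

    # Rough size of the data set: 5M rows, ~10 arrays/row, ~100 doubles/array.
    rows = 5_000_000
    arrays_per_row = 10           # somewhere in the stated 0-20 range
    elements_per_array = 100
    bytes_per_element = 8         # raw double, ignoring per-cell overhead

    raw_bytes = rows * arrays_per_row * elements_per_array * bytes_per_element
    print(f"raw payload: {raw_bytes / 1e9:.0f} GB")            # ~40 GB

    # If a single client can stream ~100 MB/s from the cluster (a guess),
    # the transfer alone would take:
    effective_bytes_per_s = 100e6
    print(f"transfer time: {raw_bytes / effective_bytes_per_s:.0f} s")  # ~400 s

Is that a sensible way to estimate it, or do other costs (coordinator work, paging, deserializing the collections) dominate at this scale?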