Background: we are using Cassandra to store time series data, and we access it with prepared statements.

We are partitioning data in tables by:

  • time period (like one week or one month) and
  • retention policy (like 1 year, 5 or 10 years)

Because the data lives in different tables, we need to prepare (only upon use) a different statement for every combination of query, time period and retention policy, so the number of prepared statements will explode. Some math:

timePeriods = 12..52 * yearsOfData
maxNumOfPrepStatements = timePeriods * policies * numOfQueries

ourCase => (20 periods/year * 10 years) * 10 policies * 10 queries = 20,000 prepared statements
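
To make the estimate easy to replay with other inputs, here is a minimal sketch of the same arithmetic. The class and method names are made up for illustration; the inputs are the ones from the formula above, plus the 12 and 52 periods-per-year bounds.

```java
// Hypothetical helper: replays the back-of-the-envelope estimate from the question.
class StatementCountEstimate {

    // timePeriods = periodsPerYear * years; total = timePeriods * policies * queries
    static long maxPreparedStatements(int periodsPerYear, int years, int policies, int queries) {
        return (long) periodsPerYear * years * policies * queries;
    }

    public static void main(String[] args) {
        // Lower bound (monthly tables), our case, and upper bound (weekly tables).
        for (int periodsPerYear : new int[] {12, 20, 52}) {
            System.out.println(periodsPerYear + " periods/year -> "
                    + maxPreparedStatements(periodsPerYear, 10, 10, 10) + " statements");
        }
    }
}
```

With 10 years, 10 policies and 10 queries, the count ranges from 12,000 (monthly tables) to 52,000 (weekly tables), with our case at 20,000.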

On the client side I can cache only the most used PS, but I could not find a way to remove the unused ones from the server, so I am worried that about 20,000 prepared statements could be a significant cost for every node.
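
For reference, the client-side part could look roughly like the sketch below: a small LRU cache keyed by the CQL string. This is only an illustration under assumptions. `LruStatementCache` is a made-up name, the type parameter `V` stands in for the driver's prepared-statement type, and the `preparer` function stands in for whatever call actually prepares the statement on the session.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical client-side LRU cache; V stands in for the driver's prepared-statement type.
class LruStatementCache<V> {
    private final int maxEntries;
    private final Function<String, V> preparer; // assumed wrapper around the driver's prepare call
    private final LinkedHashMap<String, V> cache;

    LruStatementCache(int maxEntries, Function<String, V> preparer) {
        this.maxEntries = maxEntries;
        this.preparer = preparer;
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Drop the least recently used entry once the limit is exceeded.
                return size() > LruStatementCache.this.maxEntries;
            }
        };
    }

    // Returns the cached statement, preparing it first if it is not cached.
    synchronized V get(String cql) {
        V statement = cache.get(cql);
        if (statement == null) {
            statement = preparer.apply(cql);
            cache.put(cql, statement);
        }
        return statement;
    }

    synchronized boolean contains(String cql) {
        return cache.containsKey(cql);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

This only bounds what the client holds. An entry dropped here is not removed from the server, which is exactly the part I could not find a way to control.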

Problem: will this number of PS cause any problem on the server?

This breaks into smaller questions:

  • How much will be the server side cost of those prepared statements?
  • Will the server keep all the PS or will it remove the less used ones?
  • Is there a better solution than restarting Cassandra nodes to clean the PS cache?
  • Using the Java client, will closing the Session / Cluster object alleviate this (server side)?
RobMcZag

1 Answer

How much will be the server side cost of those prepared statements?

Each prepared statement is parsed and then stored in a cache, using its MD5 digest as the key. When a client re-registers an identical statement, the server has to match the MD5 digest against the statements it already holds, so repeated prepares should be avoided. To execute an already registered statement, the client sends the MD5 along with the query arguments; the server retrieves the cached statement by that MD5, which is faster than parsing a regular CQL statement. Each cached statement also consumes Java heap, roughly the size of the MD5 key plus the in-memory representation of the statement object.
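
The digest-as-key idea can be sketched as follows. This is a simplified illustration, not Cassandra's code: `DigestKeyedCache` is a made-up class, the digest is computed over the raw query text only (the server may include more context), and the stored value is just the query string standing in for the parsed statement object.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a statement cache keyed by the MD5 digest of the query.
class DigestKeyedCache {
    // Value is the raw CQL here; on a real server it would be the parsed statement.
    private final Map<String, String> statements = new HashMap<>();

    // Hex-encoded MD5 of the query text (assumption: the text alone is the digest input).
    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // "Prepare": identical queries map to the same id, so only one entry is stored.
    String prepare(String cql) {
        String id = md5Hex(cql);
        statements.putIfAbsent(id, cql);
        return id;
    }

    // "Execute": look the statement up by id instead of parsing the query again.
    String lookup(String id) {
        return statements.get(id);
    }

    int size() {
        return statements.size();
    }
}
```

Preparing the same query twice returns the same id and leaves a single entry in the map, which is why re-registering identical statements does not grow the cache.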

Will the server keep all the PS or will it remove the less used ones?

Prepared statements are managed by the server in a cache based on ConcurrentLinkedHashMap. The cache's capacity depends on the available memory: Runtime.getRuntime().maxMemory() / 256. Entries are also weighted by their memory usage, and large statements will be evicted first once the capacity has been reached. You can monitor this behavior using the org.apache.cassandra.metrics.CQL.PreparedStatementsEvicted JMX metric.
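
To get a feel for what that capacity means, here is a rough sketch of the arithmetic. The division by 256 is the formula quoted above; the class name, the method names and above all the average statement size are assumptions, since the real per-statement footprint depends on the query and has to be measured.

```java
// Hypothetical sizing helper based on the maxMemory() / 256 formula quoted above.
class PreparedCacheBudget {

    // Cache capacity in bytes for a given maximum heap size.
    static long capacityBytes(long maxHeapBytes) {
        return maxHeapBytes / 256;
    }

    // How many statements fit, given an ASSUMED average in-memory size per statement.
    static long statementsThatFit(long capacityBytes, long avgStatementBytes) {
        return capacityBytes / avgStatementBytes;
    }

    public static void main(String[] args) {
        long capacity = capacityBytes(Runtime.getRuntime().maxMemory());
        long assumedAvgBytes = 10 * 1024; // assumption for illustration only: 10 KiB per statement
        System.out.println("capacity (bytes): " + capacity);
        System.out.println("statements that fit: " + statementsThatFit(capacity, assumedAvgBytes));
    }
}
```

For example, an 8 GiB heap gives a capacity of 32 MiB. Whether 20,000 statements fit into that depends entirely on their actual size, so the eviction metric mentioned above is the thing to watch.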

Is there a better solution than restarting Cassandra nodes to clean the PS cache?

Not that I'm aware of. I'm also not really sure why you'd want to do that, as identical queries always produce identical MD5 digests. Note also that the Java client will automatically re-register prepared statements that cannot be found on the server, e.g. because they have been evicted from the cache (see also this answer).

Using the Java client, will closing the Session / Cluster object alleviate this (server side)?

I don't think so. The server would have to keep track of which statements have been registered by each of the potentially hundreds of clients in order to clean them up safely.

Stefan Podkowinski
  • Thank you @stefan-podkowinski, this clarified much of what happens behind the scenes. It looks like we cannot do much more than trust the server to keep a reasonable number of PS. We will definitely need some testing. Any clue whether our expected 20,000 PS could be a heavy load? BTW, the idea of restarting was connected to keeping only the most recent PS on the clients, so only those would be sent back. – RobMcZag Nov 10 '15 at 09:11