I have an application that reads a series of XML files containing logs of vehicle passages on a road. The application processes each record, transforms some of the fields to match the database columns, and inserts the result into a Cassandra database (a single node running on a remote server; it's on an internal network, so connectivity isn't really an issue). After inserting the data, the process for each file reads it back and produces information for summary tables, which leaves the data ready for a drill-down analysis done in an unrelated part of the application.
I'm using multiprocessing to process many XML files in parallel, and the trouble I'm having is with communicating with the Cassandra server. Schematically, each process goes as follows:
1. Read a record from the XML file
2. Process the record's data
3. Insert the processed data into the database (using `.execute_async(query)`)
4. Repeat 1 to 3 until the XML file is over
5. Wait for the responses of all the insert queries I made
6. Read data from the database
7. Process the read data
8. Insert the processed data into summary tables
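For concreteness, a minimal sketch of what each worker process does. The keyspace/table names and the helpers `parse_records`, `to_row` and `build_summaries` are hypothetical stand-ins for my actual parsing and aggregation code:

```python
from cassandra.cluster import Cluster

def process_file(path, equip, day_start, day_end):
    # Each process opens its own cluster/session (sessions can't be
    # shared across process boundaries).
    cluster = Cluster(['10.0.0.1'])
    session = cluster.connect('pnct')

    insert = session.prepare(
        "INSERT INTO passagens (equipamento, ts, faixa, velocidade) "
        "VALUES (?, ?, ?, ?)")

    # Steps 1-4: fire one async insert per record in the file.
    futures = [session.execute_async(insert, to_row(r))
               for r in parse_records(path)]

    # Step 5: block until every insert is acknowledged (or raises).
    for f in futures:
        f.result()

    # Step 6: the synchronous read that times out under write load.
    rows = session.execute(
        "SELECT * FROM passagens WHERE equipamento = %s "
        "AND ts >= %s AND ts < %s",
        (equip, day_start, day_end))

    # Steps 7-8: aggregate and write the summary tables.
    build_summaries(session, rows)
```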
Now, this runs smoothly in multiple parallel processes until one process reaches step 6. That request (made with `.execute(query)`, meaning I wait for the response) always hits a timeout. The error I receive is:
```
Process ProcessoImportacaoPNCT-1:
Traceback (most recent call last):
  File "C:\Users\Lucas\Miniconda\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\ImportacaoArquivosPNCT.py", line 231, in run
    core.CalculoIndicadoresPNCT.processa_equipamento(sessao_cassandra, equipamento, data, sentido, faixa)
  File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\CalculoIndicadoresPNCT.py", line 336, in processa_equipamento
    desvio_medias(sessao_cassandra, equipamento, data_referencia, sentido, faixa)
  File "C:\Users\Lucas\PycharmProjects\novo_importador\app\core\CalculoIndicadoresPNCT.py", line 206, in desvio_medias
    veiculos = sessao_cassandra.execute(sql_pronto)
  File "C:\Users\Lucas\Miniconda\lib\site-packages\cassandra\cluster.py", line 1594, in execute
    result = future.result(timeout)
  File "C:\Users\Lucas\Miniconda\lib\site-packages\cassandra\cluster.py", line 3296, in result
    raise self._final_exception
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
```
I have raised the timeout on the server to absurd amounts (500,000,000 ms, for instance), and I have also tried setting the timeout limit on the client with `.execute(query, timeout=3000)`, but still no success.
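For reference, this is roughly what I tried on the client side; the server-side change was to the request timeouts in cassandra.yaml (e.g. `read_request_timeout_in_ms`). Note the driver-side timeout is in seconds, not milliseconds:

```python
# Client-side: raise the driver timeout, both session-wide and per query.
session.default_timeout = 3000                        # session default, in seconds
veiculos = session.execute(sql_pronto, timeout=3000)  # per-query override
```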
Now, when more processes hit the same problem and the intense writing from steps 1-3 stops across the board, the last processes to reach step 6 succeed. This makes me think Cassandra is giving priority to the tens of thousands of insert requests I'm issuing per second, and either ignoring my read request or pushing it to the back of the line.
One way to solve this, in my opinion, would be to ask Cassandra to give priority to my read request so that I can keep processing, even if that means slowing down the other processes.
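To make the question concrete: if prioritization isn't possible, the only mitigation I can think of short of running fewer processes is bounding how many inserts each process keeps in flight, so the coordinator isn't buried when the read arrives. A rough sketch (`MAX_IN_FLIGHT` is an assumed cap and `insert_throttled` a hypothetical helper, not driver API):

```python
import threading

MAX_IN_FLIGHT = 256                     # assumed cap; would need tuning
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def insert_throttled(session, statement, params):
    slots.acquire()                     # blocks once the cap is reached
    future = session.execute_async(statement, params)
    # Release the slot whether the insert succeeds or fails.
    future.add_callbacks(callback=lambda _: slots.release(),
                         errback=lambda _: slots.release())
    return future
```

But that trades insert throughput for headroom, which is exactly what I'd rather avoid; hence the question.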
Now, as a side note, you might think my process modelling is not optimal, and I'd love to hear opinions on that, but for the reality of this application this is, in our view, the best way to proceed. We have thought extensively about optimising the process, and (if the Cassandra server can handle it) this is optimal for our reality.
So, TL;DR: Is there a way to give priority to a query when executing tens of thousands of asynchronous queries? If not, is there a way to execute tens of thousands of insert queries and read queries per second without the requests timing out? Additionally, what would you suggest I do to solve the problem? Running fewer processes in parallel is obviously a solution, but one I'm trying to avoid. I'd love to hear everyone's thoughts.
Storing the data in memory while inserting, so I don't need to read it back for the summaries, is not a possibility because the XML files are huge and memory is an issue.