I am using a dev Cassandra cluster (4 nodes, each with 4 cores, 6 GB RAM, and 30-40 GB of NVMe/SSD storage on different SANs).

The connection code (.NET Core 3.1) is below; the class is registered in DI as a singleton:

    _cluster = Cluster.Builder()
        .AddContactPoints(cassandraHost.ToString().Split(','))
        // .WithLoadBalancingPolicy(new RoundRobinPolicy())
        .WithPoolingOptions(new PoolingOptions().SetHeartBeatInterval(50000))
        .WithReconnectionPolicy(new ConstantReconnectionPolicy(1000))
        .Build();
    _sessions = new ConcurrentDictionary<string, ISession>();

And to get a session:

    if (!_sessions.ContainsKey(keyspaceName))
    {
        // var session = _cluster.Connect(keyspaceName);
        _sessions.GetOrAdd(keyspaceName, _cluster.Connect(keyspaceName));
    }
    // _sessions.GetOrAdd(keyspaceName, key => new Lazy<ISession>(() =>
    //     _cluster.Connect(key)));

    var result = _sessions[keyspaceName];

    return result;
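
For reference, here is a minimal sketch of the `Lazy<ISession>` variant that the commented-out lines hint at. The class name `SessionCache` and the constructor injection of `ICluster` are assumptions made for illustration, not part of the original code. The idea is that `GetOrAdd` only stores a cheap `Lazy` wrapper, so `_cluster.Connect` should run at most once per keyspace even when many threads ask for the same keyspace at the same time.

```csharp
using System;
using System.Collections.Concurrent;
using Cassandra;

// Hypothetical wrapper class; the name and shape are assumptions for illustration.
public sealed class SessionCache
{
    private readonly ICluster _cluster;

    // Store Lazy<ISession> rather than ISession, so racing GetOrAdd calls
    // only create throwaway Lazy wrappers and never open extra sessions.
    private readonly ConcurrentDictionary<string, Lazy<ISession>> _sessions =
        new ConcurrentDictionary<string, Lazy<ISession>>();

    public SessionCache(ICluster cluster)
    {
        _cluster = cluster ?? throw new ArgumentNullException(nameof(cluster));
    }

    public ISession GetSession(string keyspaceName)
    {
        // The value factory may run more than once under contention, but only
        // one Lazy instance is stored, and only that instance ever connects.
        var lazy = _sessions.GetOrAdd(
            keyspaceName,
            key => new Lazy<ISession>(() => _cluster.Connect(key)));

        // Lazy<T> defaults to LazyThreadSafetyMode.ExecutionAndPublication,
        // so Connect is invoked by a single thread and the result is shared.
        return lazy.Value;
    }
}
```

One caveat with this sketch: in the default `Lazy<T>` mode, an exception thrown by `Connect` is cached and rethrown on every later access, so a failed first connection would need the dictionary entry to be removed before retrying.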

The error I see after 3-30 minutes:

Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.30.200.4:9042; 10.30.200.2:9042; ...), see Errors property for more info

Interesting twist: the same code works fine in production, but I cannot make it work in development.

In terms of load, there is a read every 0.2 s (~1500 records) and around 1000 inserts per second (dev).

I can't figure out why the code crashes on my dev machine after a random number of minutes.

The nodes are accessible and working fine at the time of the app crash (as soon as I restart the app, it starts working again).

I have been trying to crack this for about a week now and still can't figure it out :/

Thanks in advance for any direction.

Here is the log (Serilog => Sentry).

Cassandra.NoHostAvailableException: All hosts tried for query failed (tried 10.30.200.1:9042: OperationTimedOutException 'The host 10.30.200.1:9042 did not reply before timeout 12000ms')
Module "Cassandra.Requests.RequestHandler", in GetNextValidHost
Module "Cassandra.Requests.RequestExecution", in Start
Module "Cassandra.Requests.RequestExecution", in RetryExecution
Module "System.Runtime.ExceptionServices.ExceptionDispatchInfo", in Throw
Module "System.Runtime.CompilerServices.TaskAwaiter", in HandleNonSuccessAndDebuggerNotification
Module "System.Runtime.CompilerServices.TaskAwaiter`1", in GetResult
Mike
  • Without logging information it will be hard to figure out the cause. https://docs.datastax.com/en/developer/csharp-driver/3.14/faq/#how-can-i-enable-logging-in-the-driver Also, the NoHostAvailableException should have a dictionary that contains the underlying exceptions that were thrown for each host, can you show us that information as well? – João Reis May 05 '20 at 09:36
  • Hmm, the logs have nothing else, just what I've already written (I used logging to console, though)... However, it only happens on a Windows machine (incl. Docker); when run on Linux (dotnet / Docker) it seems to work fine... Any idea? There is a good number of connections, but they are using a singleton session... – Mike May 06 '20 at 07:10
  • what log level are you using? try to set the log level to Information or even Verbose. Also, as I said that exception should contain information about the underlying exceptions that triggered it. You should probably edit your question and add the code that you are using to catch and print that exception (and the code that configures the driver logging as well) – João Reis May 06 '20 at 13:27
  • If you're using the Trace API, you can also try to set Cassandra.Diagnostics.CassandraStackTraceIncluded = true – João Reis May 06 '20 at 13:29
  • I've updated the log, but that is pretty much it: just "did not reply before timeout 12000ms"... It happens only on Windows / Windows Docker... It looks like the heartbeat is not working, as this happens when the worker app (3.1) throws an exception in the middle of the process (for some reason it tries to reconnect despite being idle, but that doesn't work: after 1-2 minutes, when the process restarts, it throws those errors about inaccessible hosts). – Mike May 06 '20 at 19:42
  • hmm if it's just a timeout then we don't really know much, you should use the cassandra node logs and nodetool commands to troubleshoot and figure out what is going on. Can you clarify the last part of your comment about heartbeat? You mean that the hosts are unresponsive but the app keeps trying to send requests to them because the heartbeat is not causing them to be marked as "DOWN" by the driver? – João Reis May 06 '20 at 22:33
  • Well, it is 100% an issue on the driver side on Windows, as I have other apps running on Linux hitting the same cluster and keyspace xx times a second. I actually suspect the Coravel library (the Cassandra code is a repo for core code that is accessed via a Coravel invocable), as Coravel seems to have issues running reliably, and I suspect it might block the thread and make it impossible for the Cassandra driver to use the heartbeat to keep the connection alive. The reason is that I have already seen Coravel make a Worker class stop working despite using a different thread... I will replace Coravel with a custom class. – Mike May 07 '20 at 08:40
  • Is the application sending requests continuously or is it idle? The heartbeat wouldn't really matter if there is a read every 0.2s (~ 1500 records) and around ~ 1000 inserts per second, like you said in the question. In any case, you can see if the driver is sending the heartbeats by checking the logs for a message at VERBOSE level like this: "Connection idling, issuing a Request to prevent idle disconnects" – João Reis May 07 '20 at 09:55
  • You mentioned several connections, you might want to stop using GetOrAdd because it will call `_cluster.Connect` multiple times until the value is on the dictionary and only one of the created sessions will be stored on the dictionary (therefore leaking sessions and connections). See https://stackoverflow.com/q/61366140/10896275 – João Reis May 07 '20 at 09:56
  • It is already using GetOrAdd concurrent dictionary with sessions so there is one session per one keyspace. – Mike May 07 '20 at 11:59
  • Yes but GetOrAdd will create multiple sessions if there are a lot of concurrent calls for the same keyspace, see the answers on the stackoverflow question that I linked in the previous comment – João Reis May 07 '20 at 15:20
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/213344/discussion-between-joao-reis-and-mike). – João Reis May 07 '20 at 15:33
  • Interesting issue! I also had a similar issue. It works OK in production but has problems in our dev environment. In my case the issue is with writes. Writes time out while other clients CAN do writes. One more interesting thing: when such a write failure happens, it happens with one particular record only. I.e. if you change it a bit, by adding or removing 1 char, it will insert OK. Otherwise the Cassandra C# driver will fail to insert it due to a timeout, even after a restart and even if you insert only this one record. At the same time, you can insert that record via e.g. DevCenter. – Stanislav Berkov Jan 23 '21 at 17:26
  • If you're able to reproduce this issue in a somewhat consistent manner it would be good to ask a new SO question, sharing with us the code, schema and requests so that we can reproduce it as well. You should also share logs, exception stacktraces, environment and cluster info, what libraries and frameworks are being used by the application besides the DataStax C# driver etc. – João Reis Jan 23 '21 at 21:38
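
The diagnostics requested in the comments above (driver logging, plus the per-host exceptions carried by `NoHostAvailableException`) could be collected with something like the following. This is a minimal sketch, not the asker's actual code: the class and method names (`CassandraDiagnosticsSketch`, `EnableDriverTracing`, `ExecuteLoggedAsync`) are hypothetical, and it assumes the driver's `System.Diagnostics.Trace` based logging rather than an `ILoggerProvider`.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Cassandra;

// Hypothetical helper; names are assumptions for illustration only.
public static class CassandraDiagnosticsSketch
{
    // Call once at startup, before building the Cluster.
    public static void EnableDriverTracing()
    {
        // Verbose level includes the heartbeat ("Connection idling...") messages.
        Cassandra.Diagnostics.CassandraTraceSwitch.Level = TraceLevel.Verbose;
        // Include stack traces in driver trace output, as suggested in the comments.
        Cassandra.Diagnostics.CassandraStackTraceIncluded = true;
        // Route System.Diagnostics.Trace output to the console.
        Trace.Listeners.Add(new ConsoleTraceListener());
    }

    public static async Task<RowSet> ExecuteLoggedAsync(ISession session, IStatement statement)
    {
        try
        {
            return await session.ExecuteAsync(statement).ConfigureAwait(false);
        }
        catch (NoHostAvailableException ex)
        {
            // Errors maps each tried host endpoint to the exception it produced.
            foreach (var error in ex.Errors)
            {
                Console.Error.WriteLine($"{error.Key}: {error.Value}");
            }
            throw; // Rethrow so the caller still sees the failure.
        }
    }
}
```

Printing the `Errors` entries shows whether every host failed with the same `OperationTimedOutException` or whether some hosts failed for a different reason, which the summary message alone does not reveal.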

1 Answer

I had a similar issue with Cassandra: it failed to insert a particular record (see "Cassandra C# driver - Windows debug, after few minutes I get 'All hosts tried for query failed (tried xxxx)'. 4 nodes, works fine in prod"). In the end I realized it was caused by a network issue. It did not happen when I connected to Cassandra directly from the office network; it occurred only when I connected to Cassandra through a VPN. It was something related to the maximum packet size.

Stanislav Berkov