
I set up a Citus 11 cluster using the official Docker and Docker Compose instructions. I scaled the cluster to 5 worker nodes and created a distributed table with a replication factor of 3.

I expected the Citus cluster to keep working if I shut down one worker node. But when I tested that by stopping one worker, the whole table stopped working. Simple queries like `select *` or `select count(*)` block until I restart the worker node.

So, my question is: does this mean Citus no longer has a built-in high availability function? I am confused about how sharding, replication, and the distributed query engine fit together. In theory, even with one node down, there are still 2 copies of the data in the cluster. Citus should easily be able to find those replica shards and query the data from there; this is how most distributed databases work.

If that's the case, increasing the number of servers will dramatically increase the failure rate of the whole cluster, and if I then use the old hot-standby approach to replicate each worker, that's a big increase in budget.

Please tell me I am wrong. I am planning a SaaS service, and the relational database is currently the only bottleneck of the system. I am trying to build a "scalable solution" with sophisticated backup/restore and high availability functionality.

PleaseLetMeGo

1 Answer


Citus has native high availability for read queries, but write queries will fail if one node is down (otherwise the replicas would end up with different data).

The first question I have is: How did you stop your worker? Did you stop the postgres process? Did you shut down the machine? Did you block all network traffic to it?

In the last two cases it's likely that the Linux kernel has not reported the TCP socket as broken yet, so Citus simply continues waiting for a response. This is something to consider in many distributed systems, and it requires configuring the client with sensible TCP timeouts. For Citus this can be done by putting the following in your postgresql.conf file and reloading/restarting Postgres:

citus.node_conninfo = 'keepalives=1 keepalives_idle=30 keepalives_interval=15 keepalives_count=2 tcp_user_timeout=10000 connect_timeout=25 sslmode=require'
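Equivalently, instead of editing postgresql.conf by hand, the same setting can be applied from a SQL session and picked up with a reload. This is a sketch using standard Postgres facilities (`ALTER SYSTEM` writes the value to postgresql.auto.conf):

```sql
-- Sketch: set citus.node_conninfo without editing postgresql.conf directly.
-- pg_reload_conf() makes the running server pick up the new value.
ALTER SYSTEM SET citus.node_conninfo TO
  'keepalives=1 keepalives_idle=30 keepalives_interval=15 keepalives_count=2 tcp_user_timeout=10000 connect_timeout=25 sslmode=require';
SELECT pg_reload_conf();
```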

These settings do 3 things:

  1. Detect that idle cached connections are broken at most 60 seconds after the other side becomes unresponsive (30 + 15 * 2)
  2. If the other side doesn't acknowledge any TCP messages in 10 seconds the connection is marked as broken
  3. Allow 25 seconds to successfully establish a connection before the connection attempt is considered failed (connect_timeout=25).
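The worst-case detection window in item 1 can be checked with a quick back-of-the-envelope calculation: an idle connection waits `keepalives_idle` seconds before probing, then sends `keepalives_count` probes spaced `keepalives_interval` seconds apart before the kernel declares the peer dead.

```python
# Worst-case time (seconds) for the kernel to mark an idle connection
# as broken under the keepalive settings above.
keepalives_idle = 30      # seconds of idleness before the first probe
keepalives_interval = 15  # seconds between unanswered probes
keepalives_count = 2      # unanswered probes before giving up

detection_window = keepalives_idle + keepalives_interval * keepalives_count
print(detection_window)  # 60
```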

If you want tighter timeouts you can change these values accordingly.

What is recommended?

In general though, for high availability we recommend using regular Postgres high availability solutions like Patroni or pg_auto_failover. These can provide HA for writes too. A useful resource on this is: https://laravel.wiki/62687acd9a725.html

JelteF