I setup the citus 11 cluster using the official docker and docker compose instruction. I scaled the cluster to 5 worker nodes and created a distributed table, with replication factor 3.
I expected the Citus cluster to work if I shutdown one worker node. But when I test that, by stopping one worker. The whole table stop to work. Simple query like select * or select count(*) will block until I restart the worker node.
So, my question is, does this mean, Citus now does not have built-in high availability function? I am confused about the sharding, replication and distributed query engine. Theoretically, even one node is down, there are still 2 copies of the data in the cluster. Citus can easily find those replication shard and query data from there. And this is how most distributed database work.
If that's the truth, increasing the server number will dramatically increase the failure rate of the whole cluster and then if I use the old hot-standby node to replicate each worker, that's a big increase of budget.
Please tell me I am wrong. I am trying to plan for a SaaS service and the relational database is now the only bottleneck of the system. So I am trying to make a "scalable solution" with sophisticate backup/restore and highly available functionality.