I am often faced with the same problem as you do @frisky5 where i would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lot of nodes available.
That ain't always the case in real life, is it ;) ?
That being said, i see you essentially having 2 options:
- Externalize zookeeper configuration on a replicated storage system using 2 nodes (e.g. DRBD)
- Replicate Kafka data volumes entirely on the second nodes and use 2 one-node Kafka clusters that you switch on and off depending on who is the current master node.
I would go for the first option. In that case you would have 2 Kafka servers and one zookeeper server whose ip needs to be static (virtual ip). When the zookeeper node goes down, it is restarted one the second node with same VIP, but it needs to access the synchronized data folder.
I am not too familiar with zookeepers internals and i can't tell you whether it will go in conflict when starting up on a data store who "wasn't its own" but i would guess it makes sense for you to test it using a simple rsync setup.
Another way to achieve consensus if you are using a k3s based kubernetes cluster would be to rely on internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgresoperator by chruncydata because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) but i am not sure if Kafka/zookeeper are that flexible and can communicate with a rest API to set their locks ...
Once you have achieved this intermediate step, then you can use a PostgreSQL db as external source of truth for k3s and then it is as simple as syncing the postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is way more generic and could be used for other systems too.
Let me know what do you think about these two approaches and whether you manage to setup a test environment. If you do on GitHub i can help you out with implementation