
We run a Kubernetes-compatible cluster (OKD 3.11) on-prem / in a private cloud, with backend apps communicating with low-latency Redis databases used as caches and K/V stores. The new architecture design divides worker nodes equally between two geographically distributed data centers ("regions"). We can assume a static pairing between node names and regions, and we have now also labeled the nodes with region names.

What would be the recommended approach to protect low-latency communication with the in-memory databases, making client apps stick to the same region as the database they are allowed to use? Spinning up additional replicas of the databases is feasible, but does not prevent round-robin routing between the two regions...

Related: Kubernetes node different region in single cluster

mirekphd
  • (Only) Part of the solution is to achieve workload colocation by using node affinity/anti-affinity in your workloads. k8s affinity and anti-affinity doc here: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity. Now there is the LB problem between the apps and the DB: use one service per database replica? – titou10 Nov 17 '21 at 21:07
  • @titou10 - a convergent solution: node affinity is what I found in the OKD/OCP docs too. I also recommended labeling the nodes in all centers (`$ oc label node =`), because that reduces the number of pods required to run (affinity relying on node names would require, for HA, running at least as many pods as there are nodes, with pods 'pinned' to nodes, rather than as few as there are regions). 1 DB per region seems sufficient IMO – mirekphd Nov 17 '21 at 23:10
  • Well, the **best** option would be using the `istio` feature [Locality Load Balancing](https://istio.io/latest/docs/tasks/traffic-management/locality-load-balancing/), which can help you to achieve exactly what you're looking for. – moonkotte Nov 18 '21 at 13:17
  • Does `nodeSelector` work in your case? – Gawain Nov 19 '21 at 09:36
  • @Gawain: `nodeSelector` is riskier in case of zone/region failures than `nodeAffinity` (the `preferred*` version, unlike the `required*` one, permits eviction of pods to another node/region if the preferred one becomes unavailable; this was also recommended by Red Hat) – mirekphd Nov 19 '21 at 10:14

2 Answers


Posting this out of comments as community wiki for better visibility, feel free to edit and expand.


The best option to solve this is to use Istio's [Locality Load Balancing](https://istio.io/latest/docs/tasks/traffic-management/locality-load-balancing/). Major points from the link:

A locality defines the geographic location of a workload instance within your mesh. The following triplet defines a locality:

  • Region: Represents a large geographic area, such as us-east. A region typically contains a number of availability zones. In Kubernetes, the label topology.kubernetes.io/region determines a node’s region.

  • Zone: A set of compute resources within a region. By running services in multiple zones within a region, failover can occur between zones within the region while maintaining data locality with the end-user. In Kubernetes, the label topology.kubernetes.io/zone determines a node’s zone.

  • Sub-zone: Allows administrators to further subdivide zones for more fine-grained control, such as “same rack”. The sub-zone concept doesn’t exist in Kubernetes. As a result, Istio introduced the custom node label topology.istio.io/subzone to define a sub-zone.

That means that a pod running in zone bar of region foo is not considered to be local to a pod running in zone bar of region baz.
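As a rough sketch of what that could look like in practice (the service host `redis.default.svc.cluster.local` and the region names `region-a`/`region-b` are placeholders, not taken from the question), a `DestinationRule` that enables locality failover between the two data centers might be:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: redis
spec:
  host: redis.default.svc.cluster.local   # placeholder service host
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        failover:
          # prefer endpoints in the client's own region;
          # fail over to the other region only when local ones are unhealthy
          - from: region-a
            to: region-b
          - from: region-b
            to: region-a
    # outlier detection must be configured for locality load balancing
    # to take effect (it is how Istio decides an endpoint is unhealthy)
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 1m
```

Note that per the Istio docs, locality load balancing only activates when outlier detection is configured on the same `DestinationRule`.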


Another option, which can be combined with traffic-balancing adjustments, is suggested in the comments:

use nodeAffinity to achieve consistency between scheduling pods and nodes in specific "regions".

There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (similar to nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be identical to requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.

Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
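Applied to this question's setup, a "soft" region preference could look like the sketch below (the label key `topology.kubernetes.io/region`, the region value, and the image are illustrative; the question's cluster uses its own region labels):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-client
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis-client
  template:
    metadata:
      labels:
        app: redis-client
    spec:
      affinity:
        nodeAffinity:
          # "soft" rule: the scheduler tries to honor it,
          # but can still place pods elsewhere if the region is down
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/region
                    operator: In
                    values:
                      - region-a        # placeholder region name
      containers:
        - name: app
          image: example/app:latest     # placeholder image
```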

Update: based on @mirekphd's comment, this will still not fully work the way the question asks:

It turns out that in practice Kubernetes does not really let us switch off the secondary zone: as soon as we spin up a realistic number of pod replicas (just a few is enough to see it), it keeps at least some pods in the other zone/DC/region by design (which is clever when you realize that it removes the dependency on the docker registry's survival, at least under the default `imagePullPolicy` for tagged images). See GitHub issue #99630 - NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well

Please refer to @mirekphd's answer

moonkotte
  • It turns out that in practice Kubernetes does not really let us switch off secondary zone, as soon as we spin up a realistic number of pod replicas (just a few is enough to see it)... they keep at least some pods in the other zone/DC/region by design (which is clever when you realize that it removes the dependency on the docker registry survival, at least under default `imagePullPolicy` for tagged images)... see: https://github.com/kubernetes/kubernetes/issues/99630#issuecomment-790740081 – mirekphd Nov 19 '21 at 18:44

So an effective region-pinning solution is more complex than just using `nodeAffinity` in the "preferred" version. That alone will cause you a lot of unpredictable surprises due to the opinionated character of Kubernetes, which has zone spreading hard-coded, as seen in this GitHub issue, where the maintainers clearly try to put at least some eggs in another basket and see zone selection as an antipattern.

In practice the usefulness of `nodeAffinity` alone is restricted to scenarios with a very limited number of pod replicas, because once the pod count exceeds the number of nodes in a region (i.e. typically with the 3rd replica in a 2-node / 2-region setup), the scheduler starts "correcting" or "fighting" the user's preference weights (even ones as unbalanced as 100:1) very much in favor of spreading, placing at least one "representative" pod on every node in every region (including the non-preferred ones with the minimum possible weight of 1).

But this default zone spreading can be overcome if you create a single-replica container that acts as a "master" or "anchor" (a natural example being a database). For this single-pod "master", `nodeAffinity` will still work correctly - of course in the HA variant, i.e. the "preferred" rather than the "required" version. For the remaining multi-pod apps, you use something else: `podAffinity` (this time in the "required" version), which makes the "slave" pods follow their "master" between zones, because setting any pod-based spreading disables the default zone spreading. You can have as many replicas of the "slave" pods as you want and never run into a single misplaced pod (at least at schedule time), thanks to the "required" affinity used for the "slaves". Note that the known limitation of `nodeAffinity` applies here as well: the number of "master" pod replicas must not exceed the number of nodes in a region, or else "zone spreading" will kick in.
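A minimal sketch of the "slave" side of this pattern (the label `app: redis-master` on the single-replica master, the topology key, and the image are illustrative assumptions): the clients require co-location, per region, with whichever node currently hosts the master pod:

```yaml
# Multi-replica "slave" deployment: a *required* podAffinity on the
# region topology key forces every replica into the same region as
# the pod labeled app=redis-master, and (per the answer above)
# setting pod-based affinity disables the default zone spreading.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-client
spec:
  replicas: 6
  selector:
    matchLabels:
      app: redis-client
  template:
    metadata:
      labels:
        app: redis-client
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: redis-master          # label carried by the "anchor" pod
              topologyKey: topology.kubernetes.io/region
      containers:
        - name: app
          image: example/app:latest          # placeholder image
```

If the "master" is rescheduled into the other region (its "preferred" `nodeAffinity` permits that on failure), newly scheduled "slave" pods will follow it there.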

And here's an example of how to label the "master" pod correctly for the benefit of `podAffinity`, using a deployment config YAML file: https://stackoverflow.com/a/70041308/9962007

mirekphd