Posting this out of the comments as a community wiki for better visibility; feel free to edit and expand.
The best option to solve this is Istio's Locality Load Balancing. Major points from the link:
A locality defines the geographic location of a workload instance within your mesh. The following triplet defines a locality:

- Region: Represents a large geographic area, such as us-east. A region typically contains a number of availability zones. In Kubernetes, the label topology.kubernetes.io/region determines a node's region.
- Zone: A set of compute resources within a region. By running services in multiple zones within a region, failover can occur between zones within the region while maintaining data locality with the end-user. In Kubernetes, the label topology.kubernetes.io/zone determines a node's zone.
- Sub-zone: Allows administrators to further subdivide zones for more fine-grained control, such as "same rack". The sub-zone concept doesn't exist in Kubernetes. As a result, Istio introduced the custom node label topology.istio.io/subzone to define a sub-zone.
That means that a pod running in zone bar of region foo is not considered to be local to a pod running in zone bar of region baz.
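For reference, locality-aware routing in Istio is configured through a DestinationRule with a localityLbSetting. The sketch below is only illustrative: the service host, namespace and region names (us-east, us-west) are placeholders for your own values, and outlier detection must be configured for locality failover to take effect.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service-locality
spec:
  host: my-service.my-namespace.svc.cluster.local   # placeholder service
  trafficPolicy:
    # Outlier detection is required, otherwise locality failover is not applied.
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Keep traffic in the client's own region; spill over to us-west
        # only when the us-east endpoints become unhealthy.
        failover:
        - from: us-east
          to: us-west
```

With something like this in place, Envoy keeps requests inside the caller's own region/zone and only fails over to the other region when the local endpoints are ejected as unhealthy.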
Another option, suggested in the comments (combined with adjusting the traffic balancing), is to use nodeAffinity to achieve consistency between the scheduling of pods and nodes in specific "regions":
There are currently two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as "hard" and "soft" respectively, in the sense that the former specifies rules that must be met for a pod to be scheduled onto a node (similar to nodeSelector but using a more expressive syntax), while the latter specifies preferences that the scheduler will try to enforce but will not guarantee. The "IgnoredDuringExecution" part of the names means that, similar to how nodeSelector works, if labels on a node change at runtime such that the affinity rules on a pod are no longer met, the pod continues to run on the node. In the future we plan to offer requiredDuringSchedulingRequiredDuringExecution which will be identical to requiredDuringSchedulingIgnoredDuringExecution except that it will evict pods from nodes that cease to satisfy the pods' node affinity requirements.

Thus an example of requiredDuringSchedulingIgnoredDuringExecution would be "only run the pod on nodes with Intel CPUs" and an example preferredDuringSchedulingIgnoredDuringExecution would be "try to run this set of pods in failure zone XYZ, but if it's not possible, then allow some to run elsewhere".
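As a minimal sketch of this option, assuming a hypothetical Deployment named my-app and a primary zone us-east-1a (adjust the label value to your own topology), a "soft" zone preference could look like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        nodeAffinity:
          # "Soft" rule: the scheduler prefers nodes in the primary zone,
          # but may place pods elsewhere if that zone has no capacity.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
          # For a "hard" rule, use requiredDuringSchedulingIgnoredDuringExecution
          # with nodeSelectorTerms instead; pods then stay Pending when no node
          # in the zone can host them.
      containers:
      - name: my-app
        image: nginx:1.25   # placeholder image
```

Note that this only influences where pods are scheduled; it does not by itself steer request traffic, which is why it is usually combined with the locality load balancing above or with topology-aware routing.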
Update: based on @mirekphd's comment, this will still not fully work the way the question asks:
It turns out that in practice Kubernetes does not really let us switch
off secondary zone, as soon as we spin up a realistic number of pod
replicas (just a few is enough to see it)... they keep at least some
pods in the other zone/DC/region by design (which is clever when you
realize that it removes the dependency on the docker registry
survival, at least under default imagePullPolicy for tagged images),
See GitHub issue #99630 - NodeAffinity preferredDuringSchedulingIgnoredDuringExecution doesn't work well.
Please refer to @mirekphd's answer.