
I'm running a two-cluster private GKE setup in europe-west2. I have a dedicated config cluster for MCI and a worker cluster for workloads. Both clusters are registered to the Anthos hub, and the ingress feature is enabled on the config cluster. In addition, the worker cluster runs the latest ASM, 1.12.2.

As far as MCI is concerned, my deployment is 'standard', in that it is based on the available docs (e.g. https://cloud.google.com/architecture/distributed-services-on-gke-private-using-anthos-service-mesh#configure-multi-cluster-ingress, the terraform-example-foundation repo, etc.).

Everything works, but I'm hitting an intermittent connectivity issue no matter how many times I redeploy the entire stack. My eyes are bleeding from staring at the logging dashboard. I've run out of dots to connect.

I'm probing some endpoints exposed from my cluster. Most of the time they return 200, with the following logged under resource.type="http_load_balancer":

{
httpRequest: {
 latency: "0.081658s"
 remoteIp: "20.83.144.189"
 requestMethod: "GET"
 requestSize: "360"
 requestUrl: "https://foo.bar.io/"
 responseSize: "1054"
 serverIp: "100.64.72.136"
 status: 200
 ...
}
insertId: "10mjvz4e8g0nq"
jsonPayload: {
 @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
 statusDetails: "response_sent_by_backend"
}
...
resource: {
 labels: {
  backend_service_name: "mci-4z8mmz-80-asm-ingress-mcs-istio"
  forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
  project_id: "prj-foo-bar"
  target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
  url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
  zone: "global"
 }
 type: "http_load_balancer"
}
severity: "INFO"
spanId: "2a986abfc69bef6f"
timestamp: "2022-02-04T15:24:14.160642Z"
...
}

At random intervals, anywhere between 1 and 5 hours, the probes start failing with 404 for a period of 5 to 10 minutes, and the following is logged:

{
httpRequest: {
 ...
 requestMethod: "GET"
 ...
 requestUrl: "https://foo.bar.io/"
 ...
 status: 404
 ...
}
insertId: "10mjvz4e8g0nq"
jsonPayload: {
 @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
 statusDetails: "internal_error"
}
...
resource: {
 labels: {
  backend_service_name: ""
  forwarding_rule_name: "mci-4z8mmz-fws-asm-ingress-mci-istio"
  project_id: "prj-foo-bar"
  target_proxy_name: "mci-4z8mmz-asm-ingress-mci-istio"
  url_map_name: "mci-4z8mmz-asm-ingress-mci-istio"
  zone: "global"
 }
 type: "http_load_balancer"
}
severity: "WARNING"
...
}

backend_service_name and serverIp disappear, and the external LB provisioned via MCI goes for an extended nap. If I try to access the endpoints in a browser during that period, I get 404'd and eventually a "connection was closed" error.

I've searched logs far and wide and cannot find any leads.

Has anyone experienced a similar issue? Could this be a regional thing? I have yet to try deploying to another region.

Any info/links/ideas much appreciated.

Edit:

I also confirmed that the health checks are fine and there are no transitions. The pods never receive the request, so the 404s are coming from the external LB.

red_guy
  • What does your MultiClusterIngress definition look like? – user140547 Feb 05 '22 at 13:36
  • Refer to this [documentation](https://cloud.google.com/kubernetes-engine/docs/how-to/troubleshooting-and-ops) for troubleshooting MCI errors, and try re-authenticating to the Google Cloud CLI using `gcloud auth login` – Goli Nikitha Feb 06 '22 at 08:46
  • @GoliNikitha thanks, but posting links to generally available documentation adds no new knowledge with respect to the issue described. – red_guy Feb 06 '22 at 10:40

1 Answer

I had the same or a very similar issue when using HTTPS with MultiClusterIngress.

Google support suggested using a literal static IP for the annotation:

networking.gke.io/static-ip: STATIC_IP_ADDRESS

Try using a literal IP like

34.102.201.47

instead of the resource URI

https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/addresses/ADDRESS_NAME

as described in https://cloud.google.com/kubernetes-engine/docs/how-to/multi-cluster-ingress#static

If that doesn't solve the issue, try contacting Google Support.
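
For illustration, here is a minimal sketch of what the MultiClusterIngress might look like with the literal IP in the annotation. The resource name, namespace, MultiClusterService name and port are assumptions guessed from the load balancer resource names in the question's logs, and the IP is just the example value from above; substitute your own. The literal address of a reserved global IP can be looked up with `gcloud compute addresses describe ADDRESS_NAME --global --format="value(address)"`.

```yaml
apiVersion: networking.gke.io/v1
kind: MultiClusterIngress
metadata:
  name: mci-istio            # assumed name, inferred from the url_map_name label
  namespace: asm-ingress     # assumed namespace, inferred from the same label
  annotations:
    # Literal IP of the reserved global address, NOT the resource URI
    # (https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/addresses/ADDRESS_NAME)
    networking.gke.io/static-ip: 34.102.201.47
spec:
  template:
    spec:
      backend:
        serviceName: mcs-istio   # assumed MultiClusterService name
        servicePort: 80          # assumed port, inferred from backend_service_name
```

Only the annotation value matters for the fix; the rest of the spec should stay whatever you already have deployed.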

user140547
  • Interesting. I am indeed referencing the resource URI in the static-ip annotation. Will give that a go and confirm. I really appreciate it! – red_guy Feb 05 '22 at 16:41
  • Absolutely spot on. Not a single 404 over the last 24 hours. With this very intimate nugget of information you saved me hours of code review and researching alternative approaches, @user140547. Not to mention whatever is left of my sanity. Thank you! – red_guy Feb 06 '22 at 10:45