
I am currently trying to move my Calico-based clusters to the new Dataplane V2, which is essentially a managed Cilium offering. For local testing, I am running k3d with open-source Cilium installed, and I created a set of NetworkPolicies (native Kubernetes ones, not CiliumNetworkPolicies) which lock down the desired namespaces.

My current issue is that when I port the same policies to a GKE cluster (with Dataplane V2 enabled), they don't work.

As an example, let's look at the connection between an app and a database:

---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-server.db-client
  namespace: BAR
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: server
  policyTypes:
    - Ingress
  ingress:
    - ports: []
      from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: FOO
          podSelector:
            matchLabels:
              policy.ory.sh/db: client
---
kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: db-client.db-server
  namespace: FOO
spec:
  podSelector:
    matchLabels:
      policy.ory.sh/db: client
  policyTypes:
    - Egress
  egress:
    - ports:
        - port: 26257
          protocol: TCP
      to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: BAR
          podSelector:
            matchLabels:
              policy.ory.sh/db: server
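
One pitfall worth ruling out when porting policies between CNIs: inside a `from`/`to` list, `namespaceSelector` and `podSelector` are ANDed only when they sit in the same list item. With one stray `-`, they become two independent peers that are ORed, which allows far more traffic than intended, and every CNI will still happily accept the YAML. A sketch of the broken variant, for contrast with the ingress policy above:

```yaml
# BROKEN variant: the extra "-" before podSelector turns this into two OR'd
# rules: (any pod in namespace FOO) OR (any pod labeled policy.ory.sh/db=client
# in the policy's OWN namespace) -- not the intended AND of both selectors.
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: FOO
      - podSelector:
          matchLabels:
            policy.ory.sh/db: client
```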

Using GCP monitoring tools, we can also see the expected and actual effect the policies have on connectivity:

Expected: [screenshot: expected connectivity]

Actual: [screenshot: actual connectivity]

And here are the logs from the application trying to connect to the DB and being denied:

{
  "insertId": "FOO",
  "jsonPayload": {
    "count": 3,
    "connection": {
      "dest_port": 26257,
      "src_port": 44506,
      "dest_ip": "172.19.0.19",
      "src_ip": "172.19.1.85",
      "protocol": "tcp",
      "direction": "egress"
    },
    "disposition": "deny",
    "node_name": "FOO",
    "src": {
      "pod_name": "backoffice-automigrate-hwmhv",
      "workload_kind": "Job",
      "pod_namespace": "FOO",
      "namespace": "FOO",
      "workload_name": "backoffice-automigrate"
    },
    "dest": {
      "namespace": "FOO",
      "pod_namespace": "FOO",
      "pod_name": "cockroachdb-0"
    }
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "project_id": "FOO",
      "node_name": "FOO",
      "location": "FOO",
      "cluster_name": "FOO"
    }
  },
  "timestamp": "FOO",
  "logName": "projects/FOO/logs/policy-action",
  "receiveTimestamp": "FOO"
}
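
Entries like this get noisy at scale; a small jq one-liner (a sketch, run here against a trimmed local copy of the entry above) collapses each record to the fields that matter when scanning for denies:

```shell
# Trimmed copy of the policy-action entry above, saved locally for illustration.
cat > /tmp/policy-log.json <<'EOF'
{
  "jsonPayload": {
    "disposition": "deny",
    "connection": {"dest_port": 26257, "protocol": "tcp", "direction": "egress"},
    "src": {"pod_name": "backoffice-automigrate-hwmhv", "namespace": "FOO"},
    "dest": {"pod_name": "cockroachdb-0", "namespace": "FOO"}
  }
}
EOF

# One line per entry: verdict, src pod -> dest pod, port, direction.
jq -r '.jsonPayload
  | "\(.disposition): \(.src.namespace)/\(.src.pod_name) -> \(.dest.namespace)/\(.dest.pod_name) (\(.connection.protocol)/\(.connection.dest_port), \(.connection.direction))"' \
  /tmp/policy-log.json
# -> deny: FOO/backoffice-automigrate-hwmhv -> FOO/cockroachdb-0 (tcp/26257, egress)
```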

EDIT:

My local env is a k3d cluster created via:

k3d cluster create --image ${K3SIMAGE} --registry-use k3d-localhost -p "9090:30080@server:0" \
            -p "9091:30443@server:0" foobar \
            --k3s-arg=--kube-apiserver-arg="enable-admission-plugins=PodSecurityPolicy,NodeRestriction,ServiceAccount@server:0" \
            --k3s-arg="--disable=traefik@server:0" \
            --k3s-arg="--disable-network-policy@server:0" \
            --k3s-arg="--flannel-backend=none@server:0" \
            --k3s-arg=feature-gates="NamespaceDefaultLabelName=true@server:0"

docker exec k3d-server-0 sh -c "mount bpffs /sys/fs/bpf -t bpf && mount --make-shared /sys/fs/bpf"
kubectl taint nodes k3d-ory-cloud-server-0 node.cilium.io/agent-not-ready=true:NoSchedule --overwrite=true
skaffold run --cache-artifacts=true -p cilium --skip-tests=true --status-check=false
docker exec k3d-server-0 sh -c "mount --make-shared /run/cilium/cgroupv2"

Cilium itself is installed by Skaffold via Helm with the following parameters:

name: cilium
remoteChart: cilium/cilium
namespace: kube-system
version: 1.11.0
upgradeOnChange: true
wait: false
setValues:
  externalIPs.enabled: true
  nodePort.enabled: true
  hostPort.enabled: true
  hubble.relay.enabled: true
  hubble.ui.enabled: true

UPDATE: I have set up a third environment: a GKE cluster using the old Calico CNI (legacy dataplane), with Cilium installed manually as shown here. Cilium is working fine, and even Hubble works out of the box (unlike with Dataplane V2...), and I found something interesting. The rules behave the same as with the GKE-managed Cilium, but with Hubble working I was able to see this:

[screenshot: Hubble db connection]

For some reason Cilium/Hubble cannot identify the db pod and resolve its labels, and since the labels aren't resolved, the policies that rely on them don't work either.

Further proof of this is the trace log from Hubble:

[screenshot: Hubble trace, kratos -> db]

Here the destination app is identified only by an IP, not by labels.
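
One cluster-side way to confirm a missing identity is to check the objects Cilium maintains per pod (a sketch; resource and DaemonSet names are the standard ones, but on GKE Dataplane V2 the agent DaemonSet is `anetd` rather than `cilium`):

```shell
# Does Cilium know about the db pod? A healthy CiliumEndpoint carries the
# pod's labels and a numeric security identity; a missing or label-less
# entry would explain label-based policies never matching.
kubectl -n BAR get ciliumendpoints.cilium.io
kubectl get ciliumidentities.cilium.io          # identities are cluster-scoped

# Same view from inside the agent (use ds/anetd on GKE Dataplane V2):
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
```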

The question now is: why is this happening?

Any idea how to debug this problem? Where could the difference be coming from? Do the policies need some tuning for the managed Cilium, or is this a bug in GKE? Any help/feedback/suggestions appreciated!

  • Please provide more details about your first environment and your current env. So at first you had a local cluster on k3d and then you moved to GKE and created a new cluster with `Dataplane V2` enabled? Did you follow any documentation? Are you using the `default GCE Service Account` for the GKE VMs? Did you try to check [this troubleshooting guide](https://cloud.google.com/kubernetes-engine/docs/how-to/dataplane-v2)? Could you also confirm that your environment/setup doesn't hit any of the limitations [from here](https://cloud.google.com/kubernetes-engine/docs/concepts/dataplane-v2#limitations)? – PjoterS Dec 24 '21 at 08:13
  • Is your db pod still listening on port 26257? Also, maybe just an issue with you masking the namespaces in the policy you posted, but are the ingress and egress policies targeting the same or different namespaces? The policy has different namespaces but the error shows the same. – Gari Singh Dec 24 '21 at 11:19
  • @PjoterS thanks for the response, I have added my k3d setup to the issue. Yes, the clusters are created from scratch with Dataplane V2, I see no errors in the anetd pods, and some policies (like DNS) work, while others do not. – Demonsthere Jan 03 '22 at 10:22
  • @GariSingh: Yes, the app is listening on that port. What do you mean by masking? I am using the [AutoLabel](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#automatic-labelling) feature as policies operate on labels and not namespace names. The db server is in namespace BAR and client in FOO – Demonsthere Jan 03 '22 at 10:22
  • @Demonsthere - I just assumed the labels matched the namespaces. I could be reading things wrong, but looks like the policy allows Egress to BAR and Ingress from FOO but the error message seemed to show that both src and dest were in FOO? – Gari Singh Jan 03 '22 at 11:06
  • @GariSingh yes, the `db-server.db-client` policy is in the BAR ns, and allows ingress connections from the FOO ns for all pods with the label `policy.ory.sh/db: client`. Likewise, the `db-client.db-server` policy is in the FOO namespace, and allows all pods with the label `policy.ory.sh/db: client` to connect to namespace BAR, to pods with the `policy.ory.sh/db: server` label on port `26257`. In the GKE log we can see that the connection is being blocked on the egress leg of the connection, so client -> server – Demonsthere Jan 03 '22 at 11:48
  • What GKE version are you using? Did you try to deploy Cilium using Helm? Could you share all your steps for how you created the cluster and deployed Cilium? I've tried the docs but it seems like they're outdated. Do you have any pod with the name `anetd-` in your cluster? – PjoterS Jan 04 '22 at 11:43
  • Hello, how did you see those monitoring resources? I am having a very similar problem, but not using ArgoCD – Sachin Meier Apr 28 '22 at 22:07
  • @SachinMeier With the help of [Hubble GKE exporter](https://github.com/rueian/gke-hubble-export) ;) – Demonsthere Apr 30 '22 at 07:31

1 Answer


Update: I was able to solve the mystery, and it was ArgoCD all along. Cilium creates a CiliumEndpoint and a CiliumIdentity object for each pod in the namespace, and Argo was deleting them after deploying the applications.

For anyone who stumbles on this, the solution is to add this exclusion to the ArgoCD configuration:

  resource.exclusions: |
    - apiGroups:
      - cilium.io
      kinds:
      - CiliumIdentity
      - CiliumEndpoint
      clusters:
      - "*"
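
For context, that key lives in the `argocd-cm` ConfigMap. A minimal sketch of the full object (assuming the standard install in the `argocd` namespace; adjust metadata to match your deployment):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  resource.exclusions: |
    - apiGroups:
      - cilium.io
      kinds:
      - CiliumIdentity
      - CiliumEndpoint
      clusters:
      - "*"
```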
  • This is also a solution for anyone on DigitalOcean, which has cilium.io enabled on their managed Kubernetes clusters. You'll need to add the above to the `argocd-cm` ConfigMap. Learn more at https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/ and see an example at https://github.com/argoproj/argo-cd/blob/master/docs/operator-manual/argocd-cm.yaml – Timofey Drozhzhin Jul 26 '22 at 22:30