
I am upgrading Airflow from version 1.10 to 2.1.0. My project uses KubernetesPodOperator to run tasks with the KubernetesExecutor. Everything worked fine on Airflow 1.10, but after upgrading to 2.1.0 the pods run their tasks successfully and then restart, ending up in CrashLoopBackOff status. I have checked the livenessProbe and it is working as expected. I have also checked the other logs, but I could not find any issues in any of the containers or pods.

deployment.yaml file:

# Airflows
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  selector:
    matchLabels:
      app: airflow
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow
    spec:
      hostAliases:
      - ip: "xx.xx.xx.xx"
        hostnames:
        - "xxx.xxx.xxx"
      initContainers:
        - name: init-db
          image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
          imagePullPolicy: Always
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/usr/local/bin/bootstrap.sh"
          env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              secretKeyRef:
                key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                name: airflow-secrets
          - name: AFPW
            valueFrom:
              secretKeyRef:
                key: AFPW
                name: airflow-secrets
      containers:
      - name: web
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        ports:
        - name: web
          containerPort: 8080
        command:
          - "airflow"
        args:
          - "webserver"
        livenessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 240
          periodSeconds: 60
        env:
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              name: airflow-secrets
## The following values have been created as part of production setup
      - name: scheduler
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        command:
          - "airflow"
        args:
          - "scheduler"
        env:
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              name: airflow-secrets

Describing pod:

Name:         airflow-66776dc57c-z98vd
Namespace:    default
Priority:     0
Node:         gke-gke-xxxxx-de-nodes-xxxxx--ccb62dc3-24us/xxx.xx.xx.xx
Start Time:   Sat, 19 Jun 2021 17:49:16 +0000
Labels:       app=airflow
              pod-template-hash=66776dc57c
Annotations:  <none>
Status:       Running
IP:           xxx.xx.xx.xx
IPs:
  IP:           xxx.xx.xx.xx
Controlled By:  ReplicaSet/airflow-66776dc57c
Init Containers:
  init-db:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      /usr/local/bin/bootstrap.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jun 2021 17:50:04 +0000
      Finished:     Sat, 19 Jun 2021 17:50:23 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Containers:
  web:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      airflow
    Args:
      webserver
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:24 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8080/ delay=240s timeout=1s period=60s #success=1 #failure=3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
  scheduler:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      airflow
    Args:
      scheduler
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:25 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-kw529:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw529
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

Worker pods list and logs


1 Answer

restartPolicy: Always

Always means that the container will be restarted even if it exited with a zero exit code (i.e. successfully). This is the default, so to let a completed pod stay completed you need to explicitly specify restartPolicy: Never.
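For the KubernetesExecutor in Airflow 2 you can set this on the worker pods via a pod template file (referenced by the AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE setting). A minimal sketch, assuming your existing image values; the file name and metadata name here are illustrative, not from your setup:

```yaml
# pod_template_file.yaml -- example worker pod template (hypothetical name/path)
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template
spec:
  # Never: a pod that finishes with exit code 0 stays Completed
  # instead of being restarted into CrashLoopBackOff
  restartPolicy: Never
  containers:
    # the executor expects the main container to be named "base"
    - name: base
      image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
```

Note that restartPolicy applies to all containers in the pod, so it belongs on the worker pod spec, not on the long-running webserver/scheduler Deployment (a Deployment's pods must keep restartPolicy: Always).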

See Why does starting daskdev/dask into a Pod fail? for an almost identical case.
