
I'm using a rolling update strategy for deployment, using these commands:

kubectl patch deployment.apps/<deployment-name> -n <namespace> -p '{\"spec\":{\"template\":{\"metadata\":{\"labels\":{\"date\":\"`date +'%s'`\"}}}}}' 
kubectl apply -f ./kube.deploy.yml -n <namespace>
kubectl apply -f ./kube_service.yml -n <namespace>

YAML properties for rolling update:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: "applyupui-persist-service-deployment"
spec:
  # this replicas value is default
  # modify it according to your case
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 20%
  template:
    metadata:
      labels:
        app: "applyupui-persist-service-selector"
    spec:
      hostAliases:
        - ip: "xx.xx.xx.xxx"
          hostnames:
          - "kafka02.prod.fr02.bat.cloud"
      imagePullSecrets:
        - name: tpdservice-devops-image-pull-secret
      containers:
        - name: applyupui-persist-service
          image: gbs-bat-devops-preprod-docker-local.artifactory.swg-devops.com:443/applyupui-msg-persist-service:latest
          imagePullPolicy: Always
          env:
          - name: KAFKA_BROKER
            value: "10.194.6.221:9092,10.194.6.221:9093,10.194.6.203:9092"
          - name: SCYLLA_DB
            value: "scylla01.fr02.bat.cloud,scylla02.fr02.bat.cloud,scylla03.fr02.bat.cloud"
          - name: SCYLLA_PORT
            value: "9042"
          - name: SCYLLA_DB_USER_ID
            value: "kafcons"
          - name: SCYLLA_DB_PASSWORD
            value: "@%$lk*&we@45"
          - name: SCYLLA_LOCAL_DC_NAME
            value: "Frankfurt-DC"
          - name: DC_LOCATION
            value: "FRA"
          - name: kafka.consumer.retry.topic.timeout.interval
            value: "100"
          - name: kafka.consumer.retry.topic.max.retry.count
            value: "5"
          - name: kafka.consumer.dlq.topic.timeout.interval
            value: "100"
          - name: kafka.producer.timeout.interval
            value: "100"
          - name: debug.log.enabled
            value: "false"
          - name: is-application-intransition-phase
            value: "false"
          - name: is-grace-period
            value: "false"
          - name: SCYLLA_KEYSPACE
            value: "bat_tpd_pri_msg"
          readinessProbe:
            httpGet:
              path: /greeting
              port: 8080
            initialDelaySeconds: 3
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
      nodeSelector:
        deployment: frankfurt
        # resources:
        #   requests:
        #     cpu: 100m
        #     memory: 100Mi
I tried changing the maxSurge and maxUnavailable parameters and different initialDelaySeconds values. Additionally, I tried adding a livenessProbe:

livenessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

None of it worked. I still get connection errors indicating that some pod is down, and hence there is downtime.

VAS

1 Answer


First of all, make sure your YAML file is correct and all indentation is in place; a client-side dry run, shown below, is a quick way to catch such mistakes. After that, you need to set the values right in order to achieve a zero-downtime update.
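A sketch of that check, assuming the file and namespace names from your commands (recent kubectl releases use --dry-run=client; older ones accept plain --dry-run):

kubectl apply --dry-run=client -f ./kube.deploy.yml -n <namespace>

The examples below show correctly defined RollingUpdate strategies: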

spec:
  replicas: 2
  strategy:
   type: RollingUpdate
   rollingUpdate:
     maxSurge: 1
     maxUnavailable: 0

In this example there would be one additional Pod (maxSurge: 1) above the desired number of 2, and the number of available Pods cannot go below that desired number (maxUnavailable: 0).

With this config, Kubernetes will spin up an additional Pod, then stop an “old” one. If there’s another Node available to host this Pod, the system will be able to handle the same workload during the deployment. If not, the Pod will be deployed on an already used Node at the cost of resources for the other Pods hosted on that Node.
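To see this behaviour while it happens, you can watch the rollout; the deployment name below is the one from your manifest:

kubectl rollout status deployment/applyupui-persist-service-deployment -n <namespace>
kubectl get pods -n <namespace> -w

kubectl rollout status blocks until the rollout finishes, and the Pod watch lets you confirm that a new Pod becomes Ready before an old one is terminated.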

You can also try something like this:

spec:
  replicas: 2
  strategy:
   type: RollingUpdate
   rollingUpdate:
     maxSurge: 0
     maxUnavailable: 1

With the example above there would be no additional Pods (maxSurge: 0) and only a single Pod at a time would be unavailable (maxUnavailable: 1).

In this case, Kubernetes will first stop a Pod before starting up a new one. The advantage is that the infrastructure doesn’t need to scale up, but the maximum workload the Deployment can handle during the rollout is lower; with replicas: 2 this means a single Pod serves all traffic while the other is being replaced.

If you choose to use percentage values for maxSurge and maxUnavailable, you need to remember that (a worked example follows this list):

  • maxSurge - the absolute number is calculated from the percentage by rounding up

  • maxUnavailable - the absolute number is calculated from percentage by rounding down
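With the values from your manifest this matters: for replicas: 2, a maxUnavailable of 20% works out to 2 × 0.20 = 0.4, which rounds down to 0, so no old Pod may be removed until its replacement is Ready; a hypothetical maxSurge of 20% would round up to 1 extra Pod.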

With the RollingUpdate defined correctly you also have to make sure your application provides an endpoint that Kubernetes can query for the app’s status. Below is a readinessProbe for a /greeting endpoint that returns HTTP 200 when the app is ready to handle requests and HTTP 500 when it’s not:

readinessProbe:
  httpGet:
    path: /greeting
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
  • initialDelaySeconds - Time (in seconds) before the first check for readiness is done.

  • periodSeconds - Time (in seconds) between two readiness checks after the first one.

  • successThreshold - Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness. Minimum value is 1.

  • timeoutSeconds - Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
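If you want to check the endpoint by hand before relying on the probe, you can reach it directly; the deployment name and port below are the ones from your manifest:

kubectl port-forward deployment/applyupui-persist-service-deployment 8080:8080 -n <namespace>
curl -i http://localhost:8080/greeting

An HTTP status in the 2xx/3xx range means the probe would pass; a 500 or a timeout means the Pod would be taken out of the Service endpoints and receive no traffic.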

I have tested the above scenarios with success.

Please let me know if that helped.

Wytrzymały Wiktor