I have a Kubernetes cluster with the SQL Server Always On Availability Groups operator deployed on it. After 2 or 3 days, the SQL Server pods start restarting rapidly and stop working until I delete them. The StatefulSets then recreate the pods, and they work for another 2 or 3 days before the problem returns.

What is happening to them?

These are my logs:

[health] ERROR: 2019/04/16 14:49:11 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:11 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:12 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:12 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:13 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:13 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[health] ERROR: 2019/04/16 14:49:14 Could not connect to local SQL instance: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:14 could not connect to local sqlservr: Unresponsive or down Unable to open tcp connection with host '127.0.0.1:1433': dial tcp 127.0.0.1:1433: getsockopt: connection refused
[supervisor] 2019/04/16 14:49:15 Getting replica name...
[supervisor] 2019/04/16 14:49:15 Replica name [mssql3-0]
[supervisor] 2019/04/16 14:49:16 Getting replica name...
[supervisor] 2019/04/16 14:49:16 Received a notification of type ADDED for secret mssql3-statefulset-secret with ResourceVersion 328866
[supervisor] 2019/04/16 14:49:16 Updating Ag Secret for ag ag1
[supervisor] 2019/04/16 14:49:16 Cached resource version: 0, current resource version: 639780
[health] 2019/04/16 14:49:16 Attempt 1 to connect to the instance at 127.0.0.1:1433 and run sp_server_diagnostics
[supervisor] 2019/04/16 14:49:16 Synchronizing users and certificates from cert secret...
[supervisor] 2019/04/16 14:49:16 Reading cert secret for mssql1-0...
[supervisor] 2019/04/16 14:49:16 Creating login dbm-mssql1...
[health] 2019/04/16 14:49:16 Connected to the instance at 127.0.0.1:1433
[supervisor] 2019/04/16 14:49:16 Creating user dbm-mssql1...
[supervisor] 2019/04/16 14:49:17 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:17 Reading cert secret for mssql2-0...
[supervisor] 2019/04/16 14:49:17 Creating login dbm-mssql2...
[supervisor] 2019/04/16 14:49:17 Creating user dbm-mssql2...
[supervisor] 2019/04/16 14:49:18 Local certificate matches the one in the cert secret
[supervisor] 2019/04/16 14:49:18 Target AGs: [{ag1 1 false}]
[supervisor] 2019/04/16 14:49:18 There is already a pod, mssql3-0, on node worker2 in the ag ag1, this statefulset will be updated with the necessary pod anti-affinity
[supervisor] 2019/04/16 14:49:18 existingAgAffinities: map[ag-service.mssql.microsoft.com/ag1:true]
[supervisor] 2019/04/16 14:49:18 agLabelsToAdd: []
[supervisor] 2019/04/16 14:49:18 Updating statefulset mssql3
[supervisor] 2019/04/16 14:49:18 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:19 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:20 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:21 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:22 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:23 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:24 Waiting for pod to be restarted...
[supervisor] 2019/04/16 14:49:25 Waiting for pod to be restarted...

And this is the output of kubectl get all:

root@master:/home/ubuntu# kubectl get all -n ag1 
NAME                                  READY   STATUS             RESTARTS   AGE
pod/mssql-initialize-mssql1-hd6rd     0/1     Completed          0          3d20h
pod/mssql-initialize-mssql2-gd9hz     0/1     Completed          0          3d20h
pod/mssql-operator-6f9c99cc89-hzlsb   1/1     Running            15         2d1h
pod/mssql1-0                          1/2     CrashLoopBackOff   179        2d
pod/mssql2-0                          1/2     CrashLoopBackOff   165        3d20h
pod/mssql3-0                          1/2     CrashLoopBackOff   163        3d20h

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/ag1             ClusterIP   None             <none>        1433/TCP,5022/TCP   3d20h
service/ag1-primary     NodePort    10.106.244.51    <none>        1433:31080/TCP      3d20h
service/ag1-secondary   NodePort    10.105.101.171   <none>        1433:32497/TCP      3d20h
service/mssql1          NodePort    10.97.52.124     <none>        1433:31859/TCP      3d20h
service/mssql2          NodePort    10.100.173.32    <none>        1433:30943/TCP      3d20h
service/mssql3          NodePort    10.99.238.238    <none>        1433:32406/TCP      3d20h

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/mssql-operator   1/1     1            1           3d20h

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/mssql-operator-6f9c99cc89   1         1         1       3d20h

NAME                      READY   AGE
statefulset.apps/mssql1   0/1     3d20h
statefulset.apps/mssql2   0/1     3d20h
statefulset.apps/mssql3   0/1     3d20h

NAME                                COMPLETIONS   DURATION   AGE
job.batch/mssql-initialize-mssql1   1/1           5m38s      3d20h
job.batch/mssql-initialize-mssql2   1/1           5m35s      3d20h
job.batch/mssql-initialize-mssql3   1/1           5m22s      3d20h

The manifest of one of the StatefulSets:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: "2019-04-12T18:43:23Z"
  generation: 1
  labels:
    name: mssql1
    type: sqlservr
  name: mssql1
  namespace: ag1
  ownerReferences:
  - apiVersion: mssql.microsoft.com/v1
    controller: false
    kind: ReplicationController
    name: mssql1
    uid: d88e739e-5d52-11e9-9f0d-5254001850dc
  resourceVersion: "1064877"
  selfLink: /apis/apps/v1/namespaces/ag1/statefulsets/mssql1
  uid: d9c01112-5d52-11e9-9f0d-5254001850dc
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      mssql.microsoft.com/sql-instance: mssql1
  serviceName: ""
  template:
    metadata:
      creationTimestamp: null
      labels:
        ag-service.mssql.microsoft.com/ag1: ""
        mssql.microsoft.com/sql-instance: mssql1
        name: mssql1
        type: sqlservr
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: ag-service.mssql.microsoft.com/ag1
                operator: Exists
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: ACCEPT_EULA
          value: "y"
        - name: MSSQL_PID
          value: Developer
        - name: MSSQL_SA_PASSWORD
          valueFrom:
            secretKeyRef:
              key: initsapassword
              name: mssql1-statefulset-secret
        - name: MSSQL_ENABLE_HADR
          value: "1"
        image: mcr.microsoft.com/mssql/server:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 2
          successThreshold: 1
          timeoutSeconds: 1
        name: mssql-server
        ports:
        - containerPort: 1433
          name: tds
          protocol: TCP
        - containerPort: 5022
          name: dbm
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/opt/mssql
          name: instance-root
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: no-api-access
          readOnly: true
      - command:
        - /mssql-server-k8s-ag-agent-supervisor
        env:
        - name: MSSQL_K8S_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: MSSQL_K8S_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: MSSQL_K8S_SQL_SERVER_NAME
          value: mssql1
        - name: MSSQL_K8S_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        - name: MSSQL_K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MSSQL_K8S_MONITOR_POLICY
          value: "3"
        - name: MSSQL_K8S_HEALTH_CONNECTION_REBOOT_TIMEOUT
        - name: MSSQL_K8S_SKIP_AG_ANTI_AFFINITY
        - name: MSSQL_K8S_MONITOR_PERIOD_SECONDS
        - name: MSSQL_K8S_LEASE_DURATION_SECONDS
        - name: MSSQL_K8S_RENEW_DEADLINE_SECONDS
        - name: MSSQL_K8S_RETRY_PERIOD_SECONDS
        - name: MSSQL_K8S_ACQUIRE_PERIOD_SECONDS
        - name: MSSQL_K8S_SQL_WRITE_LEASE_PERIOD_SECONDS
        image: mcr.microsoft.com/mssql/ha:2019-CTP2.1-ubuntu
        imagePullPolicy: IfNotPresent
        name: mssql-ha-supervisor
        ports:
        - containerPort: 8080
          name: liveliness
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: mssql1
      serviceAccountName: mssql1
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: no-api-access
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      creationTimestamp: null
      name: instance-root
      namespace: ag1
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  collisionCount: 0
  currentReplicas: 1
  currentRevision: mssql1-795bb7f749
  observedGeneration: 1
  replicas: 1
  updateRevision: mssql1-795bb7f749
  updatedReplicas: 1
meisam bahrami
  • Can you post your statefulset yaml? – cookiedough Apr 16 '19 at 15:40
  • @cookiedough Yes, I just added it to this post – meisam bahrami Apr 17 '19 at 05:53
  • This has to be the most cluttered manifest I've seen. My guess is either the `mssql-operator` is changing something, or the liveness probe is causing the restart. Have you checked the pod events? – cookiedough Apr 17 '19 at 15:50
  • @cookiedough Why do you say "cluttered"? I just took it from the Microsoft docs. Yesterday I edited my StatefulSets and increased livenessProbe.timeoutSeconds to 100 seconds, and my pods are running now. I will wait and see what happens to them. – meisam bahrami Apr 18 '19 at 09:10
  • I'm referring to the lack of structure in the file. Increasing that number probably is not going to help in this case, but do post an update on how it turned out. – cookiedough Apr 18 '19 at 14:38
  • @cookiedough I don't know, but it is working now and no longer restarting. I believe the problem was related to those numbers. – meisam bahrami Apr 22 '19 at 11:34
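
For reference, the change described in the comments above would look roughly like the fragment below. This is only a sketch: the single edited value is timeoutSeconds (raised from 1 to 100, as the comment states), and every other field is copied unchanged from the mssql-server container in the posted manifest. Whether the short timeout was the actual root cause is not confirmed anywhere in this thread.

```yaml
# Sketch of the edited liveness probe on the mssql-server container.
# Only timeoutSeconds reflects the change described in the comments;
# all other values are copied unchanged from the manifest in the question.
livenessProbe:
  failureThreshold: 3      # consecutive failures before the container is restarted
  httpGet:
    path: /healthz
    port: 8080             # port declared by the mssql-ha-supervisor container in the same pod
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 2         # probe every 2 seconds
  successThreshold: 1
  timeoutSeconds: 100      # was 1 in the original manifest
```

With the original values, three consecutive probes that fail or take longer than one second, two seconds apart, are enough for the kubelet to restart the container, so a briefly slow /healthz endpoint could plausibly produce the restart loop shown above. The supervisor log also shows it updating the StatefulSet itself, so it is unclear whether a manual edit like this one would persist.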