I have a Seldon deployment like this:

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: mlflow
spec:
  name: wines
  predictors:
    - graph:
        children: []
        implementation: MLFLOW_SERVER
        modelUri: gs://seldon-models/mlflow/elasticnet_wine
        name: classifier
      name: default
      replicas: 1     

The model is downloaded successfully from the server, but after a while the pods go into CrashLoopBackOff and restart again and again.

When I look at the logs, there are no errors: the log starts over on every restart, so all I can see is the Python packages being downloaded.

PS C:\Users\xxx\mlflow> kubectl logs -p -c wines-classifier model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Executing before-run script
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
scipy-1.1.0          | 13.2 MB   | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
openssl-1.1.1g       | 2.5 MB    | ########## | 100%
mkl_fft-1.0.6        | 135 KB    | ########## | 100%
blas-1.0             | 6 KB      | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
sqlite-3.32.3        | 1.1 MB    | ########## | 100%
numpy-1.15.4         | 34 KB     | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
certifi-2020.6.20    | 156 KB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | #########  |  91%

Now, trying again with the -p parameter, as proposed by @arghya-sadhu:

PS C:\Users\xxx\mlflow> kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier
---> Creating environment with Conda...
INFO:root:Copying contents of /mnt/models to local
INFO:root:Reading MLmodel file
INFO:root:Creating Conda environment 'mlflow' from conda.yaml
Warning: you have pip-installed dependencies in your environment file, but you do not list pip itself as one of your conda dependencies.  Conda may not use the correct pip to install your packages, and they may end up in the wrong place.  Please add an explicit pip dependency.  I'm adding one for you, but still nagging you.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

Downloading and Extracting Packages
scikit-learn-0.19.1  | 3.9 MB    | ########## | 100%
ncurses-6.2          | 817 KB    | ########## | 100%
_libgcc_mutex-0.1    | 3 KB      | ########## | 100%
zlib-1.2.11          | 103 KB    | ########## | 100%
tbb4py-2020.0        | 209 KB    | ########## | 100%
setuptools-47.3.1    | 514 KB    | ########## | 100%
libedit-3.1.20191231 | 167 KB    | ########## | 100%
tbb-2020.0           | 1.1 MB    | ########## | 100%
xz-5.2.5             | 341 KB    | ########## | 100%
mkl_random-1.0.1     | 324 KB    | ########## | 100%
libgcc-ng-9.1.0      | 5.1 MB    | ########## | 100%
python-3.6.9         | 30.2 MB   | ########## | 100%
libgfortran-ng-7.3.0 | 1006 KB   | ########## | 100%
libffi-3.2.1         | 40 KB     | ########## | 100%
mkl-2018.0.3         | 126.9 MB  | ########## | 100%
libstdcxx-ng-9.1.0   | 3.1 MB    | ########## | 100%
readline-7.0         | 324 KB    | ########## | 100%
intel-openmp-2019.4  | 729 KB    | ########## | 100%
tk-8.6.10            | 3.0 MB    | ########## | 100%
pip-20.1.1           | 1.8 MB    | ########## | 100%
numpy-base-1.15.4    | 3.4 MB    | ########## | 100%
wheel-0.34.2         | 51 KB     | ########## | 100%
scipy-1.1.0          | 13.2 MB   | #########3 |  93%

And the description of the pod:

PS C:\Users\ivarea\repo\smartgraph\mlflow-v2> kubectl describe pod model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Name:         model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp
Namespace:    default
Priority:     0
Node:         mlops-control-plane/172.19.0.2
Start Time:   Thu, 25 Jun 2020 10:08:20 +0200
Labels:       app=model-a-wines-classifier-0-wines-classifier
              fluentd=true
              pod-template-hash=5b8bc7889d
              seldon-app=model-a-wines-classifier
              seldon-app-svc=model-a-wines-classifier-wines-classifier
              seldon-deployment-id=model-a
              version=wines-classifier
Annotations:  prometheus.io/path: /prometheus
              prometheus.io/scrape: true
Status:       Running
IP:           10.244.0.17
IPs:
  IP:           10.244.0.17
Controlled By:  ReplicaSet/model-a-wines-classifier-0-wines-classifier-5b8bc7889d
Init Containers:
  wines-classifier-model-initializer:
    Container ID:  containerd://6a3b158cf4218f8c177f6d18eb5d0387946bf9cc36f1173754b68a029483da8b
    Image:         gcr.io/kfserving/storage-initializer:0.2.2
    Image ID:      gcr.io/kfserving/storage-initializer@sha256:7a7d3cf4c5121a3e6bad0acc9e88bbdfa9c7f774d80bd64d8e35a84dcfef8890
    Port:          <none>
    Host Port:     <none>
    Args:
      gs://seldon-models/mlflow/model-a
      /mnt/models
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Jun 2020 10:08:24 +0200
      Finished:     Thu, 25 Jun 2020 10:08:47 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /mnt/models from wines-classifier-provision-location (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Containers:
  wines-classifier:
    Container ID:   containerd://536753d25877994a17d1f1a63bbaf8717dc9180b80f061152688e4c8504c8468
    Image:          seldonio/mlflowserver_rest:0.5
    Image ID:       docker.io/seldonio/mlflowserver_rest@sha256:0fd54a0a314fafc82c490c91df0c4776be454702a307b4b76e12ed6958b4ee00
    Ports:          6000/TCP, 9000/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 25 Jun 2020 10:23:28 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 25 Jun 2020 10:19:09 +0200
      Finished:     Thu, 25 Jun 2020 10:20:41 +0200
    Ready:          False
    Restart Count:  7
    Liveness:       tcp-socket :http delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :http delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      PREDICTIVE_UNIT_SERVICE_PORT:          9000
      PREDICTIVE_UNIT_ID:                    wines-classifier
      PREDICTIVE_UNIT_IMAGE:                 seldonio/mlflowserver_rest:0.5
      PREDICTOR_ID:                          wines-classifier
      PREDICTOR_LABELS:                      {"version":"wines-classifier"}
      SELDON_DEPLOYMENT_ID:                  model-a
      PREDICTIVE_UNIT_METRICS_SERVICE_PORT:  6000
      PREDICTIVE_UNIT_METRICS_ENDPOINT:      /prometheus
      PREDICTIVE_UNIT_PARAMETERS:            [{"name":"model_uri","value":"/mnt/models","type":"STRING"}]
    Mounts:
      /etc/podinfo from podinfo (rw)
      /mnt/models from wines-classifier-provision-location (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
  seldon-container-engine:
    Container ID:  containerd://938e8f7e3ac23355c8a7a475b71ab54b858aff5ca485f26b99feaba09bb60069
    Image:         docker.io/seldonio/seldon-core-executor:1.1.0
    Image ID:      docker.io/seldonio/seldon-core-executor@sha256:661173fcbc6cb4e9b56db353b19e97d04d9c086e9dc445217f84dc1721bdf894
    Ports:         8000/TCP, 8000/TCP, 5001/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      --sdep
      model-a
      --namespace
      default
      --predictor
      wines-classifier
      --http_port
      8000
      --grpc_port
      5001
      --transport
      rest
      --protocol
      seldon
      --prometheus_path
      /prometheus
    State:          Running
      Started:      Thu, 25 Jun 2020 10:08:51 +0200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      100m
    Liveness:   http-get http://:8000/live delay=20s timeout=60s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8000/ready delay=20s timeout=60s period=5s #success=1 #failure=3
    Environment:
      ENGINE_PREDICTOR:  <binary ommited>
      REQUEST_LOGGER_DEFAULT_ENDPOINT_PREFIX:  http://default-broker.
      SELDON_LOG_MESSAGES_EXTERNALLY:          false
    Mounts:
      /etc/podinfo from podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6vqwk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  wines-classifier-provision-location:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  default-token-6vqwk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6vqwk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                          Message
  ----     ------     ----                 ----                          -------
  Normal   Scheduled  <unknown>            default-scheduler             Successfully assigned default/model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp to mlops-control-plane
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "gcr.io/kfserving/storage-initializer:0.2.2" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier-model-initializer
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier-model-initializer
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "seldonio/mlflowserver_rest:0.5" already present on machine
  Normal   Created    15m                  kubelet, mlops-control-plane  Created container wines-classifier
  Normal   Started    15m                  kubelet, mlops-control-plane  Started container wines-classifier
  Normal   Pulled     15m                  kubelet, mlops-control-plane  Container image "docker.io/seldonio/seldon-core-executor:1.1.0" already present on machine
  Normal   Created    14m                  kubelet, mlops-control-plane  Created container seldon-container-engine
  Normal   Started    14m                  kubelet, mlops-control-plane  Started container seldon-container-engine
  Warning  Unhealthy  14m (x8 over 14m)    kubelet, mlops-control-plane  Readiness probe failed: dial tcp 10.244.0.17:9000: connect: connection refused
  Warning  Unhealthy  28s (x171 over 14m)  kubelet, mlops-control-plane  Readiness probe failed: HTTP probe failed with statuscode: 503

How can I disable restarting so I can inspect logs to see the actual error?

Israel Varea

2 Answers

The default liveness and readiness probes probably do not leave the classifier container enough time to finish installing its dependencies. Before the container has finished starting up, Kubernetes kills and restarts it because the liveness probe failed (a failing readiness probe on its own only marks the pod as not ready).

In my case I had to add the following to the Seldon deployment declaration to relax the probes (you can of course adjust the values):

apiVersion: machinelearning.seldon.io/v1alpha2
kind: SeldonDeployment
metadata:
  name: ...
spec:
  name: ...
  predictors:
    - graph:
        ...
      name: ...
      replicas: ...
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                readinessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
                livenessProbe:
                  failureThreshold: 10
                  initialDelaySeconds: 120
                  periodSeconds: 30
                  successThreshold: 1
                  tcpSocket:
                    port: 9000
                  timeoutSeconds: 3
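
As a rough sanity check on these values, the sketch below estimates how long a container that never opens its port survives before the liveness probe triggers a restart. It assumes the first probe runs about initialDelaySeconds after the container starts and that failureThreshold consecutive failures are needed; it ignores timeoutSeconds, kubelet jitter and the termination grace period, so treat the result as an approximation. The function name is hypothetical; the numbers come from the describe output in the question and from the YAML above.

```python
def liveness_kill_window(initial_delay_s, period_s, failure_threshold):
    """Estimate (earliest, latest) seconds after container start at which a
    liveness probe that never succeeds would cause a restart.

    Assumption: the first probe fires somewhere between initial_delay_s and
    initial_delay_s + period_s, then repeats every period_s seconds.
    """
    earliest = initial_delay_s + (failure_threshold - 1) * period_s
    latest = initial_delay_s + failure_threshold * period_s
    return earliest, latest


# Probe shown in the question: delay=60s period=5s #failure=3
default_window = liveness_kill_window(60, 5, 3)

# Probe from the YAML above: initialDelaySeconds=120, periodSeconds=30, failureThreshold=10
relaxed_window = liveness_kill_window(120, 30, 10)

print(default_window)  # (70, 75)
print(relaxed_window)  # (390, 420)
```

Under these assumptions the default probe leaves a little over a minute for the Conda environment to be built, while the relaxed probe leaves roughly six to seven minutes.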

michcio1234

Use the -p flag, as in the example command below, to check the logs of the previously terminated container (here an example container named ruby in an example pod named web-1):

kubectl logs -p -c ruby web-1

Check the events with kubectl get events.

Use kubectl describe pod podname to check what might have caused the crash loop.
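
When reading the describe output, the exit code of the last terminated container is a useful hint. By the usual shell and container-runtime convention, an exit code above 128 means the process was killed by signal number (code - 128). A minimal sketch of decoding it, assuming a POSIX system (the helper name is hypothetical):

```python
import signal


def describe_exit_code(exit_code):
    """Translate a container exit code into a short human-readable hint."""
    if exit_code > 128:
        try:
            # Codes above 128 conventionally mean "killed by signal (code - 128)".
            sig = signal.Signals(exit_code - 128)
        except ValueError:
            return f"exited with status {exit_code} (unknown signal)"
        return f"killed by {sig.name}"
    return f"exited with status {exit_code}"


print(describe_exit_code(137))
```

For the exit code 137 in the question this gives SIGKILL (137 - 128 = 9), which fits a container killed from outside, for example by the kubelet after failed liveness probes, rather than a Python error inside the model server. A SIGKILL can also come from an out-of-memory kill, but in that case the describe output normally reports the reason as OOMKilled, not Error.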

Arghya Sadhu
  • I tried `kubectl logs -p model-a-wines-classifier-0-wines-classifier-5b8bc7889d-5t7wp wines-classifier`, but it shows an incomplete previous log (see the additional info added to the question). Is there any way to keep the pod from restarting? – Israel Varea Jun 25 '20 at 08:27