
I use Helm, and I have a problem: when a pod (StatefulSet) enters CrashLoopBackOff, it never exits this state.

Even when there is a new rolling update, the StatefulSet stays in the same CrashLoopBackOff state from the old rolling update.

Question

What can I do to force the StatefulSet to start the new rolling update (or, even better, to do so gracefully)?

  • An answer for k8s-deployment will also be great!
Stav Alfi
  • Does this answer your question? [Redeploy statefulset with CrashLoopBackOff status in kubernetes](https://stackoverflow.com/questions/66612592/redeploy-statefulset-with-crashloopbackoff-status-in-kubernetes) – Matthias M Mar 06 '22 at 20:07
  • Thanks, but no, because specifying `spec.podManagementPolicy: "Parallel"` is not what I'm looking for. I need the StatefulSet pods to start one after the other. – Stav Alfi Mar 15 '22 at 05:39

2 Answers


You can force Helm to recreate resources by adding `--force` to the switches, for example:

`helm upgrade --install -n mynamespace --force myrelease ./mychart`

This will delete and recreate the resources, including the StatefulSet pods. It may (YMMV) fix your problem, or it may not; that depends on the cause of the crash loop, so you should ideally fix that before even considering forcing a new rolling update. Or at least patch the StatefulSet so it's running correctly before doing the update.
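Before reaching for `--force`, it usually helps to find out why the pod is crash-looping. A minimal sketch, assuming a hypothetical namespace `mynamespace` and pod `mypod-0` (substitute your own names):

```shell
# Show events and restart count for the crash-looping pod
kubectl -n mynamespace describe pod mypod-0

# Fetch logs from the previous (crashed) container instance,
# which usually contains the actual startup error
kubectl -n mynamespace logs mypod-0 --previous

# If you still decide to delete and recreate the resources via Helm:
helm upgrade --install -n mynamespace --force myrelease ./mychart
```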

Blender Fox
  • I don't understand your answer. On one hand, you specify the `--force` flag as an answer. On the other hand, you wrote to avoid it and to manually check the cause of the `CrashLoopBackOff` state. In my case, the cause of the `CrashLoopBackOff` state is a bug in the service-code, so I don't see a use case for infinite restarts of the same pod and to avoid new rolling-update. – Stav Alfi Mar 05 '22 at 18:28
  • Also, is there a timeout mechanism for a rolling-update? – Stav Alfi Mar 05 '22 at 18:31
  • `--force` will delete and recreate the resources, this _may_ fix your problem, but you should fix the cause of the crashloop instead. You have said you know the cause is a bug in your code, -- in that case then you won't need to use the `--force` option, you should fix the crashloop problem first such that the pods are all showing as "Running", then you can run an update normally. – Blender Fox Mar 05 '22 at 18:57
  • In response to your other question, yes. The timeout switch will time out the action (whatever the action is), and if `--atomic` is also set, Helm will try to roll back its changes. – Blender Fox Mar 05 '22 at 18:58
  • As I stated in the question, if a pod is in `CrashLoopBackOff` state, it will hang in this state forever and block future rolling-updates. So I can't "fix the crashloop problem first". That's why I opened this question in the first place. So is your answer: adding `--force`? or `--atomic --timeout`? And also, what I am describing sounds like a major bug of helm/k8s, so maybe I'm misunderstanding something here. – Stav Alfi Mar 05 '22 at 19:34
  • I don't think it's a Kubernetes or Helm bug. You tell the cluster the desired state of the application you are deploying. For example, "I want a StatefulSet of 4 replicas, using docker image `xyz`", and it will try to fulfil that request. If the docker image contains code that terminates instead of continuously running, it will try to restart it to continue to fulfil your request, and will keep trying to do this. Out of curiosity, what is your code trying to do? – Blender Fox Mar 05 '22 at 19:50
  • It's server code, and there is a bug in the startup of the server. Nothing important or interesting here. But my question is: why doesn't helm/k8s give me a chance to deploy a fix in a new rolling update to the StatefulSet? – Stav Alfi Mar 05 '22 at 20:02
  • What error do you get when trying to do the update? – Blender Fox Mar 05 '22 at 20:09
  • No error. The pods never restart with the new image version; they restart forever with the old image version in a `CrashLoopBackOff` state. – Stav Alfi Mar 06 '22 at 04:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242647/discussion-between-blender-fox-and-stav-alfi). – Blender Fox Mar 06 '22 at 09:45
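The `--timeout`/`--atomic` combination mentioned in the comments can be sketched as follows (release name, namespace, and chart path are placeholders, not from the question):

```shell
# --timeout bounds how long Helm waits for the rollout to complete;
# --atomic automatically rolls the release back if the upgrade fails
# within that window, instead of leaving it stuck in a failed state.
helm upgrade --install myrelease ./mychart \
  -n mynamespace \
  --atomic \
  --timeout 5m
```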

Assuming the installation succeeded before: you need to fix the CrashLoopBackOff first by running `helm rollback <release> --namespace <if not default> --timeout 5m0s <revision #>`; only then do you run `helm upgrade` with the new image. You can get the list of revision numbers with `helm history <release> --namespace <if not default>`.

gohm'c
  • I don't want to do it manually. I want to fix the error by upgrading the pod with a new image-version which contains the fix. I don't want to do anything manual. Is there a way to do it? – Stav Alfi Mar 06 '22 at 04:52
  • `manually` - what does this mean? Do you mean you don't want to use the `helm` command? Note your previous installation was released using the `helm` command. Or do you mean to ignore the helm history? – gohm'c Mar 06 '22 at 04:54
  • I mean that, while I understand that `helm rollback --namespace --timeout 5m0s` will revert to working pods, not doing so shouldn't block me from upgrading the StatefulSet with `helm upgrade ... new-image-version`. – Stav Alfi Mar 06 '22 at 05:05
  • It will block you because your previous attempt failed BUT was not cancelled properly. Therefore you need to roll back to the last known good state first before trying again. – gohm'c Mar 06 '22 at 05:06
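The rollback-then-upgrade flow described in this answer can be sketched as below. The release name `myrelease`, namespace, revision number, and image tag are illustrative assumptions, not values from the question:

```shell
# 1. List the release's revisions; pick the last one whose
#    STATUS column reads "deployed"
helm history myrelease --namespace mynamespace

# 2. Roll back to that known-good revision (here assumed to be 3)
#    so the release is no longer stuck in a failed state
helm rollback myrelease 3 --namespace mynamespace --timeout 5m0s

# 3. Now upgrade normally with the image that contains the fix
#    (assumes the chart exposes the tag as a value named image.tag)
helm upgrade myrelease ./mychart \
  --namespace mynamespace \
  --set image.tag=v1.2.4-fixed
```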