I've got a Kubeflow k8s cluster with a custom GPU-powered preemptible node pool in us-central1-a.

I run a Kubeflow notebook server on these GPU nodes. For some mysterious reason, the nodes get a compute.instances.preempted message very soon after start (5-10 minutes).

Why is this happening?

robsiemb
orkenstein
  • For how much time did you see this? Did you try the same thing with normal (non preemptible) instances? – night-gold Nov 06 '19 at 14:57
  • @night-gold I see it quite often, sometimes 10 minutes after node creation. The notebook server pod triggers the node pool autoscaler, then after a short time the node gets preempted. Non-preemptible nodes look OK. – orkenstein Nov 06 '19 at 18:37

1 Answer

Since you have created a pool of preemptible nodes, this is pretty much expected behavior. GCE can terminate preemptible instances at any time. The only real guarantees are that you won't be charged for an instance that is preempted within its first minute (though you will still be charged for any premium OS you requested, and COS is not one), and that instances will always be preempted after 24 hours.

GPU nodes are likely to be in high demand, and as with other preemptible instances, how long they survive depends on the particular zone and time of day. If you need the instances to stay available, you should use full price instances. Using GKE, there is a way to autoscale GPU nodes to help control costs.
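
If you stay on preemptible nodes, the workload has to notice a preemption and react to it (for example, by saving notebook state). Here is a minimal sketch, not a definitive implementation, that reads the `preempted` value from the Compute Engine metadata server. The function names and the injectable `fetch` parameter are my own assumptions for illustration, and the real request only works from inside a GCE VM.

```python
import urllib.request

# Compute Engine metadata endpoint for the preemption flag.
# It reads "TRUE" once the instance has been preempted, "FALSE" otherwise.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)


def fetch_preempted_flag(url=PREEMPTED_URL, timeout=2.0):
    """Read the raw flag from the metadata server (only reachable on a GCE VM)."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def is_preempted(fetch=fetch_preempted_flag):
    """Return True if the flag reads TRUE.

    `fetch` is a hypothetical hook so the parsing can be exercised
    without network access.
    """
    return fetch().strip().upper() == "TRUE"
```

A process on the node could call `is_preempted()` periodically and checkpoint its work when it returns `True`. This only tells you that a preemption is happening, not why.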

robsiemb
  • Sounds reasonable. Is there any way to diagnose the cause of a particular preemption? – orkenstein Nov 06 '19 at 18:43
  • Not really, but you can expect the cause to be a variant of "GCP had a customer willing to pay full price for these resources, and so you got the boot." – robsiemb Nov 06 '19 at 18:44
  • It's not super uncommon to get ZONE_RESOURCE_POOL_EXHAUSTED errors for GPUs even at full price (see, for example, [this question](https://stackoverflow.com/q/52586941/3399890)), so it's not surprising that they'd be hard to hang on to as preemptible. If you use preemptible instances, you need to be ready for them to go away and come back frequently. – robsiemb Nov 06 '19 at 18:58
  • I've just tried with a fresh node pool and the node got preempted 2 minutes after creation. This looks really strange. – orkenstein Nov 06 '19 at 19:00
  • It's not strange -- if you need high demand resources during peak times, pay full price. If you need to be guaranteed to bring up instances, consider [reserving them](https://cloud.google.com/compute/docs/instances/reserving-zonal-resources) -- but you'll pay for as long as the reservation exists. – robsiemb Nov 06 '19 at 19:02
  • As you said in the comment on your question, if it's OK with non-preemptible instances, then the most probable cause is that there are not enough resources to keep your instances alive. – night-gold Nov 06 '19 at 20:15