10

I'm running some slightly unreliable software on instances in an instance group. The software is installed and run by a startup script. Most of the time it works without issue, but about 10% of new instances run out of memory and crash because of a memory leak in the software. I can't fix the leak myself, so in the meantime I've been checking the instances every few hours and killing any that show an idle CPU (the software normally consumes all available CPU power).

However, I'm using preemptible instances, which can be killed off and restarted at any time, so dead instances are left running whenever I'm not actively monitoring them. After a day of leaving things unattended, I usually see ~80-85% CPU usage in the dashboard, meaning the remaining 15-20% is wasted on dead instances.

Is there any automated way I can kill off these dead instances? Restarting them is already handled by the instance group.

James
  • see these Q&As on the serverfault: http://serverfault.com/questions/694502/google-compute-engine-cpu-usage-alarms/694555 http://serverfault.com/questions/694532/gce-metadata-get-instance-name – Kamran May 31 '15 at 14:31

4 Answers

15

The following worked for me. It's a bash script that uses the UNIX uptime command to check, once a minute, whether the 15-minute CPU load average is below a threshold, and shuts the system down once that has been true on more than ten checks. You need to run it inside your VM instance.

Credit, and more detailed explanation: Rohit Rawat's blog.

#!/bin/bash
threshold=0.4

count=0
while true
do

  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  res=$(echo $load'<'$threshold | bc -l)
  if (( $res ))
  then
    echo "Idling.."
    ((count+=1))
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done
viswajithiii
2

It seems like there are two parts to this question:

  1. Identifying dead instances.
  2. Killing off those instances.

To identify dead instances, one option is a separate management instance that does not run this software and that keeps tabs on the other instances. For example, it could periodically send a health request to each instance and mark as unhealthy any instance that is non-responsive or reports abnormal CPU usage (here, an idle CPU).

Once your management instance has identified the unhealthy instances that need to be reset, you should be able to reset them using the API (I'm guessing the reset command) or by executing the same operation with the gcloud command-line tool.
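As a rough sketch of that idea, the script below could run on the management instance. It assumes each worker exposes an HTTP health endpoint on port 8080 at /health, which is purely hypothetical and not something from the question; `is_healthy` and `reset_unhealthy` are illustrative names as well. It lists running instances with gcloud and resets any that fail the check.

```shell
#!/bin/bash
# Sketch only: the health endpoint (port 8080, /health) is an assumption,
# and is_healthy / reset_unhealthy are hypothetical helper names.

# Succeeds if the instance at the given IP answers the health request.
is_healthy() {
  curl --silent --fail --max-time 5 "http://$1:8080/health" > /dev/null
}

# List running instances (name, zone, internal IP) and reset any that
# do not answer the health request.
reset_unhealthy() {
  gcloud compute instances list \
      --filter="status=RUNNING" \
      --format="value(name,zone.basename(),networkInterfaces[0].networkIP)" |
  while read -r name zone ip; do
    if ! is_healthy "$ip"; then
      # stdin is redirected so the command cannot swallow the instance list
      gcloud compute instances reset "$name" --zone "$zone" < /dev/null
    fi
  done
}
```

Calling `reset_unhealthy` from cron every few minutes would approximate the periodic check. If the instance group should recreate the machine instead, the reset call could be swapped for a delete.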

Michael Aaron Safyan
  • Although this answer is rather more vague than I was hoping, it did end up leading me in the right direction. I eventually wrote a small script that occasionally checks the 15 minute load average and kills the machine it's on if it drops below `0.50`. It's a bit of a kludge, but it means I don't need to run a dedicated monitoring instance. I'll mark your answer as accepted, as it is a reasonable solution, even if it wasn't what I was looking for. – James Jun 07 '15 at 16:57
2

I wish I could add this as a comment on viswajithiii's answer, but I'm just shy of the reputation needed to comment.

I found the static threshold inappropriate for cloud VMs with varying numbers of CPUs, because the load average reported by uptime scales with the number of CPUs, as discussed here.

My updated script adds two lines below the threshold assignment to scale the threshold by the number of CPUs. This lets me set a per-CPU utilization fraction that works across VMs with different numbers of CPUs.

Otherwise, the script is the same as viswajithiii's.

#!/bin/bash

threshold=0.4
# Scale the per-CPU threshold by the CPU count, since the load average grows with it.
n_cpu=$( grep -c 'model name' /proc/cpuinfo )
threshold=$( echo "$n_cpu*$threshold" | bc )

count=0
while true
do

  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  res=$(echo $load'<'$threshold | bc -l)
  if (( $res ))
  then
    echo "Idling.."
    ((count+=1))
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    sudo poweroff
  fi

  sleep 60

done
dzubke
  • One thing to note with this (and the other scripts here) is that the counter doesn't reset when the CPU usage goes above the threshold, so the counter can creep up and the machine can eventually shut down after a single period of below-threshold CPU usage. Just add `else count=0` to the first if-statement to reset the counter – Jacob Lauritzen Feb 09 '23 at 15:50
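
A minimal sketch of the reset suggested in this comment, assuming the same once-a-minute polling as the scripts above. The helper names `update_idle_count` and `watch_idle` are made up for illustration, and the load is read from /proc/loadavg rather than parsed out of uptime, which is a swapped-in technique and not what the answers use.

```shell
#!/bin/bash
# Sketch only: update_idle_count and watch_idle are hypothetical helper names.

# Print the new idle count: one more if load < threshold, otherwise 0.
update_idle_count() {
  local count=$1 load=$2 threshold=$3
  # awk does the floating-point comparison, so bc is not needed
  if awk -v l="$load" -v t="$threshold" 'BEGIN { exit !(l < t) }'; then
    echo $((count + 1))
  else
    echo 0
  fi
}

# Check the 15-minute load average once a minute and power off after
# more than ten consecutive idle checks. Defined but not called here.
watch_idle() {
  local threshold=${1:-0.4} count=0 load
  while true; do
    # third field of /proc/loadavg is the 15-minute load average
    load=$(awk '{ print $3 }' /proc/loadavg)
    count=$(update_idle_count "$count" "$load" "$threshold")
    echo "Consecutive idle checks = $count"
    if (( count > 10 )); then
      echo "Shutting down"
      sudo poweroff
    fi
    sleep 60
  done
}
```

With this version, a single busy check puts the count back to zero, so the shutdown only fires after an unbroken idle stretch. Running `watch_idle 0.4` at the end of the script would start the loop.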
0

This works without bc (which is not available in GCP Container OS). It is based on viswajithiii's answer and this post: How can I replace 'bc' tool in my bash script?

It also appends the history list to the history file before powering off. I set my threshold very low, but the load shows 0.00 even when I'm editing files via the CLI. It might work better if the instance is normally under heavy load.

#!/bin/bash
threshold=10

count=0
while true
do

  load=$(uptime | sed -e 's/.*load average: //g' | awk '{ print $3 }')
  load2=$(awk -v a="$load" 'BEGIN {print a*100}')
  echo $load2
  if [ $load2 -lt $threshold ]
  then
    echo "Idling.."
    ((count+=1))
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    history -a
    sudo poweroff
  fi

  sleep 60

done

That's not working for my low CPU usage, but this seems to:

#!/bin/bash
threshold=1

count=0
while true
do

  load=$(awk '{u=$2+$4; t=$2+$4+$5; if (NR==1){u1=u; t1=t;} else print ($2+$4-u1) * 1000 / (t-t1); }' <(grep 'cpu ' /proc/stat) <(sleep 1;grep 'cpu ' /proc/stat))
  load2=$(printf "%.0f\n" $load)  
  echo $load
  echo $load2
  if [[ $load2 -lt $threshold ]]
  then
    echo "Idling.."
    ((count+=1))
  fi
  echo "Idle minutes count = $count"

  if (( count>10 ))
  then
    echo Shutting down
    # wait a little bit more before actually pulling the plug
    sleep 300
    history -a
    sudo poweroff
  fi

  sleep 60

done

For some reason, it only works with both echo statements for the load left in.
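
To make the /proc/stat arithmetic easier to follow, here is a sketch that splits it into a function taking two samples of the `cpu` line. It uses the same fields as the script above (user + system as busy, user + system + idle as total) but multiplies by 100 to give a whole-number percentage, where the script multiplies by 1000. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch only: cpu_busy_percent and sample_cpu_busy_percent are hypothetical names.

# Given two "cpu ..." lines from /proc/stat, print the busy percentage
# between them as an integer. Fields: $2=user, $4=system, $5=idle.
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { busy = $2 + $4; total = $2 + $4 + $5 }
    NR == 1 { busy1 = busy; total1 = total }
    NR == 2 {
      if (total > total1)
        printf "%.0f\n", (busy - busy1) * 100 / (total - total1)
      else
        print 0   # no ticks elapsed, avoid dividing by zero
    }'
}

# Take two samples one second apart and print the busy percentage.
sample_cpu_busy_percent() {
  local first second
  first=$(grep '^cpu ' /proc/stat)
  sleep 1
  second=$(grep '^cpu ' /proc/stat)
  cpu_busy_percent "$first" "$second"
}
```

Because the result is already an integer, it can go straight into a test such as `[[ $(sample_cpu_busy_percent) -lt $threshold ]]` without a separate printf conversion.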

credits:

How to get overall CPU usage (e.g. 57%) on Linux
https://unix.stackexchange.com/questions/89712/how-to-convert-floating-point-number-to-integer

FYI: according to here, GCP monitoring agent is not available for N type instances: Google Cloud Platform: how to monitor memory usage of VM instances

Put this in a startup script in /etc/my_init.d and make it executable:

sudo mkdir /etc/my_init.d
sudo mv autooff.sh /etc/my_init.d/autooff.sh
sudo chmod 755 /etc/my_init.d/autooff.sh

Actually, that directory is being deleted. Instead, edit the instance and add a Custom Metadata entry with the key startup-script and the value #! /bin/bash \n~./autooff.sh
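
The same metadata key can probably also be set from the command line rather than the console. The sketch below wraps `gcloud compute instances add-metadata` in a hypothetical helper, `set_startup_script`; the instance name, zone and script path in the usage line are placeholders, not values from this answer.

```shell
#!/bin/bash
# Sketch only: set_startup_script is a hypothetical helper name.

# Upload a local script file as the instance's startup-script metadata.
# Arguments: instance name, zone, path to the script file.
set_startup_script() {
  local instance=$1 zone=$2 script=$3
  gcloud compute instances add-metadata "$instance" \
    --zone "$zone" \
    --metadata-from-file "startup-script=$script"
}
```

For example, `set_startup_script my-vm us-central1-a ./autooff.sh` would attach the script to an instance called my-vm, and it would then run on the next boot.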

alchemy