
EDIT: Working script below

I have used this site MANY times to get answers, but I am a little stumped with this.

I am tasked with writing a script, in bash, to log into roughly 2000 Unix servers (Solaris, AIX, Linux) and check the size of OS filesystems, most notably /var, /usr, and /opt.

I have set some variables, which may be where I am going wrong right off the bat.

1.) First I connect to another server that has a list of all hosts in the infrastructure. Then I parse this data with some sed commands to get a list I can use properly.

2.) Then I do a ping test to see whether the server is alive or has been decommissioned. The idea is that if the server is not pingable, I don't want it reported on, or any attempt made to connect to it, as that just wastes time. I feel I am doing this wrong, but don't know how to do it correctly (a recurring theme you will hear in this post lol)

If any FS is over the 80% mark, then it should be output to a text file with the servername, filesystem, and size on one line <== very important for me

If the FS is under 80% full, then I don't want it in my output; it can be omitted completely.

I have created something that I will post below, and am hoping to get some help figuring out where I am going wrong. I am very new to bash scripting, but have experience as a Unix admin (I have never been good at scripting).

Can anyone provide some direction and teach me where I am going wrong?

I will upload my script once I can confirm it is working, hopefully tomorrow. Thanks everyone for your input on this!

Jay Jay
  • If I run my script in one window and tail the output file in another, I can see that servers that are not pingable appear in the output file. However, none of the servers that are a.) alive and b.) have an FS over 80% appear in there. The file contains: `Server VARfs USRfs OPTfs`, `clusternode1a is offline`, `apache3 is offline`, `foaedf34 is offline`, `etrgpu09 is offline`, `fotwc31r is offline` – Jay Jay Aug 23 '16 at 14:24
  • If you need to run this in multiple operating systems, you probably can't rely on `/bin/bash` existing everywhere. Write a POSIX compatible shell script instead. – ghoti Aug 23 '16 at 14:25
  • Thanks ghoti bash exists everywhere in our infrastructure. – Jay Jay Aug 23 '16 at 14:27
  • Fair enough. Are you opposed to caching operating system information on your management machine, or is it likely that a machine might switch from one operating system to another between runs of your disk space checker? – ghoti Aug 23 '16 at 14:29
  • Also, have you thought of running a tool actually designed for this sort of thing, like perhaps [Munin](http://munin-monitoring.org/)? – ghoti Aug 23 '16 at 14:30
  • By caching this information on the management server, how will that benefit? I am not allowed under change management rules to modify any systems without a valid change record, which is a lengthy process. – Jay Jay Aug 23 '16 at 14:30
  • You run `ssh` to determine the OS type before every check. That's wasteful of both compute and network resources. Store the info in a text file, then select your commands based on what's already known. – ghoti Aug 23 '16 at 14:31
  • I agree it is wasteful in many regards but this is the way I have been taught. I am trying to learn more effective and efficient scripting. I am very new at this, so I am taking things I have learned, and applying them to new scripts that I am asked to make. Please teach me a better way :) – Jay Jay Aug 23 '16 at 14:33
  • The only reason I do the ssh so many times is because the output of df differs between Solaris, AIX, and Linux, so I have a 'custom' df command for each type of OS. I know there is a better way to do this, I'm just not sure how – Jay Jay Aug 23 '16 at 14:36
  • *Then I do a ping test, to see if the server is alive.* Bad idea. There are numerous reasons why a ping to a working server can fail. That's like testing if your car runs by trying to put gas in the tank - what if it's an electric car? – Andrew Henle Aug 23 '16 at 14:43
  • the only tool I am allowed to do this with is a shell script unfortunately. – Jay Jay Aug 23 '16 at 14:43
  • *the only tool I am allowed to do this with is a shell script unfortunately.* One wonders if the person(s) responsible for such an edict would require the workers putting up their houses to not use power tools - only a manual hand saw and a small 16 oz hammer - no big framing hammers, either! – Andrew Henle Aug 23 '16 at 14:50
  • What is `grep ^adm all_vms.txt | sed -i 's/^adm//g' all_vms.txt` supposed to do? – ghoti Aug 23 '16 at 14:53
  • `grep ^adm all_vms.txt | sed -i 's/^adm//g' all_vms.txt` removes the 'adm' from the beginning of some select hostnames. The hostnames are admserver1, admserver2, admserver3 as an example; however, those aren't reachable via our DNS, so I have to remove the 'adm' at the beginning (I don't know why it is set up this way, it just is) – Jay Jay Aug 23 '16 at 14:56

3 Answers


Some trouble is here:

ping -c 1 -W 3 $i > /dev/null 2>&1
    if [ $? -ne 0 ]; then
            echo "$i is offline" >> $LOG
    fi

You need a continue statement inside that if. Your program isn't really treating non-pingable hosts differently, just logging they're not pingable.
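A minimal sketch of where the `continue` goes, assuming the reachability test is wrapped in a function so the loop can be tried without a network. `is_alive`, `check_hosts`, and the host names are hypothetical; in the real script `is_alive` would be the `ping -c 1 -W 3` call and the log lines would be appended to `$LOG`.

```bash
#!/bin/bash
# Hypothetical stand-in for: ping -c 1 -W 3 "$1" > /dev/null 2>&1
# Here any host whose name starts with "dead" counts as unreachable.
is_alive() {
    case "$1" in
        dead*) return 1 ;;
        *)     return 0 ;;
    esac
}

check_hosts() {
    for i in "$@"; do
        if ! is_alive "$i"; then
            echo "$i is offline"
            continue    # skip the rest of the loop body for this host
        fi
        # The ssh/df checks would go here; they only run for live hosts.
        echo "$i would be checked over ssh"
    done
}

check_hosts web1 dead2 db3
```

Without the `continue`, the "would be checked" line runs for `dead2` as well, which is the behaviour the original script has.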

Okay, now I'm looking a little deeper, and there's more naive stuff in here. These shouldn't work:

SOLVARFS=$(df -h /var |cut -f5 |grep -v capacity |awk '{print $5}')
SOLUSRFS=$(df -h /usr |cut -f5 |grep -v capacity |awk '{print $5}')
SOLOPTFS=$(df -h /opt |cut -f5 |grep -v capacity |awk '{print $5}')

etc...

The problem with these lines is, the command substitution gets assigned to the variables before the ssh session happens. So the content of each variable is the command's result on your local system, not the command itself. Since you're doing command substitution around your ssh calls, it might well work just to rewrite these lines as (note the backslash escapes on $5):

SOLVARFS="df -h /var |cut -f5 |grep -v capacity |awk '{print \$5}'"
SOLUSRFS="df -h /usr |cut -f5 |grep -v capacity |awk '{print \$5}'"
SOLOPTFS="df -h /opt |cut -f5 |grep -v capacity |awk '{print \$5}'"

etc...
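The difference between the two forms can be seen locally. This is only an illustration: the `echo` pipelines are made-up stand-ins for `df`, and `sh -c` stands in for `ssh`, since both hand the stored string to a shell for execution.

```bash
# Command substitution runs immediately, on the local machine;
# the variable ends up holding the command's output.
LOCAL_RESULT=$(echo 'a b c d 42%' | awk '{print $5}')

# Double quotes with an escaped \$5 store the command text itself.
REMOTE_CMD="echo 'a b c d 91%' | awk '{print \$5}'"

echo "LOCAL_RESULT = $LOCAL_RESULT"
echo "REMOTE_CMD   = $REMOTE_CMD"

# sh -c stands in for ssh here: the stored text is only executed now.
echo "run later    = $(sh -c "$REMOTE_CMD")"
```

The backslash matters: without it, the local shell would expand `$5` (normally empty) while building the string, and the remote `awk` would receive `{print }`.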

The part where you're contacting another server has some more stuff to correct. You don't need three if statements per server, and there's no reason to echo anything to /dev/null. Here's a rewrite for the SunOS section. For each directory you're checking, it outputs the host name, the command name (so you can see which dir was being checked), and the result:

if [[ $UNAME = "SunOS" ]]; then
    for SSH_COMMAND in SOLVARFS SOLUSRFS SOLOPTFS ; do
        RESULT=$(ssh -o PasswordAuthentication=no -o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=2 -o GSSAPIAuthentication=no -q "$i" "${!SSH_COMMAND}")
        RESULT=${RESULT%\%}    # strip the trailing % so -gt sees a plain number
        if [ "$RESULT" -gt 80 ] ; then
            echo "$i, $SSH_COMMAND, $RESULT" >> $LOG
        fi
    done
fi

Note that the ${!BLAH} construction is variable indirection. "Give me the contents of the variable named by BLAH".
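A small standalone illustration of indirection (Bash only). The variable names `VARFS` and `USRFS` are made up for this sketch and hold plain strings, so nothing is executed.

```bash
#!/bin/bash
VARFS="df -h /var"
USRFS="df -h /usr"

for NAME in VARFS USRFS; do
    # $NAME is the name of a variable; ${!NAME} is that variable's contents.
    echo "$NAME -> ${!NAME}"
done
```

This is why the loop can log `$SSH_COMMAND` as a readable label while passing `${!SSH_COMMAND}` to ssh as the command to run.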

Juan Tomas
  • *You need a `continue` statement inside that if. Your program isn't really treating non-pingable hosts differently, just logging they're not pingable.* Not treating non-pingable hosts differently is good - just because they don't respond to pings doesn't mean they're not up. It just means they don't respond to pings. – Andrew Henle Aug 23 '16 at 14:48
  • Thanks for the help, is this how it should be SOLVARFS="df -h /var |cut -f5 |grep -v capacity |awk '{print \$5}'" SOLUSRFS="df -h /usr |cut -f5 |grep -v capacity |awk '{print \$5}'" SOLOPTFS="df -h /opt |cut -f5 |grep -v capacity |awk '{print \$5}'" LINVARFS="df -hP /var |awk '{ print $5 }' |grep -v Use%" LINUSRFS="df -hP /usr |awk '{ print $5 }' |grep -v Use%" LINOPTFS="df -hP /opt |awk '{ print $5 }' |grep -v Use%" AIXVARFS="df /var |awk '{ print $4 }' |grep -v %Used" AIXUSRFS="df /usr |awk '{ print $4 }' |grep -v %Used" AIXOPTFS="df /opt |awk '{ print $4 }' |grep -v %Used" – Jay Jay Aug 23 '16 at 14:51
  • @Andrew OP specifically wanted to treat hosts differently if they're non-pingable. Yes, in a general sense, non-pingableness doesn't prove much. But it's the criterion for the OP, and it's likely to be reliable within the OP's controlled network environment. – Juan Tomas Aug 23 '16 at 15:05
  • Hi Juan thanks so much what is SSH_COMMAND where is that defined? – Jay Jay Aug 23 '16 at 15:19
  • It's defined in the `for` loop. It will be set to `SOLVARFS`, `SOLUSRFS`, and `SOLOPTFS` in order (note: not set to the contents of those variables, just the variable names). Then you can access the actual commands by `${!SSH_COMMAND}`. – Juan Tomas Aug 23 '16 at 15:25

Here is my "disk usage" Linux script; I hope it helps you.

#!/bin/sh

# NR>1 skips the df header line, which would otherwise break the numeric test.
df -H | awk 'NR>1 { print $5 " " $6 }' | while read -r output;
do
  echo "$output"
  # First field is the usage with its % stripped, second is the mount point.
  usep=$(echo "$output" | awk '{ print $1 }' | cut -d'%' -f1)
  partition=$(echo "$output" | awk '{ print $2 }')
  if [ "$usep" -ge 90 ]; then
    echo "Running out of space \"$partition ($usep%)\" on $(hostname) as on $(date)" |
     mail -s "Warning! There is no space on the disk: $usep%" root@domain.com
  fi
done
szpal
  • Note that you can use parameters to specify the columns in `df`: [How to select a particular column in linux df command](http://stackoverflow.com/a/28809214/1983854). Not in all the versions, though. – fedorqui Aug 23 '16 at 14:36
  • @fedorqui - note that in this case, the df command is being run on three different operating systems, so Linux-only options won't help. – ghoti Aug 23 '16 at 15:42
  • @ghoti thanks for the comment. I hadn't dug very much in the question, so I wasn't aware of this. – fedorqui Aug 23 '16 at 15:44

Your original script does a bunch of things less-than-optimally. Rather than running an almost-identical block of code for each filesystem and each operating system, the thing to do would be to record the differences in a way that a SINGLE piece of code can iterate over all your objects, adapting as required.

Here's my take on this. Commands should appear ONCE, but

  • they get run multiple times by loops, and
  • they get run multiple ways using arrays.

The following script passes lint checks, but obviously this is untested, as I don't have your environment to test in. You might still want to think about how your logging and notifications work.

#!/bin/bash

# Assign temp file, remove it automatically upon successful exit.
tmpfile=$(mktemp /tmp/${0##*/}.XXXX)
trap "rm '$tmpfile'" 0

#NOW=$(date +"%Y-%m-%d-%T")
NOW=$(date +"%F")

LOG=/usr/scripts/disk_usage/Unix_df_issues-$NOW.txt
printf '' > "$LOG"

# Use variables to refer to commonly accessed files. If you change a name, just do it once.
rawhostlist=all_vms.txt
host_os=${rawhostlist}_OS

# Commonly-used options need only be declared once. Use an array for easier management.
declare -a ssh_opts=()
ssh_opts+=(-o PasswordAuthentication=no)
ssh_opts+=(-o BatchMode=yes)
ssh_opts+=(-o StrictHostKeyChecking=no) # Eliminate prompts on new hosts
ssh_opts+=(-o ConnectTimeout=2)         # This should make your `ping` unnecessary.
ssh_opts+=(-o GSSAPIAuthentication=no)  # This is default. Do we really need it?

# Note: Associative arrays require Bash 4.x.
declare -A df_opts=(
  [SunOS]="-h"
  [Linux]="-hP"
  [AIX]=""
)
declare -A df_column=(
  [SunOS]=5
  [Linux]=5
  [AIX]=4
)

# Fetch host list from configserver, stripping /^adm/ on the remote end.
ssh "${ssh_opts[@]}" -q configserver "sed 's/^adm//' /reports/*/HOSTNAME" > "$rawhostlist"

# Confirm that our host_os cache is up to date and process any missing hosts.
touch "$host_os"   # Make sure the cache file exists on the first run so awk can read it.
awk '
  NR==FNR { h[$1]; next }   # Add everything in rawhostlist to an array...
  { delete h[$1] }          # Then remove any entries that exist in host_os.
  END {
    for (i in h) print i    # And print whatever remains.
  }' "$rawhostlist" "$host_os" |
    while read -r h; do
      # -n keeps ssh from reading stdin, which would swallow the rest of the host list.
      printf '%s\t%s\n' "$h" "$(ssh -n "${ssh_opts[@]}" -q "$h" uname -s)"
    done >> "$host_os"

# Next, step through the host list and collect data.
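while read -r host os; do
  # -n keeps ssh off the loop's stdin; awk strips the % so the numeric tests below work.
  ssh -n "${ssh_opts[@]}" "$host" df "${df_opts[$os]}" /var /usr /opt |
    awk -v column="${df_column[$os]}" -v host="$host" 'NR>1 { sub(/%/, "", $column); print host,$1,$column }'
done < "$host_os" > "$tmpfile"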

# Now that we have all our data, check for warning/critical levels.
while read host filesystem usage; do
  if [ "$usage" -gt 80 ]; then
    status="CRITICAL"
  elif [ "$usage" -gt 70 ]; then
    status="WARNING"
  else
    continue
  fi
  # Log our results to our log file, AND send them to stderr.
  printf "[%s] %s: %s:%s at %d%%\n" "$(date +"%F %T")" "$status" "$host" "$filesystem" "$usage" | tee -a "$LOG" >&2
done < "$tmpfile"

# Email and record our results.
if [ -s "$LOG" ]; then
  mail -s "Daily Unix  /var Report - $NOW" unixsystems@examplle.com < "$LOG"
  mv "$LOG" /var/log/vm_reports/
fi

Consider this example code. If you like the way it looks, your next task is to debug it, or open new questions for parts that you're having trouble debugging. :-)
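One way to start debugging is to exercise the table-driven part offline. The sketch below reuses the `df_column` idea but feeds it canned text instead of ssh output. The `over_limit` function, the sample `df` listings, and the pairing of host names with operating systems are all invented for illustration, so check real `df` output on each platform before relying on the column numbers.

```bash
#!/bin/bash
# Associative arrays require Bash 4.x.
declare -A df_column=( [SunOS]=5 [Linux]=5 [AIX]=4 )

# over_limit HOST OS LIMIT
# Reads df output on stdin and prints "host filesystem usage%" on one line
# for each filesystem whose usage is above LIMIT percent.
over_limit() {
    awk -v host="$1" -v column="${df_column[$2]}" -v limit="$3" '
        NR > 1 {
            usage = $column
            sub(/%/, "", usage)          # "91%" -> "91"
            if (usage + 0 > limit) print host, $1, usage "%"
        }'
}

# Invented sample output, shaped like Linux `df -hP` and AIX `df`.
linux_df='Filesystem Size Used Avail Use% Mounted on
/dev/sda2 20G 18G 2G 91% /var
/dev/sda3 20G 5G 15G 25% /usr'

aix_df='Filesystem 512-blocks Free %Used Iused %Iused Mounted on
/dev/hd9var 4194304 419430 90% 9000 5% /var
/dev/hd2 8388608 4194304 50% 40000 3% /usr'

printf '%s\n' "$linux_df" | over_limit apache3 Linux 80
printf '%s\n' "$aix_df"   | over_limit fotwc31r AIX 80
```

Only the two filesystems above 80% are printed, each with host, filesystem, and usage on a single line. Supporting another operating system then means adding one entry to the array rather than another copy of the loop.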

ghoti
  • Thank you VERY much, I will play around with this and make the necessary changes :) I will report back – Jay Jay Aug 24 '16 at 12:49