Collect statistics of all shell scripts running

Question

I just started working for a company that has some hundreds of shell scripts running in batch under a "home made" scheduler; some other scripts are run manually by someone onshore/offshore

scripts were not written following any standard
scripts are located in different directories (again no standard)
scripts output files to different directories (again no standard)
etc

So I was thinking of taking the following steps to find out whether a script is overrunning, ran too fast, or generated an output file that is too small, too big, etc.:

Collect statistics
Create my checks
Alarm

**Without touching any of the existing Unix scripts, is it possible to:**

1 - Create a new shell script that will monitor all other scripts that are running and create/update the following file in real time (1 file per day)

Script Name                         pid       StartTime               EndTime                 Elapsed time       Output files  
/app/scripts/scriptA.sh  -x 222    1234     18/12/2013 12:00:00  18/12/2013 12:01:00     00:01:00         /app/data/customers222_20131218120000.dat     
                                                                                                           /app/data/temp/customers222_20131218120000.dat

/app/scripts/scriptA.sh  -x 222    2223     18/12/2013 14:00:00  18/12/2013 14:01:00     00:01:00         /app/data/customers222_20131218140000.dat     
                                                                                                           /app/data/temp/customers222_20131218140000.dat

/app/scripts/scriptA.sh  -x 333    1235     18/12/2013 12:00:00  18/12/2013 12:01:00     00:20:00         /app/data/customers222_20131218120000.dat  

/app/scripts/scriptB.sh -y 8888    1236     18/12/2013 13:00:00  18/12/2013 13:00:05     00:00:05         /app/data/suppliers888_20131318130005.dat
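
One rough way to get the StartTime and EndTime columns without touching the scripts is to take a process snapshot every few seconds and compare it with the previous one. The sketch below only does the comparison step. It assumes each snapshot is a file of "pid command line" rows, produced by something like `ps -eo pid=,args= | grep '\.sh'` (the exact `ps` options are an assumption and should be checked on AIX). The function name `diff_snapshots` is made up for illustration.

```shell
#!/bin/sh
# Sketch: detect script starts and ends by comparing two process snapshots.
# Each snapshot line is "<pid> <command line>".

# diff_snapshots PREV CUR
# Prints "START <pid> <command>" for lines that appear only in CUR and
# "END <pid> <command>" for lines that appear only in PREV.
diff_snapshots() {
    awk '
        # Lines from the first file: remember the previous snapshot.
        FILENAME == ARGV[1] { prev[$0] = 1; next }
        # Lines from the second file: remember the current snapshot.
        { cur[$0] = 1 }
        # Present now but not before: the process has started.
        !($0 in prev) { print "START", $0 }
        END {
            # Present before but not now: the process has ended.
            for (line in prev)
                if (!(line in cur)) print "END", line
        }
    ' "$1" "$2" | sort
}
```

A monitor loop would write a new snapshot, call `diff_snapshots old new`, stamp each START/END line with `date`, and then rename new to old. Scripts that start and finish between two polls are missed, and this does not capture output files.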

2 - Load monitor_running_scripts_YYYYMMDD.dat into the database to build my statistics, or maybe work with the files directly. After some days of collecting statistics I will know that

/app/scripts/scriptA.sh -x 222 outputs 2 files and its average running time is 1 min
/app/scripts/scriptA.sh -x 333 outputs 2 files and its average running time is 20 min
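
If the statistics stay in files rather than a database, the per-script averages can be computed with awk. This is a minimal sketch that assumes a simplified, hypothetical log format of one run per line: the full command line, a tab, then the elapsed time in seconds. The function name `average_runtime` is made up for illustration.

```shell
#!/bin/sh
# average_runtime LOGFILE
# LOGFILE format (assumed): <command line><TAB><elapsed seconds>
# Prints: <command line><TAB><number of runs><TAB><average seconds, rounded>
average_runtime() {
    awk -F '\t' '
        { total[$1] += $2; runs[$1]++ }
        END {
            for (cmd in total)
                printf "%s\t%d\t%d\n", cmd, runs[cmd], int(total[cmd] / runs[cmd] + 0.5)
        }
    ' "$1" | sort
}
```

Keying on the whole command line keeps `scriptA.sh -x 222` and `scriptA.sh -x 333` as separate entries, which matters here because their normal running times differ.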

3 - Create the alarm triggers

If /app/scripts/scriptB.sh took less than 1 minute to run, send an email to the support team to take a look at it
If /app/scripts/scriptB.sh took more than 5 minutes to run, send an email to the support team to take a look at it
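
The two triggers above boil down to a range check on the elapsed time. A minimal sketch, assuming the elapsed time is already available in seconds; the function name `check_runtime` and the message wording are made up for illustration:

```shell
#!/bin/sh
# check_runtime COMMAND ELAPSED_SECONDS MIN_SECONDS MAX_SECONDS
# Prints an alert line and returns 1 when ELAPSED is outside [MIN, MAX];
# prints nothing and returns 0 otherwise.
check_runtime() {
    cmd=$1
    elapsed=$2
    min=$3
    max=$4
    if [ "$elapsed" -lt "$min" ]; then
        echo "ALERT: $cmd finished in ${elapsed}s, faster than the ${min}s minimum"
        return 1
    elif [ "$elapsed" -gt "$max" ]; then
        echo "ALERT: $cmd ran for ${elapsed}s, longer than the ${max}s maximum"
        return 1
    fi
    return 0
}
```

For example, `check_runtime /app/scripts/scriptB.sh 5 60 300` prints an alert because 5 seconds is below the 1 minute minimum. The alert line could then be piped to whatever mail command is available on the machine.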

I have no problem building steps 2 and 3 as long as step 1 is in place, so I would like to hear some suggestions on how to start on step 1.

OS: AIX

Looks like I don't have atop installed here and I can't request anything to be installed on this machine. By the way, I removed the Linux tag and added "OS: AIX" – George, Dec 18 '13 at 22:19
What you propose is possible. But for step 1, you could spend a **lot** of time dealing with every possible hiccup that `ps -ef`, `top`, `?? whatever` might generate. Have you asked your boss "how much of my time do you want me to spend generating this system?" Her response will help you decide how much time/effort you can put into that critical first part. If your boss says 2 hrs and you think it will take 2 days to build something that is 95% accurate, then you have to have another discussion with your boss: trade-offs, trade-offs ;-) .... Good luck! – shellter, Dec 18 '13 at 22:59

hek2mgl · Answer 1 · 2013-12-18T22:18:49.140


Basically you could just replace /bin/bash or /bin/sh or whatever shell with a script that wraps the shell execution and logs to that file. Wrapping bash could look like:

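# back up the original shell (run as root)
cp -p /bin/bash /bin/bash_orig
# create the wrapper in a temporary file; the quoted 'EOF' stops the
# current shell from expanding $(date), $@ etc. while writing it
cat <<'EOF' > /bin/bash_wrapper
#!/bin/bash_orig

# get starting time
time_start=$(date)
command="$*"

# start original shell with the original arguments
/bin/bash_orig "$@"
status=$?
# $! is only set for background jobs, so log the wrapper's own pid
pid=$$

# stop time
time_stop=$(date)

# write to log (absolute example path, so all entries land in one file)
echo "$command $pid $time_start $time_stop" >> /tmp/monitor.log
# pass the script's exit status back to the caller
exit $status
EOF
chmod 755 /bin/bash_wrapper
# replace the shell in one step (overwriting a running binary in place can fail)
mv /bin/bash_wrapper /bin/bash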

If there are more shells on your system, it requires a little more attention, but this should point you in the right direction.

Once you have finished monitoring, you can disable this using:

mv /bin/bash{_orig,}

to get back the original shell.

edited Dec 18 '13 at 22:18

answered Dec 18 '13 at 22:06

hek2mgl


This doesn't log output files. Do you have `/proc/self/fd/1` and friends on AIX? See further http://stackoverflow.com/questions/1188757/getting-filename-from-file-descriptor-in-c – tripleee Dec 18 '13 at 22:26
Oops, I forgot to scroll to the right :) .. You may enhance my post and we can make it community wiki if you want. I think the basics, wrapping `sh`, should be OK, shouldn't they? – hek2mgl Dec 18 '13 at 22:38
Thanks for your reply hek2mgl. My prod and test machines are very busy environments, and since I'm not sure what exactly running the above commands could affect, I will have to read more about how exactly wrappers work, etc. This would cover the START and END date only, but it's still welcome – George Dec 18 '13 at 22:45
Yeah, it should at least point you in the right direction. Try it on a virtual machine before using it in a prod environment ;) – hek2mgl Dec 18 '13 at 22:47
Hi tripleee, I don't have permission on those files under my user or the application user (that I can sudo to) – George Dec 18 '13 at 22:53
