We wrote a monitoring system made up of agents. Each agent runs on a different server and monitors that server's resources (RAM, CPU, SQL Server status, replication status, free disk space, Internet access, specific business metrics, etc.).
The agents report every measure they take to a central database where these "observations" are stored.
For example, every few seconds an agent stores in the central database a specific business metric called "unprocessed_files" with its corresponding value:
(unprocessed_files, 41)
That value is constantly being written to our DB (along with many others, as explained above).
We are now implementing a client application, a screen that displays the status of everything we monitor. So, how can we calculate what is a "normal" value and what is an abnormal one?
For example, we know that if our servers are working correctly, unprocessed_files will always be close to 0, but maybe (we don't know yet) 45 is an acceptable value.
So the question is: should we use the standard deviation to determine the acceptable range of values?
ACCEPTABLE_RANGE = AVG(value) +- STDDEV(value) ?
We would like to highlight a metric in red when something is not going well.
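To make the idea concrete, here is a minimal Python sketch of the rule we have in mind: a value is flagged red when it falls outside the mean plus or minus k standard deviations of the stored observations. The function names (`acceptable_range`, `status_color`), the multiplier `k`, and the sample history are all hypothetical and made up for illustration; they are not taken from our real system or data.

```python
from statistics import mean, stdev


def acceptable_range(values, k=1.0):
    """Return (low, high) as mean +/- k * sample standard deviation.

    `values` is the history of one metric; `k` is a hypothetical
    multiplier (k=1.0 matches AVG(value) +- STDDEV(value)).
    """
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation; needs at least 2 values
    return avg - k * sd, avg + k * sd


def status_color(value, values, k=1.0):
    """Return "green" if value is inside the acceptable range, else "red"."""
    low, high = acceptable_range(values, k)
    return "green" if low <= value <= high else "red"


# Made-up history for "unprocessed_files", for illustration only.
history = [0, 1, 0, 2, 1, 0, 3, 1]

print(acceptable_range(history))
print(status_color(41, history))
```

With a history that stays close to 0, an observation such as 41 lands far outside the range and would be shown in red. Raising `k` widens the range, so the choice of `k` is part of what we are asking about.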