
I am working to take the output of sar and calculate the standard deviation of a column. I can do this successfully on a file that contains only the single column of values. However, when I calculate the same column from the full file, skipping the 'bad' lines (the title lines and the Average line) inside awk, I get a different value.

Here are the files I am performing this on:

/tmp/saru.tmp

# cat /tmp/saru.tmp
Linux 2.6.32-279.el6.x86_64 (progserver)        09/06/2012      _x86_64_        (4 CPU)

11:09:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:10:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:11:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:12:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:13:01 PM     all      0.01      0.00      0.05      0.00      0.00     99.93
11:14:01 PM     all      0.01      0.00      0.04      0.00      0.00     99.95
11:15:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:16:01 PM     all      0.01      0.00      2.64      0.01      0.01     97.33
11:17:01 PM     all      0.02      0.00     21.96      0.00      0.08     77.94
11:18:01 PM     all      0.02      0.00     21.99      0.00      0.08     77.91
11:19:01 PM     all      0.02      0.00     22.10      0.00      0.09     77.78
11:20:01 PM     all      0.02      0.00     22.06      0.00      0.09     77.83
11:21:01 PM     all      0.02      0.00     22.10      0.03      0.11     77.75
11:22:01 PM     all      0.01      0.00     21.94      0.00      0.09     77.95
11:23:01 PM     all      0.02      0.00     22.15      0.00      0.10     77.73
11:24:01 PM     all      0.02      0.00     22.02      0.00      0.09     77.87
11:25:01 PM     all      0.02      0.00     22.03      0.00      0.13     77.82
11:26:01 PM     all      0.02      0.00     21.96      0.01      0.14     77.86
11:27:01 PM     all      0.02      0.00     22.00      0.00      0.09     77.89
11:28:01 PM     all      0.02      0.00     21.91      0.00      0.09     77.98
11:29:01 PM     all      0.03      0.00     22.02      0.02      0.08     77.85
11:30:01 PM     all      0.14      0.00     22.23      0.01      0.13     77.48
11:31:01 PM     all      0.02      0.00     22.26      0.00      0.16     77.56
11:32:01 PM     all      0.03      0.00     22.04      0.01      0.10     77.83
Average:        all      0.02      0.00     15.29      0.01      0.07     84.61

/tmp/sarustriped.tmp

# cat /tmp/sarustriped.tmp                              
0.05
0.06
0.05
0.05
0.04
0.06
2.64
21.96
21.99
22.10
22.06
22.10
21.94
22.15
22.02
22.03
21.96
22.00
21.91
22.02
22.23
22.26
22.04

The Calculation based on /tmp/saru.tmp:

# awk  '$1~/^[01]/ && $6~/^[0-9]/ {sum+=$6; array[NR]=$6} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/saru.tmp
10.7126

The Calculation based on /tmp/sarustriped.tmp ( the correct one )

# awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/sarustriped.tmp
9.96397

Could someone tell me why these results are different, and whether there is a way to get the correct result with a single awk command? I am doing this for performance, so avoiding a separate command such as grep or a second awk invocation is preferable.

Thanks!

UPDATE

so I tried this ...

awk  '
  $1~/^[01]/ && $6~/^[0-9]/ {
    numrec += 1
    sum    += $6
    array[numrec] = $6
  } 
  END {
    for(x=1; x<=numrec; x++)
      sumsq += ((array[x]-(sum/numrec))^2)
    print sqrt(sumsq/numrec)
  }
' saru.tmp
 

and it works correctly for the sar -u output I was working with, and I do not see why it would not work with other 'lists'. To make it short: trying the same thing on column 5 of sar -r output, it is giving a wrong answer again. The command outputs 1.68891, but the actual deviation is 0.107374. This is the same command that worked with sar -u. If you need the files I can provide them. I was not sure how to make a new 'full' comment, so I just edited the old one. Thanks!

  • For debugging this, print out some basic data: the number of items and the sum of the values (as well as the average). This will likely tell you what's different. If I had to guess, I'd suspect there's a blank line somewhere, so the counts are different. – Jonathan Leffler Sep 07 '12 at 00:14

2 Answers


I think the bug is that your first awk command (the one that operates on saru.tmp) skips the invalid lines when it accumulates, but NR still counts every line in the file, so any math that uses NR depends on the number of skipped lines. The loop over 1..NR also touches array indices that were never set, because array[NR] keys the entries by line number. When you remove all of the invalid lines beforehand, the two programs agree. So in the first command, you should use the number of valid lines rather than NR in your math.
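To see the effect in isolation, here is a toy sketch (hypothetical data, not your sar file): one skipped header line is enough to change both the divisor and the array indexing.

```shell
# Toy data (hypothetical): one header line that the pattern skips,
# plus three data lines.
printf 'HEADER\n1\n2\n3\n' > /tmp/toy.txt

# Buggy shape: NR counts all 4 lines, and array[NR] leaves index 1 unset,
# so the END loop also averages in a phantom zero.
awk '$1 ~ /^[0-9]/ {sum += $1; array[NR] = $1}
     END {for (x = 1; x <= NR; x++) sumsq += (array[x] - sum/NR)^2
          print sqrt(sumsq/NR)}' /tmp/toy.txt
# → 1.11803

# Fixed shape: count only matched lines and index the array with that count.
awk '$1 ~ /^[0-9]/ {n++; sum += $1; array[n] = $1}
     END {for (x = 1; x <= n; x++) sumsq += (array[x] - sum/n)^2
          print sqrt(sumsq/n)}' /tmp/toy.txt
# → 0.816497 (the true population std dev of 1, 2, 3)
```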

How about this?

awk '
  $1 ~ /^[01]/ && $6~/^[0-9]/ {
    numrec       += 1
    sum          += $6
    array[numrec] = $6
  } 
  END {
    for(x=1; x<=numrec; x++)
      sumsq += (array[x]-(sum/numrec))^2
    print sqrt(sumsq/numrec)
  }
' saru.tmp
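As a sanity check (a sketch; the heredoc simply recreates the sar -u data from the question), the corrected command should now reproduce the stripped-file result:

```shell
# Recreate the sar -u output from the question, then run the corrected
# command; the result should match the single-column calculation (9.96397).
cat > /tmp/saru.tmp <<'EOF'
Linux 2.6.32-279.el6.x86_64 (progserver)        09/06/2012      _x86_64_        (4 CPU)

11:09:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:10:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:11:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:12:01 PM     all      0.01      0.00      0.05      0.01      0.00     99.93
11:13:01 PM     all      0.01      0.00      0.05      0.00      0.00     99.93
11:14:01 PM     all      0.01      0.00      0.04      0.00      0.00     99.95
11:15:01 PM     all      0.01      0.00      0.06      0.00      0.00     99.92
11:16:01 PM     all      0.01      0.00      2.64      0.01      0.01     97.33
11:17:01 PM     all      0.02      0.00     21.96      0.00      0.08     77.94
11:18:01 PM     all      0.02      0.00     21.99      0.00      0.08     77.91
11:19:01 PM     all      0.02      0.00     22.10      0.00      0.09     77.78
11:20:01 PM     all      0.02      0.00     22.06      0.00      0.09     77.83
11:21:01 PM     all      0.02      0.00     22.10      0.03      0.11     77.75
11:22:01 PM     all      0.01      0.00     21.94      0.00      0.09     77.95
11:23:01 PM     all      0.02      0.00     22.15      0.00      0.10     77.73
11:24:01 PM     all      0.02      0.00     22.02      0.00      0.09     77.87
11:25:01 PM     all      0.02      0.00     22.03      0.00      0.13     77.82
11:26:01 PM     all      0.02      0.00     21.96      0.01      0.14     77.86
11:27:01 PM     all      0.02      0.00     22.00      0.00      0.09     77.89
11:28:01 PM     all      0.02      0.00     21.91      0.00      0.09     77.98
11:29:01 PM     all      0.03      0.00     22.02      0.02      0.08     77.85
11:30:01 PM     all      0.14      0.00     22.23      0.01      0.13     77.48
11:31:01 PM     all      0.02      0.00     22.26      0.00      0.16     77.56
11:32:01 PM     all      0.03      0.00     22.04      0.01      0.10     77.83
Average:        all      0.02      0.00     15.29      0.01      0.07     84.61
EOF
awk '$1 ~ /^[01]/ && $6 ~ /^[0-9]/ {n++; sum += $6; a[n] = $6}
    END {for (x = 1; x <= n; x++) sumsq += (a[x] - sum/n)^2
         print sqrt(sumsq/n)}' /tmp/saru.tmp
# → 9.96397
```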
  • That works great, still trying to figure it out(was never good at math) but thanks very much! – user1601716 Sep 07 '12 at 01:06
  • @user1601716 Don't forget to accept the answer if it resolved your issue – jordanm Sep 07 '12 at 01:59
  • to make it short. trying to work with sar -r column 5. it is giving a wrong answer again... Output is giving 1.68891 but actual deviation is .107374... this is the same command that worked with sar -u..... if you need files I can provide. Just not sure how to make a new 'full' comment... thanks! – user1601716 Sep 07 '12 at 02:19
  • Please provide the new files. And can you check that it works on the "stripped" version of the sar -r command? – Ivan Sep 07 '12 at 03:55

For debugging problems like this, the simplest technique is to print out some basic data. You might print the number of items, and the sum of the values, and the sum of the squares of the values (or sum of the squares of the deviations from the mean). This will likely tell you what's different between the two runs. Sometimes, it might help to print out the values you're accumulating as you're accumulating the data. If I had to guess, I'd suspect you are counting inappropriate lines (blanks, or the decoration lines), so the counts are different (and maybe the sums too).
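In awk, those diagnostics might look like this (a sketch on a hypothetical miniature sar-style file; the filter pattern is the one from the question). Comparing the count between the two runs is what exposes the NR problem.

```shell
# Hypothetical miniature sar-style input: two data lines plus an Average line.
printf '%s\n' \
  '11:10:01 PM all 0.01 0.00 1.00 0.01 0.00 99.93' \
  '11:11:01 PM all 0.01 0.00 3.00 0.00 0.00 99.92' \
  'Average: all 0.02 0.00 2.00 0.01 0.00 84.61' > /tmp/mini.tmp

# Print the count, sum, mean, and sum of squared deviations for column 6.
awk '$1 ~ /^[01]/ && $6 ~ /^[0-9]/ {n++; sum += $6; a[n] = $6}
     END {mean = sum / n
          for (x = 1; x <= n; x++) sumsq += (a[x] - mean)^2
          printf "count=%d sum=%g mean=%g sumsq=%g\n", n, sum, mean, sumsq}' /tmp/mini.tmp
# → count=2 sum=4 mean=2 sumsq=2
```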

I have a couple of (non-standard) programs to do the calculations. Given the 23 relevant lines from the multi-column output in a file data, I ran:

$ colnum -c 6 data | pstats
# Count    = 23
# Sum(x1)  =  3.557200e+02
# Sum(x2)  =  7.785051e+03
# Mean     =  1.546609e+01
# Std Dev  =  1.018790e+01
# Variance =  1.037934e+02
# Min      =  4.000000e-02
# Max      =  2.226000e+01
$

The standard deviation here is the sample standard deviation rather than the population standard deviation; the difference is dividing by (N-1) for the sample and N for the population.
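In awk terms (a sketch on hypothetical example data), the two estimates differ only in that divisor:

```shell
# Classic example data: 2 4 4 4 5 5 7 9 (population std dev is exactly 2).
printf '%s\n' 2 4 4 4 5 5 7 9 > /tmp/col.tmp

awk '{n++; sum += $1; a[n] = $1}
     END {mean = sum / n
          for (x = 1; x <= n; x++) sumsq += (a[x] - mean)^2
          print "population:", sqrt(sumsq / n)        # divide by N
          print "sample:    ", sqrt(sumsq / (n - 1))  # divide by N-1
         }' /tmp/col.tmp
# → population: 2
# → sample:     2.13809
```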

  • It's mine; I wrote it. It's a Perl script (parametric statistics). Send email if you want a copy (see my profile). – Jonathan Leffler Sep 07 '12 at 03:51
  • Seriously?? Why don't you upload the code if it's part of your answer? – insumity Nov 04 '13 at 19:48
  • @foobar: if you want to see the 130 lines of Perl, why don't you ask? I said how to do that in a comment a year ago. – Jonathan Leffler Nov 04 '13 at 20:17
  • @JonathanLeffler I don't consider your answer complete in this sense. Having to send you an email to get a copy is not a valid answer for me. Either your answer includes the code or it doesn't. I don't get this "send me an email" approach. Why? Of course it's your code and you can give it away or not, but in that case just don't answer. – insumity Nov 04 '13 at 20:44
  • @foobar: It was more a question of not wanting to clutter the answer with inscrutable Perl code when the question is about `awk`. And my code is coincidental to the answer, which explains how to debug the problem by printing appropriate values. Teaching how to fish is supposed to be more useful than giving fish. And there must be many sources for programs that produce basic statistics; I certainly make no claims for mine being uniquely useful -- it's just what I've got on hand. – Jonathan Leffler Nov 04 '13 at 20:53