Compute average and standard deviation with awk

Question

I have a 'file.dat' with 24 (rows) x 16 (columns) data.

I have already tested the following awk script that computes de average of each column.

touch aver-std.dat
awk '{   for (i=1; i<=NF; i++) { sum[i]+= $i } }
END { for (i=1; i<=NF; i++ )  
{ printf "%f \n", sum[i]/NR} }' file.dat >> aver-std.dat

The output 'aver-std.dat' has one column with these averages.

Similarly as the average computation I would like to compute the standard deviation of each column of the data file 'file.dat' and write it in a second column of the output file. Namely I would like an output file with the average in the first column and the standard deviation in the second column.

I have been making different tests, like this one

touch aver-std.dat
awk '{   for (i=1; i<=NF; i++) { sum[i]+= $i }}
END { for (i=1; i<=NF; i++ )  
{std[i] += ($i - sum[i])^2 ; printf "%f %f \n", sum[i]/NR, sqrt(std[i]/(NR-1))}}' file.dat >> aver-std.dat

and it writes values in the second column but they are not the correct value of the standard deviation. The computation of the deviation is not right somehow. I would appreciate very much any help. Regards

Hari Menon · Answer 1 · 2016-10-19T22:16:50.283

35

Standard deviation is

stdev = sqrt((1/N)*(sum of (value - mean)^2))

But there is another form of the formula which does not require you to know the mean beforehand. It is:

stdev = sqrt((1/N)*((sum of squares) - (((sum)^2)/N)))

(A quick web search for "sum of squares" formula for standard deviation will give you the derivation if you are interested)

To use this formula, you need to keep track of both the sum and the sum of squares of the values. So your awk script will change to:

    awk '{for(i=1;i<=NF;i++) {sum[i] += $i; sumsq[i] += ($i)^2}} 
          END {for (i=1;i<=NF;i++) {
          printf "%f %f \n", sum[i]/NR, sqrt((sumsq[i]-sum[i]^2/NR)/NR)}
         }' file.dat >> aver-std.dat

edited Oct 19 '16 at 22:16

answered Sep 13 '13 at 12:47

Hari Menon

33,649
14
85
108

3

I only had to add that it was the quasi- standard deviation and has to divide for (NR-1). Regards and thanks. – PLM Sep 13 '13 at 15:14
@PLM, care to explain your last comment? At the `END` block, `NR` has the total number of rows, which makes this the correct solution. – shiri Jul 14 '17 at 16:10
2

@shiri It's the difference between sample standard deviation and population standard deviation. See here for details: https://en.wikipedia.org/wiki/Bessel%27s_correction – Thomas Dwyer III Dec 11 '18 at 23:51

nisetama · Answer 2 · 2019-04-20T22:24:24.150

25

To simply calculate the population standard deviation of a list of numbers, you can use a command like this:

awk '{x+=$0;y+=$0^2}END{print sqrt(y/NR-(x/NR)^2)}'

Or this calculates the sample standard deviation:

awk '{sum+=$0;a[NR]=$0}END{for(i in a)y+=(a[i]-(sum/NR))^2;print sqrt(y/(NR-1))}'

^ is in POSIX. ** is supported by gawk and nawk but not by mawk.

edited Apr 20 '19 at 22:24

answered Jan 17 '16 at 11:04

nisetama

7,764
1
34
21

2

Thanks @nisetama; I used this. For completeness, the mean can be similarly computed with the one-liner `awk '{x+=$0}END{print x/NR}'` – David Bau May 17 '17 at 12:40

score 4 · Answer 3 · edited Jun 18 '14 at 12:36

4

Here is some calculation I've made on a grinder data output file for a long soak test which had to be interrupted:

Standard deviation(biased) + average:

cat <grinder_data_file> | grep -v "1$" | awk -F ', '  '{   sum=sum+$5 ; sumX2+=(($5)^2)} END { printf "Average: %f. Standard Deviation: %f \n", sum/NR, sqrt(sumX2/(NR) - ((sum/NR)^2) )}'

Standard deviation(non-biased) + average:

cat <grinder_data_file>  | grep -v "1$" | awk -F ', '  '{   sum=sum+$5 ; sumX2+=(($5)^2)} END { avg=sum/NR; printf "Average: %f. Standard Deviation: %f \n", avg, sqrt(sumX2/(NR-1) - 2*avg*(sum/(NR-1)) + ((NR*(avg^2))/(NR-1)))}'

edited Jun 18 '14 at 12:36

Last1Here

1,099
2
10
21

answered Jun 18 '14 at 12:14

Romeu Flores

121
4

There is also another way to calculate the standard deviation taking into account that floating point arithmetics may lead to cancellation. Check for B. P. Welford (1962).["Note on a method for calculating corrected sums of squares and products"](http://www.jstor.org/stable/1266577?seq=1#page_scan_tab_contents). Technometrics 4(3):419–420 – Lars Hadidi Nov 09 '15 at 14:28

score -1 · Answer 4 · answered Sep 13 '13 at 12:35

-1

Your script should somehow be in this form instead:

awk '{
    sum = 0
    for (i=1; i<=NF; i++) {
        sum += $i
    }
    avg = sum / NF
    avga[NR] = avg
    sum = 0
    for (i=1; i<=NF; i++) {
        sum += ($i - avg) ^ 2
    }
    stda[NR] = sqrt(sum / NF)
}

END { for (i = 1; i in stda; ++i) { printf "%f %f \n", avga[i], stda[i] } }' file.dat >> aver-std.dat

answered Sep 13 '13 at 12:35

konsolebox

72,135
12
99
105

This is incorrect for the question asked, as it calculates the std dev for every **row**, not every column as the OP requested. – shiri Jul 14 '17 at 16:08

Compute average and standard deviation with awk

4 Answers4

Linked