Compute standard deviation for each row in awk

Question

I have data with 500 fields in each row (500 columns) and 5000 rows. I want to compute the standard deviation of each row and print one value per row.

Input example

3  0  2  ...(496 other values)...   1
4  1  0  ...(496 other values)...   4
1  3  0  ...(496 other values)...   2

Expected output

0.571 (std of the values in the first row)
0.186 (std of the values in the second row)
0.612 (std of the values in the third row)

I found something similar, but it does not fit my case (it computes the std for each column): "Compute average and standard deviation with awk".

My idea is to compute the sum of each row to get the average, then for every field do std[i] += ($i - sum[i])^2, and at the end take sqrt(std[i]/(500-1)), but then I would probably have to create an array for every row (5000 arrays).

Maybe I should transpose the data, turning rows into columns and columns into rows?

Edit:

Yes, this works great:

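#!/bin/bash
awk 'function std1() {
    # reset the accumulators for every row (they are global variables)
    s=0; t=0;
    for (i=1; i<=NF; i++)
        s += $i;
    mean = s / NF;
    for (i=1; i<=NF; i++)
        t += (mean-$i)*(mean-$i);
    # divide by the number of values (NF), not by their sum
    return sqrt(t / NF)
    }
    { print std1() }' data.txt >> std.txt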

It's not clear why you would need more than one array, since from the problem description it seems that you can discard all the work for row 1 as soon as you move on to row 2. And you don't even really need any additional arrays, since awk will already have the fields for you in $1 .. $NF. — William Pursell, Apr 27 '21 at 13:00
It is perfectly possible to iterate on the fields with `for (i = 1; i <= NF; i++) // use $i here` without having to create an array. — Pierre François, Apr 27 '21 at 13:08

score 2 · Accepted Answer · edited Apr 27 '21 at 13:29

I won't vouch for the calculation, but you could just do:

awk 'function sigma(   s,   t) { 
    for( i=1; i<=NF; i++)
        s += $i;
    mean = s / NF; 
    for (i=1; i<=NF; i++ )
        t += (mean-$i)*(mean-$i);
    return sqrt(t / NF)
    }
    { print sigma()}' input-path
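
Note that the question mentions dividing by 500-1 (the sample standard deviation), while the function above divides by NF (the population standard deviation). A minimal sketch that supports both, assuming whitespace-separated numeric fields; the names `row_std`, `rowstd` and `ddof` are made up for illustration and are not part of the original answer:

```shell
#!/bin/bash
# row_std FILE [DDOF]  (hypothetical helper)
#   DDOF=0 (default) divides by NF   -> population standard deviation
#   DDOF=1           divides by NF-1 -> sample standard deviation
row_std() {
    awk -v ddof="${2:-0}" '
    # i, s, t and mean are declared as locals, so they start empty on every call
    function rowstd(   i, s, t, mean) {
        for (i = 1; i <= NF; i++)
            s += $i
        mean = s / NF
        for (i = 1; i <= NF; i++)
            t += ($i - mean) * ($i - mean)
        return sqrt(t / (NF - ddof))
    }
    # skip rows with too few fields to avoid dividing by zero
    NF > ddof { printf "%.3f\n", rowstd() }' "$1"
}
```

It could then be called as `row_std data.txt 1 > std.txt` for the n-1 version, or `row_std data.txt` for the version that matches the function above.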

answered Apr 27 '21 at 12:58 by William Pursell, edited Apr 27 '21 at 13:29 by mhawke

Why is the mean calculated but not used? It should be `mean-$i`. – mhawke Apr 27 '21 at 13:06
@mhawke Thanks! Precisely why I refused to vouch for the calculation! – William Pursell Apr 27 '21 at 13:08
Also, it might be a good idea to initialise variables s and t to 0 as these are reused in each iteration. Or you could declare them as "local" in the arg list. – mhawke Apr 27 '21 at 13:22
Thank you so much. @mhawke I put the edited script in the question with the variables initialised; now it should be fine. – Jakub Apr 27 '21 at 13:28
@Mark: I just made one other change that you might not have noticed: the result should be divided by the number of data points (`NF`), not their sum. – mhawke Apr 27 '21 at 13:34
@mhawke Yes, you're right. It changes my results a lot. Thanks – Jakub Apr 27 '21 at 13:37
