I have data with 500 fields in each row (500 columns) and 5000 rows. I want to compute the standard deviation of each row as output.

Input example

3  0  2  ...(496 more values)...   1
4  1  0  ...(496 more values)...   4
1  3  0  ...(496 more values)...   2

Expected output

0.571 (std for values from the first row)
0.186 (std values from the second row)
0.612 (std values from the third row)

I found something similar, but it does not fit my case (it computes the std for each column): Compute average and standard deviation with awk

I am thinking of computing the sum of each row to get its mean, then for every field doing std[row] += ($i - mean[row])^2, and at the end taking sqrt(std[row]/(500-1)), but then I would probably have to create an array for every row (5000 arrays).

Maybe I should transpose the data, turning rows into columns and columns into rows?

Edit:

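Yes, this works fantastically (updated with the fixes from the comments):

#!/bin/bash
awk 'function std1() {
    s = 0; t = 0                      # reset the accumulators for every row
    for (i = 1; i <= NF; i++)
        s += $i
    mean = s / NF
    for (i = 1; i <= NF; i++)
        t += (mean - $i) * (mean - $i)
    return sqrt(t / NF)               # divide by the number of fields, not by their sum
    }
    { print std1() }' data.txt >> std.txt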
Jakub
  • It's not clear why you would need more than one array, since from the problem description it seems that you can discard all the work for row 1 as soon as you move on to row 2. And you don't even really need any additional arrays, since awk will already have the fields for you in $1 .. $NF. – William Pursell Apr 27 '21 at 13:00
  • It is perfectly possible to iterate on the fields with `for (i = 1; i <= NF; i++) // use $i here` without having to create an array. – Pierre François Apr 27 '21 at 13:08
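
As a sketch of what the comments above describe, each row can be reduced in a single loop over `$1 .. $NF`, with nothing kept between rows. Note that this uses the one-pass sum-of-squares identity, which is a different technique from the two-loop approach in the question and can lose precision when the values are large relative to their spread. It computes the population standard deviation (divisor `NF`). The wrapper name `row_sd_onepass` is hypothetical, and the input is assumed to be whitespace-separated numbers.

```shell
#!/bin/sh
# One-pass population standard deviation per row; no arrays are kept between rows.
# row_sd_onepass is a hypothetical helper name used only for this illustration.
row_sd_onepass() {
    awk 'NF {
        s = 0; q = 0                       # running sum and running sum of squares
        for (i = 1; i <= NF; i++) { s += $i; q += $i * $i }
        mean = s / NF
        v = q / NF - mean * mean           # variance = E[x^2] - (E[x])^2
        if (v < 0) v = 0                   # guard against tiny negative rounding errors
        printf "%.3f\n", sqrt(v)
    }' "$@"
}

# Two small rows instead of the 500-column file
printf '2 4 4 4 5 5 7 9\n1 1 1 1\n' | row_sd_onepass
```

On these two rows it prints 2.000 and 0.000. For the real data the function can be given a file name instead of piped input, e.g. `row_sd_onepass data.txt`.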

1 Answer

I won't vouch for the calculation, but you could just do:

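awk 'function sigma(   i, s, t, mean) {   # extra parameters serve as local variables
    for (i = 1; i <= NF; i++)
        s += $i
    mean = s / NF
    for (i = 1; i <= NF; i++)
        t += (mean - $i) * (mean - $i)
    return sqrt(t / NF)                   # population standard deviation of the row
    }
    { print sigma() }' input-path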
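
The function above divides by `NF`, which gives the population standard deviation, whereas the question mentions dividing by 500-1, which is the sample standard deviation. If that is what is wanted, a minimal sketch of the variant could look like this. The names `row_sample_sd` and `sample_sd` are hypothetical, all fields are assumed to be numeric, and returning 0 for rows with fewer than two values is my own assumption about what is sensible there.

```shell
#!/bin/sh
# Sample standard deviation per row (divisor NF-1), assuming every field is numeric.
# row_sample_sd is a hypothetical helper name, not something from the thread.
row_sample_sd() {
    awk 'function sample_sd(   i, s, t, mean) {   # extra parameters act as locals
        if (NF < 2) return 0                      # assumption: report 0 for rows with fewer than two values
        for (i = 1; i <= NF; i++) s += $i
        mean = s / NF
        for (i = 1; i <= NF; i++) t += ($i - mean) * ($i - mean)
        return sqrt(t / (NF - 1))
    }
    NF { printf "%.3f\n", sample_sd() }' "$@"
}

# Two small rows instead of the 500-column file
printf '2 4 4 4 5 5 7 9\n1 2 3 4\n' | row_sample_sd
```

With 500 values per row the two divisors give nearly the same result, so the choice mostly matters for short rows.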
mhawke
William Pursell
  • Why is the mean calculated but not used? It should be `mean-$i`. – mhawke Apr 27 '21 at 13:06
  • @mhawke Thanks! Precisely why I refused to vouch for the calculation! – William Pursell Apr 27 '21 at 13:08
  • Also, it might be a good idea to initialise variables s and t to 0 as these are reused in each iteration. Or you could declare them as "local" in the arg list. – mhawke Apr 27 '21 at 13:22
  • Thank you so much. @mhawke I put the edited script in the question with the variables initialised; now it should be fine. – Jakub Apr 27 '21 at 13:28
  • @Mark: I just made one other change that you might not have noticed: the result should be divided by the number of data points (`NF`), not their sum. – mhawke Apr 27 '21 at 13:34
  • @mhawke Yes, you're right, thanks. It changes my results a lot. – Jakub Apr 27 '21 at 13:37
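
To see how much the divisor discussed in the last comments matters, here is a small check on a row whose population standard deviation is easy to work out by hand: the values 2 4 4 4 5 5 7 9 have mean 5 and squared deviations summing to 32, so dividing by the 8 data points gives sqrt(4) = 2. The wrapper name `compare_divisors` is hypothetical and exists only for this illustration.

```shell
#!/bin/sh
# Compare the correct divisor (NF) with the mistaken one (the row sum).
# compare_divisors is a hypothetical helper name used only for this check.
compare_divisors() {
    awk 'NF {
        s = 0; t = 0
        for (i = 1; i <= NF; i++) s += $i
        mean = s / NF
        for (i = 1; i <= NF; i++) t += ($i - mean) * ($i - mean)
        # s is also the row sum here, which is what the earlier script divided by
        printf "by NF: %.3f, by sum: %.3f\n", sqrt(t / NF), sqrt(t / s)
    }' "$@"
}

printf '2 4 4 4 5 5 7 9\n' | compare_divisors
```

This prints `by NF: 2.000, by sum: 0.894`, so the two versions disagree by more than a factor of two on this row.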