Awk Standard deviation for each unique identifier

Question

I have the following dataset with multiple different ids in column 1 and I wish to calculate the mean and standard deviation for column 2 for each id

123456 0.1234
123456 0.5673
123456 0.0011
123456 -0.0947
123457 0.9938
123457 0.0001
123457 0.2839

I have the following code to get the mean per id but struggling to amend this to get the SD as well

awk '{sum4[$1] += $2; count4[$1]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' <   want3.txt > mean_id.txt

The desired output is a file of id mean and sd

123456 0.149275 0.2926
123457 0.425933 0.5118

Any advice would be much appreciated. Thanks

Please add your desired output (no description) for that sample input to your question (no comment). — Cyrus, Sep 05 '20 at 21:36
Possible duplciate of [Compute average and standard deviation with awk](https://stackoverflow.com/questions/18786073/compute-average-and-standard-deviation-with-awk) — John1024, Sep 05 '20 at 21:37
Also possible duplicate of [https://stackoverflow.com/questions/12310075/awk-search-and-calculate-standard-deviation-different-results](https://stackoverflow.com/questions/12310075/awk-search-and-calculate-standard-deviation-different-results) — John1024, Sep 05 '20 at 21:38
Hi Its not a duplicate- both ask about SD but this is asking per unique identifier rather than by column — dcp1234, Sep 05 '20 at 22:31
[GNU datamash](https://www.gnu.org/software/datamash/) makes it trivial if you can install new programs: `datamash -W -g1 mean 2 sstdev 2 < input.txt` — Shawn, Sep 05 '20 at 23:07
Btw.: there are two different standard deviations: *Standard deviation of the population* (division by n) and *sample standard deviation* (division by n-1). In your case you search for *sample standard deviation*. — Cyrus, Sep 05 '20 at 23:25

score 2 · Answer 1 · answered Sep 06 '20 at 02:51

here is another approach which is more memory efficient but possibly less precision for large mean.

$ awk -v t=1 '{s[$1]+=$2; ss[$1]+=$2*$2; c[$1]++} 
          END {for(k in s) print k,m=s[k]/c[k],sqrt((ss[k]-m^2*c[k])/(c[k]-t))}' file

123456 0.149275 0.292628
123457 0.425933 0.51185

this computes the sample standard deviation, if you have the full distribution not just the samples you can set t=0 to get population standard deviation which will be slightly lower but for large N they are practically equivalent (within the error of margin due to measurement errors).

score 1 · Answer 2 · answered Sep 05 '20 at 23:32

With GNU awk. Derived from Ivan's answer with standard deviation of the population (division by n). I switched to sample standard deviation (division by n-1).

awk '
  {
    numrec[$1]           += 1
    sum[$1]              += $2
    array[numrec[$1]]     = $2
    array[$1,numrec[$1]]  = $2
  }
  END {
    for(w in numrec) {
      for(x=1; x<=numrec[w]; x++)
        sumsq[w] += ((array[w,x]-(sum[w]/numrec[w]))^2)
      printf("%d %.6f %.4f\n", w, sum[w]/numrec[w], sqrt(sumsq[w]/(numrec[w]-1)))
    }
  }
' file

Output:

123456 0.149275 0.2926
123457 0.425933 0.5118

Awk Standard deviation for each unique identifier

2 Answers2