0

I have the following dataset with multiple different ids in column 1 and I wish to calculate the mean and standard deviation for column 2 for each id

123456 0.1234
123456 0.5673
123456 0.0011
123456 -0.0947
123457 0.9938
123457 0.0001
123457 0.2839

I have the following code to get the mean per id but struggling to amend this to get the SD as well

awk '{sum4[$1] += $2; count4[$1]++}; END{ for (id in sum4) { print id, sum4[id]/count4[id] } }' <   want3.txt > mean_id.txt

The desired output is a file of id mean and sd

123456 0.149275 0.2926
123457 0.425933 0.5118

Any advice would be much appreciated. Thanks

dcp1234
  • 65
  • 6
  • Please add your desired output (no description) for that sample input to your question (no comment). – Cyrus Sep 05 '20 at 21:36
  • Possible duplciate of [Compute average and standard deviation with awk](https://stackoverflow.com/questions/18786073/compute-average-and-standard-deviation-with-awk) – John1024 Sep 05 '20 at 21:37
  • Also possible duplicate of [https://stackoverflow.com/questions/12310075/awk-search-and-calculate-standard-deviation-different-results](https://stackoverflow.com/questions/12310075/awk-search-and-calculate-standard-deviation-different-results) – John1024 Sep 05 '20 at 21:38
  • Hi Its not a duplicate- both ask about SD but this is asking per unique identifier rather than by column – dcp1234 Sep 05 '20 at 22:31
  • [GNU datamash](https://www.gnu.org/software/datamash/) makes it trivial if you can install new programs: `datamash -W -g1 mean 2 sstdev 2 < input.txt` – Shawn Sep 05 '20 at 23:07
  • Btw.: there are two different standard deviations: *Standard deviation of the population* (division by n) and *sample standard deviation* (division by n-1). In your case you search for *sample standard deviation*. – Cyrus Sep 05 '20 at 23:25

2 Answers2

2

here is another approach which is more memory efficient but possibly less precision for large mean.

$ awk -v t=1 '{s[$1]+=$2; ss[$1]+=$2*$2; c[$1]++} 
          END {for(k in s) print k,m=s[k]/c[k],sqrt((ss[k]-m^2*c[k])/(c[k]-t))}' file

123456 0.149275 0.292628
123457 0.425933 0.51185

this computes the sample standard deviation, if you have the full distribution not just the samples you can set t=0 to get population standard deviation which will be slightly lower but for large N they are practically equivalent (within the error of margin due to measurement errors).

karakfa
  • 66,216
  • 7
  • 41
  • 56
1

With GNU awk. Derived from Ivan's answer with standard deviation of the population (division by n). I switched to sample standard deviation (division by n-1).

awk '
  {
    numrec[$1]           += 1
    sum[$1]              += $2
    array[numrec[$1]]     = $2
    array[$1,numrec[$1]]  = $2
  }
  END {
    for(w in numrec) {
      for(x=1; x<=numrec[w]; x++)
        sumsq[w] += ((array[w,x]-(sum[w]/numrec[w]))^2)
      printf("%d %.6f %.4f\n", w, sum[w]/numrec[w], sqrt(sumsq[w]/(numrec[w]-1)))
    }
  }
' file

Output:

123456 0.149275 0.2926
123457 0.425933 0.5118
Cyrus
  • 84,225
  • 14
  • 89
  • 153