How do I calculate the standard deviation in my shell script?

Question

I have a shell script:

dir=$1
cd "$dir" || exit 1
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}
END{for(i in A){print i, A[i]/L[i]}}' | sort -nr -k2 |
awk '{ sub(/\.dat/, " "); print }'

which averages the numbers that follow the <rating> field in each file of my folder. Now I need the standard deviation of those numbers instead of the average: sum the squared difference of each rating from the file's mean, divide by the sample size minus 1, and take the square root. I do not need to do this for every file in the folder, only for 2 specific files, hotel_188937.dat and hotel_203921.dat. Here is an example of the contents of one of these files:

<Overall Rating>
<Avg. Price>$155
<URL>

<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5

<Author>...
repeat fields again...

The sample size of the first file is 127 with a mean of 4.78, compared with a sample size of 324 and a mean of 4.78 for the second file. Is there any way to alter my script so that it calculates the standard deviation for these two specific files, rather than the average for every file in my directory? Thanks for your time.

Check out http://stackoverflow.com/questions/18786073/compute-average-and-standard-deviation-with-awk — bartvanraaij, Feb 25 '16 at 13:06

score 2 · Accepted Answer · answered Feb 25 '16 at 14:24

You can do it all in one awk script:

$ awk -F'>' '
    $1=="<rating" {k=FILENAME; sub(/\.dat$/,"",k);   # key: file name without .dat
                   s[k]+=$2; ss[k]+=$2^2; c[k]++}
    END {for(i in s)
           print i, (m=s[i]/c[i]), sqrt(ss[i]/c[i]-m^2)}' r1.dat r2.dat

r1 2.5 1.11803
r2 3 1.41421

s is the sum, ss the sum of squares, c the count, and m the mean. Note that this computes the population standard deviation, not the sample standard deviation. For the latter, scale the variance by count/(count-1) before taking the square root.
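As a rough sketch of that adjustment, the same one-pass approach can divide by (count-1) directly. The function name sample_sd is made up for illustration, and the guard for a single rating (reporting 0) is an assumption about what is wanted in that case:

```shell
# sample_sd FILE...: for each file, print its name (minus .dat),
# the mean, and the sample standard deviation of its <rating> lines.
sample_sd() {
    awk -F'>' '
        $1 == "<rating" {
            k = FILENAME; sub(/\.dat$/, "", k)
            s[k] += $2; ss[k] += $2^2; c[k]++
        }
        END {
            for (i in s) {
                m = s[i] / c[i]
                # sample variance: (sum of squares - n*mean^2) / (n-1)
                v = (c[i] > 1) ? (ss[i] - c[i] * m^2) / (c[i] - 1) : 0
                if (v < 0) v = 0    # guard against floating-point round-off
                print i, m, sqrt(v)
            }
        }' "$@"
}
```

It would be called as sample_sd hotel_188937.dat hotel_203921.dat. For a file whose ratings are 1, 2, 3 and 4 it prints a mean of 2.5 and a sample standard deviation of 1.29099, against 1.11803 for the population version.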

score 1 · Answer 2 · answered Feb 25 '16 at 13:14

Yes.

The * in the grep line tells it to search in all the files.

Change the line

grep -P -o '(?<=<rating>).*' * |

to

grep -P -o '(?<=<rating>).*' hotel_188937.dat hotel_203921.dat |
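With the file list narrowed like this, the averaging awk stage still has to be swapped for one that computes the standard deviation. The following is only a sketch of one way to do that, following the definition in the question (squared differences from the mean, divided by sample size minus 1). The function name rating_sd is hypothetical, -H is added so grep prints the file name even for a single file, and it assumes file names contain no colon:

```shell
# rating_sd FILE...: print file name, mean and sample standard deviation
# of the values that follow <rating> in each file.
rating_sd() {
    grep -H -P -o '(?<=<rating>).*' "$@" |
    awk -F: '
        # input lines look like "hotel_188937.dat:4"
        { n[$1]++; sum[$1] += $2; val[$1, n[$1]] = $2 }
        END {
            for (f in n) {
                mean = sum[f] / n[f]
                sq = 0
                for (j = 1; j <= n[f]; j++)
                    sq += (val[f, j] - mean)^2    # squared difference from the mean
                sd = (n[f] > 1) ? sqrt(sq / (n[f] - 1)) : 0
                print f, mean, sd
            }
        }'
}
```

It would be run from inside the data directory as rating_sd hotel_188937.dat hotel_203921.dat. The order of the output lines is not guaranteed, so pipe it through sort if that matters.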

answered Feb 25 '16 at 13:14

neuhaus

