I have a shell script:
dir=$1
cd "$dir" || exit 1
grep -P -o '(?<=<rating>).*' * |
awk -F: '{A[$1]+=$2;L[$1]++;next}
END{for(i in A){print i, A[i]/L[i]}}' | sort -nr -k2 |
awk '{ sub(/\.dat/, " "); print }'
which averages the numbers that follow the <rating> field in each file of my folder. Now I need to calculate the standard deviation of those numbers rather than the average: sum the squared difference of each rating from the file's mean, divide that by the sample size minus 1, and take the square root. I do not need this for every file in the folder, only for two specific files, hotel_188937.dat and hotel_203921.dat. Here is an example of the contents of one of these files:
<Overall Rating>
<Avg. Price>$155
<URL>
<Author>Jeter5
<Content>I hope we're not disappointed! We enjoyed New Orleans...
<Date>Dec 19, 2008
<No. Reader>-1
<No. Helpful>-1
<rating>4
<Value>-1
<Rooms>3
<Location>5
<Cleanliness>3
<Check in / front desk>5
<Service>5
<Business service>5
<Author>...
repeat fields again...
The first file has a sample size of 127 and a mean of 4.78; the second has a sample size of 324 and a mean of 4.78. Is there any way I can alter my script to calculate the standard deviation for these two specific files rather than the average for every file in my directory? Thanks for your time.
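
To make the goal concrete, here is a rough sketch of the calculation I am after, not something verified against my real data. It assumes every <rating> tag starts its line (as in the sample above) and is followed only by a number. The function name rating_stddev is made up for illustration. It relies on the identity that the sum of squared differences from the mean equals the sum of squares minus n times the mean squared, so one pass over each file is enough.

```shell
#!/bin/sh
# Sketch only: sample standard deviation of the <rating> values, per file.
# rating_stddev is a hypothetical helper name, not part of my original script.
# Assumption: each <rating> tag begins its line and is followed by a number.
rating_stddev() {
    awk '
        /^<rating>/ {
            sub(/^<rating>/, "")           # strip the tag, leaving the number
            n[FILENAME]++                  # sample size per file
            sum[FILENAME]   += $0          # running sum of ratings
            sumsq[FILENAME] += $0 * $0     # running sum of squared ratings
        }
        END {
            for (f in n) {
                if (n[f] < 2) continue     # n-1 would be zero, so skip
                mean = sum[f] / n[f]
                # sum((x - mean)^2) equals sumsq - n * mean^2
                var = (sumsq[f] - n[f] * mean * mean) / (n[f] - 1)
                if (var < 0) var = 0       # guard against rounding error
                name = f
                sub(/\.dat$/, "", name)    # drop the .dat suffix for display
                # columns: file, sample size, mean, standard deviation
                printf "%s %d %.2f %.4f\n", name, n[f], mean, sqrt(var)
            }
        }
    ' "$@"
}
```

With this in place I would expect to call it as: cd "$1" && rating_stddev hotel_188937.dat hotel_203921.dat, which should print one line per file with the name, sample size, mean, and standard deviation.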