I have many .txt files in a folder. They are full of statistics, and have a name that's representative of the experiment those statistics are about.
exp_1_try_1.txt
exp_1_try_2.txt
exp_1_try_3.txt
exp_2_try_1.txt
exp_2_try_2.txt
exp_other.txt
In those files, I need to find the value of a variable with a specific name, and use them to calculate some statistics: min, max, avg, std dev and median.
The variable is a decimal value and dot "." is used as a decimal separator. No scientific notation, although it would be nice to handle that as well.
#in file exp_1_try_1.txt
var1=30.523
var2=0.6
#in file exp_1_try_2.txt
var1=78.98
var2=0.4
#in file exp_1_try_3.txt
var1=78.100
var2=1.1
In order to do this, I'm using bash. Here's an old script I made before my bash skills got rusty. It calculates the average of an integer value.
#!/bin/bash
folder=$1
varName="nHops"
cd "$folder"
grep -r -n -i --include="*_out.txt" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+))|.*/\2/' | awk '{count1+=$1; count2+=$1+1}END{print "avg hops:",count1/NR; print "avg path length:",count2/NR}' RS="\n"
I'd like to modify this script to:
- support finding decimal values of variable length
- calculate more statistics
In particular std dev and median may require special attention.
Update: Here's my try to solve the problem using only UNIX tools, partially inspired by this answer. It works fine, except it does not calculate the standard deviation. The chosen answer uses Perl and is probably much faster.
#!/bin/bash
folder=$1
varName="var1"
cd "$folder"
grep -r -n -i --include="exp_1_run_*" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+(\.[0-9]*)?))/\2/' | sort -n | awk '
BEGIN {
count = 0;
sum = 0;
}
{
a[count++] = $1;
sum += $1;
}
END {
avg = sum / count;
if( (count % 2) == 1 ) {
median = a[ int(count/2) ];
} else {
median = ( a[count/2] + a[count/2-1] ) / 2;
}
OFS="\t";
OFMT="%.6f";
print avg, median, a[0], a[count-1];
}
'