Bash - Find variable in many .txt files and calculate statistics

Question

I have many .txt files in a folder. They are full of statistics, and have a name that's representative of the experiment those statistics are about.

exp_1_try_1.txt
exp_1_try_2.txt
exp_1_try_3.txt

exp_2_try_1.txt
exp_2_try_2.txt

exp_other.txt

In those files, I need to find the values of a variable with a specific name and use them to calculate some statistics: min, max, average, standard deviation and median.

The values are decimal numbers with a dot "." as the decimal separator. They don't use scientific notation, although it would be nice to handle that as well.

#in file exp_1_try_1.txt
var1=30.523
var2=0.6

#in file exp_1_try_2.txt
var1=78.98
var2=0.4

#in file exp_1_try_3.txt
var1=78.100
var2=1.1

In order to do this, I'm using bash. Here's an old script I made before my bash skills got rusty. It calculates the average of an integer value.

#!/bin/bash

folder=$1
varName="nHops"

cd "$folder"
grep -r -n -i --include="*_out.txt" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+))|.*/\2/' | awk '{count1+=$1; count2+=$1+1}END{print "avg hops:",count1/NR; print "avg path length:",count2/NR}' RS="\n"

I'd like to modify this script to:

- support finding decimal values of variable length
- calculate more statistics

In particular std dev and median may require special attention.

Update: Here's my attempt to solve the problem using only UNIX tools, partially inspired by this answer. It works fine, except that it does not calculate the standard deviation. The chosen answer uses Perl and is probably much faster.

#!/bin/bash

folder=$1
varName="var1"

cd "$folder"
grep -r -n -i --include="exp_1_run_*" "$varName" . | sed -E 's/(.+'"$varName"'=([0-9]+(\.[0-9]*)?))/\2/' | sort -n | awk '
  BEGIN {
    count = 0;
    sum = 0;
  }
  {
    a[count++] = $1;
    sum += $1;
  }
  END {
    avg = sum / count;
    if( (count % 2) == 1 ) {
      median = a[ int(count/2) ];
    } else {
      median = ( a[count/2] + a[count/2-1] ) / 2;
    }
    OFS="\t";
    OFMT="%.6f";
    print avg, median, a[0], a[count-1];
  }
'

Don't have time to offer a more complete answer, but I think you'll find some useful pointers here: http://stackoverflow.com/q/9789806/143319 — Matt Parker, Jan 26 '15 at 16:59

glenn jackman · Accepted Answer · 2015-01-26T22:42:36.910

To extract just the values, use the -o and -P grep options:

grep -rioPh --include="*_out.txt" "(?<=${varName}=)[\d.]+" .

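Here -o prints only the matched text, -P enables Perl-compatible regular expressions (needed for the (?<=...) look-behind), and -h suppresses the file names. The pattern finds text like nHops=1.234 and prints just 1.234.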

Given your sample data:

$ var="var1"
$ grep -oPh "(?<=$var=)[\d.]+" exp_1_try_{1,2,3}.txt 
30.523
78.98
78.100
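
The question also asks about scientific notation and, as written, the character class [\d.]+ would accept strings such as 1.2.3. A stricter pattern is sketched below under a few assumptions: GNU grep built with PCRE support, a hypothetical helper name extract_values, and a variable name that contains no regex metacharacters. \K discards everything matched so far, so only the number is printed and no look-behind is needed.

```shell
#!/bin/bash
# Hypothetical helper (not from the original answer): print every value
# assigned to the variable named in $1, searching the remaining arguments.
# Assumes GNU grep with -P support and a name free of regex metacharacters.
extract_values() {
  local name=$1
  shift
  # \b        keeps "var1" from matching inside "myvar1"
  # \s*=\s*   allows optional whitespace around the equals sign
  # \K        drops the "name =" prefix from the reported match
  # the rest  matches an optional sign, digits with an optional fraction,
  #           and an optional exponent such as e-3 or E+10
  grep -ohP "\b${name}\s*=\s*\K[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?" "$@"
}

# Usage (file names taken from the question):
#   extract_values var1 exp_1_try_*.txt
```

Add -r and --include as in the command above to search a directory tree instead of an explicit file list.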

To output some stats, you should be able to pipe those numbers into your favourite stats program. Here's an example:

grep -oPh "(?<=$var=)[\d.]+" f? | 
perl -MStatistics::Basic=:all -le '
    @data = <>; 
    print "mean: ", mean(@data);
    print "median: ", median(@data);
    print "stddev: ", stddev(@data)
'

mean: 62.53
median: 78.1
stddev: 22.64

Of course, since this is perl, we don't need grep or sed at all:

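perl -MStatistics::Basic=:all -MList::Util=min,max -lne '
        # collect every value of the variable; the shell splices its name into the regex
        /'"$var"'\s*=\s*(\d+\.?\d*)/ and push @data, $1
    # with -n, the lone "} END {" closes the implicit read loop and opens an END block
    } END {
        print "mean: ", mean(@data);
        print "median: ", median(@data);
        print "stddev: ", stddev(@data);
        print "min: ", min(@data);
        print "max: ", max(@data);
' exp_1_try_*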

mean: 62.53
median: 78.1
stddev: 22.64
min: 30.523
max: 78.98
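
Statistics::Basic is a CPAN module rather than part of core Perl, so it may need to be installed first. Where that is not an option, the same five numbers can be computed with sort and awk. The following is a minimal sketch, not part of the original answer: the function name stats is hypothetical, sort -g (general numeric sort) is assumed to be available, and the standard deviation is the population one (dividing by N), which is what the 22.64 above corresponds to.

```shell
#!/bin/bash
# Hypothetical helper: read one number per line on stdin and print
# min, max, mean, median and population standard deviation.
# LC_ALL=C forces "." as the decimal separator in both tools.
stats() {
  LC_ALL=C sort -g | LC_ALL=C awk '
    NF { v[++n] = $1 + 0; sum += v[n] }     # input arrives sorted, blank lines skipped
    END {
      if (n == 0) { print "no data"; exit 1 }
      mean = sum / n
      # second pass over the stored values: sum of squared deviations
      for (i = 1; i <= n; i++) { d = v[i] - mean; ss += d * d }
      # median: middle element, or the mean of the two middle elements
      if (n % 2) median = v[(n + 1) / 2]
      else       median = (v[n / 2] + v[n / 2 + 1]) / 2
      # divide ss by (n - 1) instead of n for the sample standard deviation
      printf "min: %g\nmax: %g\nmean: %.2f\nmedian: %g\nstddev: %.2f\n",
             v[1], v[n], mean, median, sqrt(ss / n)
    }'
}

# Usage, reusing the grep command shown earlier:
#   grep -oPh "(?<=$var=)[\d.]+" exp_1_try_*.txt | stats
```

On the three sample values this prints the same min, max, mean, median and stddev as the Perl version. Change the printf formats (for example to %.6f) if more digits are wanted.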

edited Jan 26 '15 at 22:42

answered Jan 26 '15 at 18:06

glenn jackman

I tried to make a pattern for an arbitrary number of spaces around the equals sign, but I get the error "lookbehind assertion is not fixed length". Here's my code: `grep -rioPh --include="exp_1_run_*" "(?<=${varName}\s*=\s*\K)[\d]+\.?[\d]*` – Agostino Jan 26 '15 at 21:25
If you use `\K`, you don't need the look-behind. `grep ... "${varName}\s*=\s*\K[\d]+\.?[\d]*"` -- may have to double the backslashes in double quotes – glenn jackman Jan 26 '15 at 21:37
You can do away with sed and grep altogether – glenn jackman Jan 26 '15 at 22:42
I used sprintf to round the number to a specific precision before printing it. Here's the trick: `print sprintf("mean: %.6f", mean(@data))` – Agostino Jan 27 '15 at 10:22