0

I'm operating on large data sets which typically have an average value and an uncertainty. Typically when published only a single digit of uncertainty is shown and the corresponding value is rounded to that decimal place. The uncertainty is then wrapped in parentheses and appended to the shortened average string.

For instance:

Avg: 101.0513213 SD: 0.33129

...would give:

101.1(3)

In practice this sounds simple enough, but it actually gets somewhat complex given that you have to first calculate the single digit standard deviation, and then use that to determine what decimal digit you're rounding the average value to. Add in the case of rounding to 10 (i.e. 0.094 rounds to 0.09, but 0.095 rounds to 0.1 changing the digit to round to) and the fact that you're rounding and not truncating and it's somewhat onerous to implement in principle.

I have a set of BASH Script functions that pulls it off using a mix of printf, bc, sed, echo calls. It works, but the resulting calculations are very slow. Here's a sample that you can try for yourself. You should be able to see how slow it goes:

#!/bin/bash

function CleanBC() {
    echo "${1/[eE][+][0]/*10^}" | bc -l | \
        sed -e 's#^\(-*\)\.#\10\.#g'
}

function Log10Float() {
    echo $( CleanBC "l($1)/l(10)" )
}

function TruncateDecimal() {
    echo $1 | sed -e 's#\.[0-9]*##g' 
}

function PowerOf10() {
    absPow=$( echo $1 | sed 's#-##g' )
    if [[ $( CleanBC "$1==0" ) -eq '1' ]]; then
        echo "1"
    elif [[ $( CleanBC "$1>0" ) -eq '1' ]]; then
        echo "1"$(printf '0%.0s' $( seq 1 $absPow ) )
    elif [[ $( CleanBC "$1==-1" ) -eq '1' ]]; then
        echo "0.1"
    elif [[ $( CleanBC "$1<-1" ) -eq '1' ]]; then
        echo "0."$(printf '0%.0s' $( seq 2 $absPow ) )"1"
    fi
}

function RoundArbitraryDigit() {
    pow=$( PowerOf10 $2)
    offset=$( CleanBC "if ($1>=0) {0.5} else {-0.5}" )
    absPow=$( echo $2 | sed -e 's#-##g' )
    invPow=$( PowerOf10 $( CleanBC "$2*-1" ) )
    shiftedVal=$( TruncateDecimal $( CleanBC "$invPow*$1+$offset" ) )
    val=$( CleanBC "scale=15;$shiftedVal*$pow" )
    echo $( printf "%.${absPow}f" $val )
}

function Flt2Int() {
    RoundArbitraryDigit $1 0 
}

function Round() {
    for v in ${@:3}; do
        div=$( CleanBC "$v / $2" )

        case $( echo $1 | tr '[:lower:]' '[:upper:]' ) in
            CLOSEST)
                val=$( TruncateDecimal $( Flt2Int $div ) );
                ;;
            UP)
                val=$( TruncateDecimal $div );
                ((val++))
                ;;
            DOWN)
                val=$( TruncateDecimal $div );
                ;;
        esac
        echo $( CleanBC "$val * $2" )
    done
}

function Tabulate() {
    roundTo=$( Log10Float $2 )
    roundTo=$( CleanBC "if ($roundTo < 0) {$roundTo -1} else {$roundTo}" )
    roundTo=$( TruncateDecimal $roundTo )
    roundedSD=$( RoundArbitraryDigit $2 $roundTo )
    invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
    if [[ $( CleanBC "($invPow * $roundedSD) == 10" ) -eq '1' ]]; then
        ((roundTo++))
        roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
        invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
    fi
    intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
    pow=$( PowerOf10 $roundTo )
    intSD=$( CleanBC "if ($pow > 10 ) {$pow*$intSD} else {$intSD}" )
    val="$( RoundArbitraryDigit $1 $roundTo )"
    if  [[ $( CleanBC "$roundTo > -1" ) -eq '1' ]]; then
        val=$( echo $val | sed -e 's#\.0*$##g' )
    fi
    echo "$val(${intSD})"
}

Tabulate '-.9782000' '0.0051335'
Tabulate '105.843516' '8.7571141'
Tabulate '0.2581699' '0.0020283'
Tabulate '3.4368211' '0.0739912'

My first question is whether a specific function or chunk of code is slowing the overall calculation more than the rest.

Second, I wanted to ask advice regarding how to improve the speed of the overall code with that first answer in mind.

Third, as a more general question, I'm wondering in cases like this what tools can be used to profile a bash script and identify bottlenecks.

(Note: the CleanBC function is used because sometimes other related functions in the use case generate scientific notation numbers, i.e. 2.41321E+05, etc. Hence this function is necessary to keep bc from failing -- an additional requirement in my use case.)


Thanks to the advice of @Gordon Davisson I have improved my script. To time it better and to detect another fringe case, I modified the closing lines of the old script to:

function test() {
    Tabulate '-.9782000' '0.0051335'
    Tabulate '105.843516' '8.7571141'
    Tabulate '1055.843516' '85.7571141'
    Tabulate '0.2581699' '0.0020283'
    Tabulate '3.4368211' '0.0739912'
}

time test

With the old script I get:

real    0m12.627s
user    0m3.150s
sys     0m9.282s

The new script is:

#!/bin/bash

function CleanBC() {
    val=$(bc -l <<< "${1/*[eE][+][0]/*10^}")
    if [[ $val == -* ]]; then
        echo "${val/#-./-0.}"
    else
        echo "${val/#./0.}"
    fi
}

function Log10Float() {
    CleanBC "l($1)/l(10)"
}

function TruncateDecimal() {
    echo ${1/.*/}
}

function PowerOf10() {
    case $1 in
        10) echo "10000000000" ;;
        9) echo "1000000000" ;;
        8) echo "100000000" ;;
        7) echo "10000000" ;;
        6) echo "1000000" ;;
        5) echo "100000" ;;
        4) echo "10000" ;;
        3) echo "1000" ;;
        2) echo "100" ;;
        1) echo "10" ;;
        0) echo "1" ;;
        -1) echo "0.1" ;;
        -2) echo "0.01" ;;
        -3) echo "0.001" ;;
        -4) echo "0.0001" ;;
        -5) echo "0.00001" ;;
        -6) echo "0.000001" ;;
        -7) echo "0.0000001" ;;
        -8) echo "0.00000001" ;;
        -9) echo "0.000000001" ;;
        -10) echo "0.0000000001" ;;
    esac
}

function RoundArbitraryDigit() {
    pow=$( PowerOf10 $2 )
    absPow=$2;
    absPow=${absPow/#-/}
    if [[ $1 == -* ]]; then
        offset=-0.5
    else
        offset=0.5
    fi
    if [[ $2 == -* ]]; then
        invPow=$( PowerOf10 $absPow )
    elif [[ $2 == 0 ]]; then
        invPow="1"
    else
        invPow=$( PowerOf10 "-${2}" )
    fi
    shiftedVal=$( CleanBC "$invPow*$1+$offset" )
    shiftedVal=${shiftedVal/.*/}
    val=$( CleanBC "scale=15;$shiftedVal*$pow" )
    #printf "%.${absPow}f" $val
    echo $val
}

function Flt2Int() {
    RoundArbitraryDigit $1 0 
}

function Round() {
    for v in ${@:3}; do
        div=$( CleanBC "$v / $2" )

        case "${1^^}" in
            CLOSEST)
                val=$( Flt2Int $div );
                ;;
            UP)
                #truncate the decimal
                val=${div/.*/}
                ((val++))
                ;;
            DOWN)
                #truncate the decimal
                val=${div/.*/}
                ;;
        esac
        CleanBC "$val * $2"
    done
}

function Tabulate() {
    roundTo=$( Log10Float $2 )
    if [[ $roundTo == -* ]]; then
        roundTo=$( CleanBC "$roundTo -1" )
    fi
    roundTo=${roundTo/.*/}
    roundedSD=$( RoundArbitraryDigit $2 $roundTo )
    if [[ $roundTo == -* ]]; then
        invPow=$( PowerOf10 ${roundTo/#-/} )
    elif [[ $roundTo == 0 ]]; then
        invPow="1"
    else
        invPow=$( PowerOf10 "-${roundTo}" )
    fi
    if [[ $( CleanBC "($invPow * $roundedSD)" ) == "10" ]]; then
        ((roundTo++))
        roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
        if [[ $roundTo == -* ]]; then
            invPow=$( PowerOf10 ${roundTo/#-/} )
        elif [[ $roundTo == 0 ]]; then
            invPow="1"
        else
            invPow=$( PowerOf10 "-${roundTo}" )
        fi
    fi
    intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
    pow=$( PowerOf10 $roundTo )
    if [[ $pow != 0.* ]] && [[ $pow != "1" ]]; then
        intSD=$( CleanBC "$pow*$intSD" )
    fi
    val="$( RoundArbitraryDigit $1 $roundTo )"
    if [[ $roundTo != -* ]]; then
        echo "${val/.*/}(${intSD})"
    else
        echo "${val}(${intSD})"
    fi
}

function test() {
    Tabulate '-.9782000' '0.0051335'
    Tabulate '105.843516' '8.7571141'
    Tabulate '1055.843516' '85.7571141'
    Tabulate '0.2581699' '0.0020283'
    Tabulate '3.4368211' '0.0739912'
}

time test

The key difference here is that I eliminated some superfluous shell calls, replaced others with string operations (including eliminating the bc based conditional logic). The new time is:

real    0m2.566s
user    0m0.605s
sys     0m1.619s

That's roughly a five times speedup!

While I'm still thinking of porting to a Python script (as he suggested), for now I'm pretty pleased with my result, which reduces my tabulation script runtime from around 5 hours to about an hour.

Jason R. Mick
  • 5,177
  • 4
  • 40
  • 69
  • 1
    This might help: http://stackoverflow.com/questions/5014823/how-to-profile-a-bash-shell-script. – cdarke Jul 06 '16 at 15:44
  • 2
    Having spent some time looking at this code, sure, there are some tricks to reduce the number of times you call `sed` (by using shell substitution), but really I have to question if `bash` is a sensible tool for this application. Korn shell, for example, supports floating point (since '93), but a language like python would make this so much simpler - and faster. In particular see the `statistics` module https://docs.python.org/3/library/statistics.html – cdarke Jul 06 '16 at 16:00

1 Answers1

1

The best solution would be to write the script in a language that actually supports native floating point math, i.e. pretty much anything except bash. I'll second cdarke's recommendation of Python, but realistically almost anything would be better than a shell script.

The biggest reason for this is that the shell doesn't actually have much capability itself; what it's really good at is launching other programs (bc, sed, etc) to do the actual work. But launching a program is really computationally expensive. In the shell, anything involving an external command, a pipe, or $( ... ) (or its backtick equivalent) will need to create a new process, and that's a huge amount of overhead in the middle of what should be some simple computation. Compare these two snippets:

for ((num=1; num<10000; num++)); do
    result="$(echo "$i" | sed 's#0##g')"
done

for ((num=1; num<10000; num++)); do
    result="${num//0/}"
done

They both do the same thing (loop through all numbers from 1 to 10,000, then set result to the number with all "0"s removed). On my computer, the first took 36 seconds, while the second took 0.2 seconds. The reason is simple: in the second, everything is done directly in bash, with no need to create additional processes. The first, on the other hand, has to create a subshell (i.e. another process running bash) to run the contents of $( ... ), then another subshell to do the echo, then process running sed to do the substitution. That's three process creations (and exit/cleanups) the computer has to execute every time through the loop. And that's why the first is over 100 times slower than the second.

Consider a third snippet:

TrimZeroes() {
    echo "${1//0/}"
}

for ((num=1; num<10000; num++)); do
    result="$(TrimZeroes "$num")"
done

Looks like a cleaner (better abstracted) version of the second, right? It took 8 seconds, because the $( ... ) required creating a subshell to run TrimZeroes in.

Now, look at one line in your script:

roundTo=$( Log10Float $2 )

This creates a subshell to run Log10Float in. That consists of the single line

echo $( CleanBC "l($1)/l(10)" )

...which creates another subshell to run CleanBC in, which does:

echo "${1/[eE][+][0]/*10^}" | bc -l | sed -e 's#^\(-*\)\.#\10\.#g'

...which creates three more processes, one for each part of the pipeline. That's a total of five processes to take one logarithm!

So, there are a number of things you could do to speed up the script: mostly switching to using bash's builtin string manipulation capabilities, and inlining as many as possible of the function calls. But this'll make the script even messier than it already is, and it'll still be far slower than if it was written in a more appropriate language.

Python is nice. Ruby is nice. Even perl would be much better than shell for this.

Gordon Davisson
  • 118,432
  • 16
  • 123
  • 151