I'm operating on large data sets which typically have an average value and an uncertainty. Typically when published only a single digit of uncertainty is shown and the corresponding value is rounded to that decimal place. The uncertainty is then wrapped in parentheses and appended to the shortened average string.
For instance:
Avg: 101.0513213 SD: 0.33129
...would give:
101.1(3)
In practice this sounds simple enough, but it actually gets somewhat complex given that you have to first calculate the single digit standard deviation, and then use that to determine what decimal digit you're rounding the average value to. Add in the case of rounding to 10
(i.e. 0.094
rounds to 0.09
, but 0.095
rounds to 0.1
changing the digit to round to) and the fact that you're rounding and not truncating and it's somewhat onerous to implement in principle.
I have a set of BASH Script
functions that pulls it off using a mix of printf
, bc
, sed
, echo
calls. It works, but the resulting calculations are very slow. Here's a sample that you can try for yourself. You should be able to see how slow it goes:
#!/bin/bash
function CleanBC() {
echo "${1/[eE][+][0]/*10^}" | bc -l | \
sed -e 's#^\(-*\)\.#\10\.#g'
}
function Log10Float() {
echo $( CleanBC "l($1)/l(10)" )
}
function TruncateDecimal() {
echo $1 | sed -e 's#\.[0-9]*##g'
}
function PowerOf10() {
absPow=$( echo $1 | sed 's#-##g' )
if [[ $( CleanBC "$1==0" ) -eq '1' ]]; then
echo "1"
elif [[ $( CleanBC "$1>0" ) -eq '1' ]]; then
echo "1"$(printf '0%.0s' $( seq 1 $absPow ) )
elif [[ $( CleanBC "$1==-1" ) -eq '1' ]]; then
echo "0.1"
elif [[ $( CleanBC "$1<-1" ) -eq '1' ]]; then
echo "0."$(printf '0%.0s' $( seq 2 $absPow ) )"1"
fi
}
function RoundArbitraryDigit() {
pow=$( PowerOf10 $2)
offset=$( CleanBC "if ($1>=0) {0.5} else {-0.5}" )
absPow=$( echo $2 | sed -e 's#-##g' )
invPow=$( PowerOf10 $( CleanBC "$2*-1" ) )
shiftedVal=$( TruncateDecimal $( CleanBC "$invPow*$1+$offset" ) )
val=$( CleanBC "scale=15;$shiftedVal*$pow" )
echo $( printf "%.${absPow}f" $val )
}
function Flt2Int() {
RoundArbitraryDigit $1 0
}
function Round() {
for v in ${@:3}; do
div=$( CleanBC "$v / $2" )
case $( echo $1 | tr '[:lower:]' '[:upper:]' ) in
CLOSEST)
val=$( TruncateDecimal $( Flt2Int $div ) );
;;
UP)
val=$( TruncateDecimal $div );
((val++))
;;
DOWN)
val=$( TruncateDecimal $div );
;;
esac
echo $( CleanBC "$val * $2" )
done
}
function Tabulate() {
roundTo=$( Log10Float $2 )
roundTo=$( CleanBC "if ($roundTo < 0) {$roundTo -1} else {$roundTo}" )
roundTo=$( TruncateDecimal $roundTo )
roundedSD=$( RoundArbitraryDigit $2 $roundTo )
invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
if [[ $( CleanBC "($invPow * $roundedSD) == 10" ) -eq '1' ]]; then
((roundTo++))
roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
invPow=$( PowerOf10 $( CleanBC "$roundTo*-1" ) )
fi
intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
pow=$( PowerOf10 $roundTo )
intSD=$( CleanBC "if ($pow > 10 ) {$pow*$intSD} else {$intSD}" )
val="$( RoundArbitraryDigit $1 $roundTo )"
if [[ $( CleanBC "$roundTo > -1" ) -eq '1' ]]; then
val=$( echo $val | sed -e 's#\.0*$##g' )
fi
echo "$val(${intSD})"
}
Tabulate '-.9782000' '0.0051335'
Tabulate '105.843516' '8.7571141'
Tabulate '0.2581699' '0.0020283'
Tabulate '3.4368211' '0.0739912'
My first question is whether a specific function or chunk of code is slowing the overall calculation more than the rest.
Second, I wanted to ask advice regarding how to improve the speed of the overall code with that first answer in mind.
Third, as a more general question, I'm wondering in cases like this what tools can be used to profile a bash script and identify bottlenecks.
(Note: the CleanBC
function is used because sometimes other related functions in the use case generate scientific notation numbers, i.e. 2.41321E+05
, etc. Hence this function is necessary to keep bc
from failing -- an additional requirement in my use case.)
Thanks to the advice of @Gordon Davisson I have improved my script. To time it better and to detect another fringe case, I modified the closing lines of the old script to:
function test() {
Tabulate '-.9782000' '0.0051335'
Tabulate '105.843516' '8.7571141'
Tabulate '1055.843516' '85.7571141'
Tabulate '0.2581699' '0.0020283'
Tabulate '3.4368211' '0.0739912'
}
time test
With the old script I get:
real 0m12.627s
user 0m3.150s
sys 0m9.282s
The new script is:
#!/bin/bash
function CleanBC() {
val=$(bc -l <<< "${1/*[eE][+][0]/*10^}")
if [[ $val == -* ]]; then
echo "${val/#-./-0.}"
else
echo "${val/#./0.}"
fi
}
function Log10Float() {
CleanBC "l($1)/l(10)"
}
function TruncateDecimal() {
echo ${1/.*/}
}
function PowerOf10() {
case $1 in
10) echo "10000000000" ;;
9) echo "1000000000" ;;
8) echo "100000000" ;;
7) echo "10000000" ;;
6) echo "1000000" ;;
5) echo "100000" ;;
4) echo "10000" ;;
3) echo "1000" ;;
2) echo "100" ;;
1) echo "10" ;;
0) echo "1" ;;
-1) echo "0.1" ;;
-2) echo "0.01" ;;
-3) echo "0.001" ;;
-4) echo "0.0001" ;;
-5) echo "0.00001" ;;
-6) echo "0.000001" ;;
-7) echo "0.0000001" ;;
-8) echo "0.00000001" ;;
-9) echo "0.000000001" ;;
-10) echo "0.0000000001" ;;
esac
}
function RoundArbitraryDigit() {
pow=$( PowerOf10 $2 )
absPow=$2;
absPow=${absPow/#-/}
if [[ $1 == -* ]]; then
offset=-0.5
else
offset=0.5
fi
if [[ $2 == -* ]]; then
invPow=$( PowerOf10 $absPow )
elif [[ $2 == 0 ]]; then
invPow="1"
else
invPow=$( PowerOf10 "-${2}" )
fi
shiftedVal=$( CleanBC "$invPow*$1+$offset" )
shiftedVal=${shiftedVal/.*/}
val=$( CleanBC "scale=15;$shiftedVal*$pow" )
#printf "%.${absPow}f" $val
echo $val
}
function Flt2Int() {
RoundArbitraryDigit $1 0
}
function Round() {
for v in ${@:3}; do
div=$( CleanBC "$v / $2" )
case "${1^^}" in
CLOSEST)
val=$( Flt2Int $div );
;;
UP)
#truncate the decimal
val=${div/.*/}
((val++))
;;
DOWN)
#truncate the decimal
val=${div/.*/}
;;
esac
CleanBC "$val * $2"
done
}
function Tabulate() {
roundTo=$( Log10Float $2 )
if [[ $roundTo == -* ]]; then
roundTo=$( CleanBC "$roundTo -1" )
fi
roundTo=${roundTo/.*/}
roundedSD=$( RoundArbitraryDigit $2 $roundTo )
if [[ $roundTo == -* ]]; then
invPow=$( PowerOf10 ${roundTo/#-/} )
elif [[ $roundTo == 0 ]]; then
invPow="1"
else
invPow=$( PowerOf10 "-${roundTo}" )
fi
if [[ $( CleanBC "($invPow * $roundedSD)" ) == "10" ]]; then
((roundTo++))
roundedSD=$( RoundArbitraryDigit $roundedSD $roundTo )
if [[ $roundTo == -* ]]; then
invPow=$( PowerOf10 ${roundTo/#-/} )
elif [[ $roundTo == 0 ]]; then
invPow="1"
else
invPow=$( PowerOf10 "-${roundTo}" )
fi
fi
intSD=$( CleanBC "($invPow * $roundedSD)" | sed -e 's#\.[0-9]*##g' )
pow=$( PowerOf10 $roundTo )
if [[ $pow != 0.* ]] && [[ $pow != "1" ]]; then
intSD=$( CleanBC "$pow*$intSD" )
fi
val="$( RoundArbitraryDigit $1 $roundTo )"
if [[ $roundTo != -* ]]; then
echo "${val/.*/}(${intSD})"
else
echo "${val}(${intSD})"
fi
}
function test() {
Tabulate '-.9782000' '0.0051335'
Tabulate '105.843516' '8.7571141'
Tabulate '1055.843516' '85.7571141'
Tabulate '0.2581699' '0.0020283'
Tabulate '3.4368211' '0.0739912'
}
time test
The key difference here is that I eliminated some superfluous shell calls, replaced others with string operations (including eliminating the bc
based conditional logic). The new time is:
real 0m2.566s
user 0m0.605s
sys 0m1.619s
That's roughly a five times speedup!
While I'm still thinking of porting to a Python script (as he suggested), for now I'm pretty pleased with my result, which reduces my tabulation script runtime from around 5 hours to about an hour.