
Background

I work for a research institute that studies storm surges computationally, and I am attempting to automate some of the HPC commands using Bash. Currently the process is this: we download the data from NOAA and create the command file manually, line by line, entering the location of each file along with a time at which the program should read the data from that file and a wind magnification factor. Each download NOAA produces contains hundreds of these data files, and downloads come out every 6 hours or so while a storm is in progress. This means that much of our time during a storm is spent making these command files.

Problem

I am limited in the tools I can use to automate this process because I simply have a user account and a monthly allotment of time on the supercomputers; I do not have the privilege to install new software on them. Plus, some of them are Crays, some are IBMs, some are HPs, and so forth. There isn't a consistent operating system between them; the only similarity is they are all Unix-based. So I have at my disposal tools like Bash, Perl, awk, and Python, but not necessarily tools like csh, ksh, zsh, bc, et cetera:

$ bc
-bash: bc: command not found

Further, my lead scientist has requested that all of the code I write for him be in Bash because he understands it, with minimal calls to external programs for things Bash cannot do. For example, it cannot do floating point arithmetic, and I need to be able to add floats. I can call Perl from within Bash, but that's slow:

$ time perl -E 'printf("%.2f", 360.00 + 0.25)'
360.25
real    0m0.052s
user    0m0.015s
sys     0m0.015s

1/20th of a second doesn't seem like a long time, but when I have to make this call 100 times in a single file, that equates to about 5 seconds to process one file. That isn't so bad when we are only making one of these every 6 hours. However, if this work is abstracted to a larger assignment, one where we point 1,000 synthetic storms at the Atlantic basin at one time in order to study what could have happened had the storm been stronger or taken a different path, 5 seconds quickly grows to more than an hour just to process text files. When you are billed by the hour, this poses a problem.

Question

What is a good way to speed this up? I currently have this for loop in the script (the one that takes 5 seconds to run):

for FORECAST in $DIRNAME; do
    echo $HOURCOUNT"  "$WINDMAG"  "${FORECAST##*/} >> $FILENAME;
    HOURCOUNT=$(echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}');
done

I know a single call to awk or Perl to loop through the data files would be a hundred times faster than calling either once for each file in the directory, and that these languages can easily open a file and write to it, but the problem I am having is getting data back and forth. I have found a lot of resources on these three languages alone (awk, Perl, Python), but haven't been able to find as much on embedding them in a Bash script. The closest I have been able to come is to make this shell of an awk command:

awk -v HOURCOUNT="$HOURCOUNT" -v INCREMENT="$INCREMENT" -v WINDMAG="$WINDMAG" -v DIRNAME="$DIRNAME" -v FILENAME="$FILENAME" 'BEGIN{ for (FORECAST in DIRNAME) do
    ...
}'

But I am not certain that this is correct syntax; or, if it is, whether it's the best way to go about this, or whether it will even work at all. I have been hitting my head against the wall for a few days now and decided to ask the internet before I plug on.

halfer
Jonathan E. Landrum
  • If you have Perl and Python available, why don't you write your scripts entirely in them? The inefficiency you saw comes from having to start up the entire Perl interpreter just for one statement. If you have a Perl script with 50-100 lines, it will be very efficient because the startup and parsing cost is amortized. – Barmar Jul 02 '14 at 19:04
  • Because the work is already done, besides the inefficiency. I would have to start over. Further, my PI prefers I write this in Bash. I will edit the question to include that information. – Jonathan E. Landrum Jul 02 '14 at 19:06
  • One possibility is to start up a Perl coprocess. Then you can feed floating point expressions to it and it will send back the result. – Barmar Jul 02 '14 at 19:26
  • @Barmar wow, that actually looks like a great idea. I'd never heard of that before. I will try that and comment back, but it sounds like exactly what I need. – Jonathan E. Landrum Jul 02 '14 at 19:31
  • Can you use a template to aggregate many/all of the data in these files into fewer external calls? for example, create an array of the contents of multiple files in some structured format, then call perl with the contents of the array. – dawg Jul 02 '14 at 20:06
  • bash allows loadable modules -- if you want to add a new builtin that does the floating-point math you need, you can write that in C and load it at runtime. – Charles Duffy Jul 02 '14 at 20:15
  • *"When you are billed by the hour, this poses a problem."* I think you may be able to make a good [business case](http://en.wikipedia.org/wiki/Business_case) to your PI for using Perl or Python. – Andrew Morton Jul 02 '14 at 20:18
  • @halfer, yes, PI is Principal Investigator, the lead scientist on a project. – Jonathan E. Landrum Jul 02 '14 at 20:37
  • I wonder if you can write Perl scripts that look enough like bash that your PI will be able to understand them. In fact it might be easier to understand than a bash script littered with stuff like `perl -E 'printf("%.2f", 360.00 + 0.25)'`. – David K Jul 02 '14 at 21:10
  • Using the time data I gathered, I was able to convince my PI to let me rewrite the script in Python. The time to execute went from about 5 seconds down to about a third of a second. Thank you all for nudging me enough to rewrite it using the correct tool for the job. – Jonathan E. Landrum Jul 07 '14 at 20:16
  • Has the PI ever looked at Python code? Writing numerical computation in Bash syntax makes about as much sense as writing an operating system in Fortran. – Jeff Hammond Jan 05 '15 at 05:59
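Barmar's coprocess suggestion from the comments can be sketched roughly as follows. This is only a sketch: it assumes bash 4+ (for the `coproc` keyword) and an awk that supports `fflush`, and the `add_floats` helper name is made up for illustration. The point is that one awk process starts once and then services every addition over a pair of pipes, instead of a new interpreter being forked per call.

```shell
#!/bin/bash
# Start ONE awk process that reads "a b" pairs on stdin and prints
# each sum; talk to it through the coprocess file descriptors.
coproc CALC { awk '{ printf "%.2f\n", $1 + $2; fflush() }'; }

add_floats() {
    printf '%s %s\n' "$1" "$2" >&"${CALC[1]}"   # send the operands
    read -r RESULT <&"${CALC[0]}"               # read the sum back
}

add_floats 360.00 0.25
echo "$RESULT"    # prints 360.25, with no per-call startup cost
```

Whether this beats a single batched awk pass depends on the workload, but it keeps the surrounding logic in bash, which is the constraint here.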

3 Answers


Bash is very capable as long as the capabilities you need are available. For floating point you basically have two options: either bc (which, at least on the box you show, isn't installed [which is kind of hard to believe]) or calc. calc-2.12.4.13.tar.bz2

Both are flexible, very capable floating-point programs that integrate well with bash. Since the powers that be have a preference for bash, I would investigate installing either bc or calc. (Job security is a good thing.)

If your superiors can be convinced to allow either perl or python, then either will do. If you have never programmed in either, both will have a learning curve, python slightly more so than perl. If your superiors can read bash, then perl would be much easier for them to digest than python.

This is a fair outline of the options you have given your situation as you've explained it. Regardless of your choice, the task for you should not be that daunting in any of the languages. Just drop a line back when you get stuck.

David C. Rankin
  • Yes, I wonder if searching each box for `bc` might be worthwhile - could this just have dropped out of the path? – halfer Jul 02 '14 at 20:24
  • I would have much preferred to write all of this in csh; that is the shell I am most comfortable in. I will ask to see if we can get bc installed on the machine I tested. – Jonathan E. Landrum Jul 02 '14 at 20:35
  • It is usually installed in `/usr/bin` so unless you have completely lost your executable path typing `bc` should work. If for some strange reason, the permissions on `bc` are mucked up, it would not show as executable, but a `ls -al` of your executable path would find it. Check your path with `set | grep ^PATH` and go from there. – David C. Rankin Jul 02 '14 at 20:35
  • Using the time data I gathered, I was able to convince my PI to let me rewrite the script in Python. The time to execute went from about 5 seconds down to about a third of a second. Thank you for nudging me enough to rewrite it using the correct tool for the job. – Jonathan E. Landrum Jul 07 '14 at 20:16

Starting awk or another command just to do a single addition is never going to be efficient. Bash can't handle floats, so you need to shift your perspective. You say you only need to add floats, and I gather these floats represent a duration in hours. So use seconds instead.

for FORECAST in $DIRNAME; do
    printf "%d.%02d  %s  %s\n" \
        $((SECONDCOUNT / 3600)) \
        $(((SECONDCOUNT % 3600) * 100 / 3600)) \
        "$WINDMAG" \
        "${FORECAST##*/}" >> "$FILENAME"

    SECONDCOUNT=$((SECONDCOUNT + SECONDS_INCREMENT))
done

(printf is standard and much nicer than echo for formatted output)

EDIT: Abstracted as a function and with a bit of demonstration code:

function format_as_hours {
    local seconds=$1
    local hours=$((seconds / 3600))
    local fraction=$(((seconds % 3600) * 100 / 3600))
    printf '%d.%02d' $hours $fraction
}

# loop for 0 to 2 hours in 5 minute steps
for ((i = 0; i <= 7200; i += 300)); do
    format_as_hours $i
    printf "\n"
done
pdw
  • Would this not pose a problem if `$((SECONDCOUNT / 3600))` was fractional? – Jonathan E. Landrum Jul 02 '14 at 19:40
  • Bash will discard any fractional part, just like integer division in C. – pdw Jul 02 '14 at 19:47
  • Then that doesn't do me any favors. I have to maintain the fractional component. – Jonathan E. Landrum Jul 08 '14 at 16:07
  • That was what I was trying to show -- if you choose your base unit small enough (seconds, milliseconds, whatever), you can do fractional calculations using only integers, and still output the results as a proper floating point number. – pdw Jul 08 '14 at 16:17
  • I can certainly see that, and seconds works perfectly. However, when converting back to hours (which the software we use depends on; this isn't our design) won't it discard the fractional component? – Jonathan E. Landrum Jul 08 '14 at 16:18
  • Doesn't my example demonstrate that it works? Basically, first I calculate the whole part, then in a separate calculation the fractional part, and finally I use printf to print it with proper formatting. // Though I see from your other comments that you've already rewritten the program in Python. That's certainly the technically better solution. I just wanted to show that, if it was unavoidable, the task could be done using only integer arithmetic. – pdw Jul 08 '14 at 16:49
  • Yes, after reading your edit I can see that you were right all along. My bad. Bash isn't my strong suit. – Jonathan E. Landrum Jul 08 '14 at 16:56

If all these computers are Unices, and they are expected to perform floating-point computations, then each of them must have some floating-point-capable program available. So try a compound command along the lines of:

bc -l some-comp || dc some-comp || ... || perl some-comp
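That fallback idea could be wrapped once at startup rather than re-probed on every call. A sketch; the `fadd` name and the probe order are made up for illustration, and it assumes at least one of bc, awk, or perl exists on each box:

```shell
# Probe once for a float-capable tool, then define fadd accordingly,
# so the rest of the script just calls fadd regardless of the machine.
if command -v bc >/dev/null 2>&1; then
    fadd() { printf '%s + %s\n' "$1" "$2" | bc -l; }
elif command -v awk >/dev/null 2>&1; then
    fadd() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a + b }'; }
else
    fadd() { perl -e 'printf "%.2f\n", $ARGV[0] + $ARGV[1]' "$1" "$2"; }
fi

fadd 360.00 0.25    # prints 360.25 whichever tool was found
```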

user985675
  • Or perhaps even `echo "$HOURCOUNT $INCREMENT" | awk '{printf "%.2f", $1 + $2}'`. The problem is time. I can pipe output from awk into Bash, but it takes forever. – Jonathan E. Landrum Jul 02 '14 at 19:34