
When I do this with awk it's relatively fast, even though it's Row By Agonizing Row (RBAR). I tried to write a quicker, more elegant, bug-resistant solution in Bash that would need far fewer passes through the file. With the code below, bash takes about 10 seconds just to get through the first 1,000 lines. In roughly the same time, awk can make 25 passes through all million lines of the file! Why is bash several orders of magnitude slower?

  while read line
    do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`

    if [ "$MAIN_REF" == "$FIELD_1" ]; then
      #echo "$line"
      if [ "$FIELD_2" == "$REF_1" ]; then
         ((REF_1_COUNT++))
      fi

      ((LINE_COUNT++))

      if [ "$LINE_COUNT" == "1000" ]; then
        echo $LINE_COUNT;
      fi
    fi
done < temp/refmatch
codeforester
Jon17
    Because `read` is slow - https://stackoverflow.com/questions/13762625/bash-while-read-line-extremely-slow-compared-to-cat-why Why not write this in Python? There's nothing here that needs to be bash-specific. – Chase Oct 26 '18 at 03:26
    Your code creates too many processes and that would make it even slower. awk or python is your best bet. – codeforester Oct 26 '18 at 03:29

2 Answers


Bash is slow. That's just the way it is; it's designed to oversee the execution of specific tools, and it was never optimized for performance.

All the same, you can make it less slow by avoiding obvious inefficiencies. For example, read will split its input into separate words, so it would be both faster and clearer to write:

while read -r field1 field2 rest; do
  # Do something with field1 and field2

instead of

while read line
    do
    FIELD_1=`echo "$line" | cut -f1`
    FIELD_2=`echo "$line" | cut -f2`

Your version sets up two pipelines and creates four children (at least) for every line of input, whereas using read the way it was designed requires no external processes whatsoever.
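
If you want to see the cost for yourself, you can time both loop styles on a small sample of the file (the sample path below is just an illustration):

head -n 1000 temp/refmatch > /tmp/refmatch.sample

# At least two child processes per pipeline, per line
time while read -r line; do
  f1=$(echo "$line" | cut -f1)
  f2=$(echo "$line" | cut -f2)
done < /tmp/refmatch.sample

# No external processes at all
time while read -r f1 f2 rest; do
  :   # fields are already split into f1, f2, rest
done < /tmp/refmatch.sample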

If you are using cut because your lines are tab-separated and not just whitespace-separated, you can achieve the same effect with read by setting IFS locally:

while IFS=$'\t' read -r field1 field2 rest; do
  # Do something with field1 and field2

Even so, don't expect it to be fast. It will just be less agonizingly slow. You would be better off fixing your awk script so that it doesn't require multiple passes. (If you can do that with bash, it can be done with awk and probably with less code.)

Note: I set three variables rather than two, because read puts the rest of the line into the last variable. If there are only two fields, no harm is done; setting a variable to an empty string is something bash can do reasonably rapidly.
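
For comparison, a single-pass awk sketch of the counting logic in your loop (assuming the fields are tab-separated and that MAIN_REF and REF_1 are shell variables, as in your script) could look something like:

awk -F'\t' -v main_ref="$MAIN_REF" -v ref_1="$REF_1" '
  $1 == main_ref {
    line_count++
    if ($2 == ref_1) ref_1_count++
  }
  END { print line_count + 0, ref_1_count + 0 }
' temp/refmatch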

rici

As @codeforester points out, the original bash script spawns too many subprocesses, which is what makes it so slow.
Here's a modified version that minimizes that overhead:

#!/bin/bash

while IFS=$'\t' read -r FIELD_1 FIELD_2 others; do

  if [[ "$MAIN_REF" == "$FIELD_1" ]]; then
    #echo "$line"
    if [[ "$FIELD_2" == "$REF_1" ]]; then
      let REF_1_COUNT++
    fi

    let LINE_COUNT++
      echo "$LINE_COUNT"

    if [[ "$LINE_COUNT" == "1000" ]]; then
      echo "$LINE_COUNT"
    fi
  fi
done < temp/refmatch

It runs more than 20 times faster than the original, but I'm afraid that's about the limit of what a bash script can do.
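
Note that the script expects MAIN_REF and REF_1 to already be set; one way to supply them (the script name and values here are just placeholders) is:

MAIN_REF="some_ref" REF_1="other_ref" bash count_refs.sh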

tshiono