
I have big log files (1-2 GB and more). I'm new to programming, and bash is useful and easy for me. When I need something, I can usually get it done (sometimes with help from people here). Simple scripts work fine, but when I need complex operations, everything runs very slowly, maybe because bash is slow or maybe because my programming skills are bad.

So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?

If I just need optimization, how can I check which parts of my code are slow and which are fine?


For example, I have this while-do loop:

  while read -r date month size
  do
    ...
    ...
  done < file.tmp

How can I use awk to make this run faster?

onur
  • See this similar question on bash script profiling: http://stackoverflow.com/questions/5014823/how-to-profile-a-bash-shell-script – SMA Nov 09 '14 at 11:02
  • Of course, you're making sure that your Bash style is already optimal: avoiding unnecessary subshells, not reading files more times than necessary, using a good algorithm, using builtins instead of external commands as much as possible, etc. – gniourf_gniourf Nov 09 '14 at 11:15
  • How can I check this if I don't know what is bad and what is good? Is there any tool or something? – onur Nov 09 '14 at 11:57
  • What are the commands you are using, so that we can determine which command is slower? – Sriharsha Kalluru Nov 09 '14 at 14:55
  • Yeah, like Sriharsha Kalluru said, you need to tell us what those `...` are for us to decide which way is better (and how to convert it to awk). – Robin Hsu Nov 10 '14 at 09:24

1 Answer


That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.

This function does what Bash was meant for: being control logic for calling other utilities.

sumlines_fast() {
   awk '{n += $1} END {print n}'
}

It runs in 0.5 seconds on a million-line file. That's the kind of bash code you can use very effectively for larger files.
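
If you want to try that yourself, a minimal test could look like this (numbers.txt is just an example file name, not something from the question):

# generate a million-line file of integers
seq 1000000 > numbers.txt

# sum it with the awk-based function
sumlines_fast < numbers.txt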


Meanwhile, this function does what Bash is not intended for: being a general-purpose programming language:

sumlines_slow() {
   local i=0
   while IFS= read -r line
   do
     (( i += $line ))
   done
   echo "$i"
}

This function is slow, and takes 30 seconds to sum the same million-line file. You should not be doing this for larger files.
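
To measure this kind of difference yourself (and, more generally, to find out which part of a script is slow), you can wrap each candidate in the time keyword. Assuming the numbers.txt test file from above:

time sumlines_fast < numbers.txt    # about half a second in this answer's test
time sumlines_slow < numbers.txt    # about 30 seconds in this answer's test

Your absolute timings will differ, but the ratio should be similar.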


Finally, here's a function that could have been written by someone who has no understanding of bash at all:

sumlines_garbage() {
   i=0
   for f in `cat`
   do
     i=`echo $f + $i | bc`
   done
   echo $i 
}

It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.
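
Coming back to the loop in the question: the `...` body isn't shown, so the following is only a hypothetical sketch that assumes the loop totals the third field (size) of file.tmp; adapt it to whatever the real body does. The pure-bash version

total=0
while read -r date month size
do
  (( total += size ))
done < file.tmp
echo "$total"

can usually be replaced by a single awk call:

awk '{total += $3} END {print total}' file.tmp

The general rule is the same as above: do the per-line work inside one awk process instead of running the loop body once per line in bash.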

that other guy