I have used scripts to monitor and extract data from log files for years and never questioned the basic toolset that most people take for granted. In particular, grep and awk are used by almost everyone in the community.
I found the current grep bugs (some dating back a few years): http://savannah.gnu.org/bugs/?group=grep
And from the man page for GNU grep 2.6.3:
Known Bugs
Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
Back-references are very slow, and may require exponential time.
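To make the two quoted bugs concrete, this is the kind of pattern I mean (illustrative only, not a benchmark; the inputs are throwaway placeholders):

    # A large repetition count in {n,m} inflates the compiled pattern,
    # which is the case the man page says may use lots of memory:
    printf 'aaaa\n' | grep -E 'a{1,10000}'

    # A back-reference (\1) is the construct the man page warns is very
    # slow and may require exponential time on large or near-matching input:
    printf 'abcabc\n' | grep -E '(abc)\1'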
And the man page for GNU Awk 3.1.7:
BUGS
The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.
Syntactically invalid single character programs tend to overflow the parse stack, generating a rather unhelpful message. Such programs are surprisingly difficult to diagnose in the completely general case, and the effort to do so really is not worth it.
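The first gawk item is really just a note that -F duplicates the command-line variable assignment feature; as I understand it these two invocations are equivalent (using /etc/passwd purely as a handy colon-delimited example):

    # Set the field separator with -F ...
    awk -F: '{ print $1 }' /etc/passwd

    # ... or with a command-line variable assignment of FS
    awk -v FS=: '{ print $1 }' /etc/passwd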
I was interested in the limitations, for example:
- when using complex regular expressions,
- extremely large files that are never rotated,
- logs that are written to thousands of times per hundredth of a second (a streaming approach for the last two cases is sketched below).
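For the last two cases, the approach I have in mind is to stream the log rather than re-scan the whole file, so memory stays flat however large the file grows. A minimal sketch, assuming a GNU userland and with /var/log/app.log and the ERROR pattern as placeholders:

    # Follow the log by name (-F reopens it if it is recreated), keep
    # grep's output line-buffered so awk sees matches immediately, and
    # let awk do the per-line extraction:
    tail -F /var/log/app.log \
      | grep --line-buffered 'ERROR' \
      | awk '{ print $1, $NF }'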
Would it just be a case of monitoring the script's memory usage to make sure it does not balloon?
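The simplest ways I know of to do that are GNU time's peak-RSS report and a ulimit cap, so a runaway pattern fails instead of exhausting the whole machine (big.log and the pattern are placeholders):

    # Report peak resident set size after the run
    # (this is GNU time at /usr/bin/time, not the shell builtin):
    /usr/bin/time -v grep -E 'a{1,10000}' big.log

    # Or cap the virtual memory grep may use (~512 MiB here) inside a
    # subshell so only this command is affected; grep then aborts with a
    # memory-exhausted error rather than growing without bound:
    ( ulimit -v 524288; grep -E 'a{1,10000}' big.log )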
Is it good practice to implement a timeout for scripts that might take a long time to execute?
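If so, coreutils already ships a timeout command, so the wrapper can stay simple (30s and the pattern/file are arbitrary examples):

    # Kill grep if it runs for more than 30 seconds; timeout exits with
    # status 124 when the time limit is what stopped the command:
    timeout 30s grep -E 'some-pattern' big.log
    if [ "$?" -eq 124 ]; then
        echo "grep exceeded 30s and was terminated" >&2
    fi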
Are there other good standards and structures that people use when building solutions with these tools?
I found an extremely helpful answer on the equivalent FINDSTR command that gave me a better understanding of scripting in a Windows environment: What are the undocumented features and limitations of the Windows FINDSTR command?