I have used scripts to monitor and extract data from log files for years and never questioned the basic toolset that most people take for granted. In particular, grep and awk are used by almost everyone in the community.

I found the current grep bugs (some dating back a few years): http://savannah.gnu.org/bugs/?group=grep

And from the man pages for GNU grep 2.6.3:

Known Bugs

Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.

Back-references are very slow, and may require exponential time.
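
These warnings can be made concrete. As an illustration, the following shell snippet (a hypothetical stress test of my own; the patterns, file names, and limits are arbitrary) guards potentially expensive patterns with timeout and ulimit, since whether a given pattern is actually slow depends on the grep version and the input:

    # Back-references force grep into its slower matcher; run under
    # timeout(1) so a pathological case cannot hang the shell.
    printf 'aaaaaaaaaaaaaaaaaaaaaaaaaaaa\n' > /tmp/sample.txt
    timeout 5 grep -E '(a*)(a*)\1\2\2$' /tmp/sample.txt \
        || echo "grep timed out (status 124) or found no match"

    # Large {n,m} repetition counts may allocate a lot of memory;
    # cap the address space for this subshell before trying one.
    ( ulimit -v 262144   # limit is in KiB, so 256 MiB
      grep -E '(a|b){1,10000}' /tmp/sample.txt )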

And the man pages for GNU Awk 3.1.7:

BUGS

The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.

Syntactically invalid single character programs tend to overflow the parse stack, generating a rather unhelpful message. Such programs are surprisingly difficult to diagnose in the completely general case, and the effort to do so really is not worth it.
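
The first of those is only noting redundancy, not a defect: -F is shorthand for assigning the field separator FS, which can also be set through variable assignment with -v. A trivial illustration (using /etc/passwd, which is colon-delimited):

    # These two invocations are equivalent; -F remains for compatibility.
    awk -F: '{ print $1 }' /etc/passwd
    awk -v FS=: '{ print $1 }' /etc/passwd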

I was interested in the limitations, for example:

  • when using complex regular expressions,
  • when processing extremely large files that are not rotated,
  • when logs are written to thousands of times per hundredth of a second.

Would it just be a case of monitoring the memory usage of the script to make sure it is not using massive amounts of memory?
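
A minimal sketch of that idea, assuming a long-running search whose memory you want to watch (the pattern, log path, and polling interval here are my own placeholders):

    # Start the search in the background and poll its resident set size.
    grep -E 'some-pattern' /var/log/huge.log > matches.txt &
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        ps -o rss= -p "$pid"   # RSS in kilobytes
        sleep 1
    done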

Is it good practice to implement a timeout for scripts that might take a long time to execute?

Are there other good standards and practices that people use when building solutions with these tools?

I found an extremely helpful answer on the Windows equivalent, findstr, which gave me a better understanding of scripting in a Windows environment: What are the undocumented features and limitations of the Windows FINDSTR command?

1 Answer

Both awk and grep open the log file in read-only mode, so there is no risk of the log file being corrupted by simultaneous access from the writing application (write mode) and the awk/grep programs (read-only mode).

awk and grep do consume CPU and memory, and that can affect the application writing to the log file. This impact is the same as for any other process using system resources; grep and awk are no exception. Depending on what the scripts are doing, they can consume a lot of CPU and RAM, and badly written code in any language can cause problems. As suggested in the comments, it is good to constrain the monitoring processes: ulimit and cgroups are the options available for limiting resources. Another good option is timeout, which kills the script if it takes longer than expected.
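
As a sketch of combining those suggestions (the 512 MiB cap, the 60-second budget, and the log path are arbitrary examples, not recommendations):

    # Cap the virtual memory available to the scan (subshell only),
    # and kill it if it runs longer than 60 seconds.
    (
      ulimit -v 524288   # limit is in KiB, so 512 MiB
      timeout 60 awk '/ERROR/ { count++ } END { print count }' /var/log/app.log
    )
    # timeout exits with status 124 when the time limit was hit.
    [ $? -eq 124 ] && echo "log scan exceeded its time budget" >&2

On systemd-based systems, systemd-run can place a command in its own cgroup with properties such as MemoryMax=, which is one way to apply the cgroups approach mentioned above.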

Jay Rajput