
I have the following chain of `grep`s:

```bash
grep -E '[0-9]{3}\.[0-9]+ ms' file.log | grep -v "Cycle Based" | grep -Ev "[0-9]{14}\.[0-9]+ ms" > pruned.log
```

I need to run this on a 10 GB log file. It's taking a bit longer than I am willing to wait, so I am trying to use GNU parallel, but it's not clear to me how I can execute this chain of greps using parallel.

This is not a question of how to execute the fastest possible single grep; it is about how to execute a series of greps in parallel.

Nick Chapman
  • Possible duplicate of [Fastest possible grep](https://stackoverflow.com/questions/9066609/fastest-possible-grep) – Michael Foukarakis Sep 26 '17 at 14:25
  • @MichaelFoukarakis, not a duplicate. I have already read that question and it is not what I am looking for. – Nick Chapman Sep 26 '17 at 14:31
  • You can put all of your `grep` cmds into a shell script, call it like `myBigGrep.sh file.log`, and replace the filename at the front of the pipe with `${@}` (see the sketch after these comments). `parallel` would require multiple files to process. Are you willing to spend the time to `split` your big file into `file001.log, file002.log ...`? It might pay off, but it will require time to test it. You might better spend your time installing log-rotate, so you have daily (hourly?) log files. Good luck. – shellter Sep 26 '17 at 14:42
  • Also, I think it would be possible to replace your pipeline of `grep`s with just one `awk` process (sketched below). That might be faster, but I'd really be surprised if it was more than a 10% improvement in processing time. Good luck. – shellter Sep 26 '17 at 14:46
  • You don't have a chain of independent `grep`s; the output of one forms the input of the next. The pipeline already runs them in parallel as much as possible. – chepner Sep 26 '17 at 15:15
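
A rough sketch of what the two suggestions above might look like; `myBigGrep.sh` is just the hypothetical name from the comment, and the single-`awk` version assumes an awk (e.g. gawk) that supports `{n}` interval expressions:

```bash
#!/bin/bash
# myBigGrep.sh (hypothetical): apply the same grep chain to whatever files are passed in,
# so it can be called per split file (file001.log, file002.log, ...).
grep -E '[0-9]{3}\.[0-9]+ ms' "${@}" | grep -v "Cycle Based" | grep -Ev "[0-9]{14}\.[0-9]+ ms"
```

```bash
# Single awk process instead of three greps: keep lines matching the "NNN.NNN ms" pattern,
# then drop "Cycle Based" lines and the 14-digit-timestamp variant.
awk '/[0-9]{3}\.[0-9]+ ms/ && !/Cycle Based/ && !/[0-9]{14}\.[0-9]+ ms/' file.log > pruned.log
```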

1 Answer


Usually the limiting factor when grepping a file is the disk. If you have a single disk, then odds are that this will be limiting you.

However, if you have RAID10/50/60 or a distributed network filesystem, then parallelizing may speed up your processing:

```bash
# doit reads log lines on stdin and applies the same chain of greps.
doit() {
    grep -E '[0-9]{3}\.[0-9]+ ms' | grep -v "Cycle Based" | grep -Ev "[0-9]{14}\.[0-9]+ ms"
}
export -f doit

# --pipepart splits file.log into chunks (--block -1: roughly one chunk per jobslot)
# and runs doit on each chunk; -k keeps the output in the original order.
parallel --pipepart -a file.log --block -1 -k doit > pruned.log
```
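
Whether this actually beats the plain pipeline depends on the storage, so it is worth timing both on the same file before committing to it; the output file names below are just illustrative:

```bash
# Compare wall-clock time of the serial pipeline and the parallel variant.
time (grep -E '[0-9]{3}\.[0-9]+ ms' file.log | grep -v "Cycle Based" \
      | grep -Ev "[0-9]{14}\.[0-9]+ ms" > pruned_serial.log)
time (parallel --pipepart -a file.log --block -1 -k doit > pruned_parallel.log)
```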
Ole Tange