I have a shell script which is invoked every hour via a cron job to search through the Asterisk logs and give me the unique IDs of calls that ended with cause 31.

while read ref
do
cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> cause_temp.log
done < callref.log

The issue is that the while loop is too slow, and for accuracy I have included four while loops like the one above to perform various checks.

The callref.log file consists of call identifier values; every hour it has about 50-90 thousand values, and the script takes about 45-50 minutes to complete execution and email me the report.

It would be of great help if I could cut down the execution time of the loops. Since sample.log is about 20 GB and the file is opened and searched once per loop iteration, I figured that the while loop is the bottleneck here.

I have done some research and found useful links like Link 1 and Link 2, but the solutions suggested there I either cannot implement or do not know how to. Any suggestion would be helpful. Thanks.

Since sample.log contains sensitive information I am not able to share any real logs, but below are some sample log lines which I got from the internet.

Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"503"<sip:503@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:03:13 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"502"<sip:502@192.168.1.107>' failed for '192.168.1.137' - Wrong password
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"1737245082"<sip:1737245082@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Dec 16 18:04:49 asterisk1 asterisk[31774]: NOTICE[31787]: chan_sip.c:11242 in handle_request_register: Registration from '"100"<sip:100@192.168.1.107>' failed for '192.168.1.137' - Username/auth name mismatch
Jun 27 18:09:47 host asterisk[31774]: ERROR[27910]: chan_zap.c:10314 setup_zap: Unable to register channel '1-2'
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:414 __load_resource: chan_zap.so: load_module failed, returning -1
Jun 27 18:09:47 host asterisk[31774]: WARNING[27910]: loader.c:554 load_modules: Loading module chan_zap.so failed!

The file callref.log consists of a list of lines which look like:

C-001ec22d
C-001ec23d
C-001ec24d
C-001ec31d
C-001ec80d

The desired output of the above while loop looks like C-001ec80d.

Also, my main concern is to make the while loop run faster: for example, load all the values of callref.log into an array and search for all of them simultaneously in a single pass of sample.log, if possible.
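
Something along these lines is roughly what I have in mind, though I do not know whether it is correct (untested sketch; -F makes grep treat each callref.log line as a fixed string so all of them can be matched in one pass):

# untested sketch: filter on the static string once, extract the ids,
# then match every value from callref.log in a single pass
grep 'got hangup request, cause 31' sample.log \
  | grep -oE 'C-[0-9a-z]{8}' \
  | grep -Ff callref.log >> cause_temp.log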

Nishant Singh
  • Might be worth looking into eg. the `-F` flag for grep, which may improve performance on the first two greps as you're using fixed strings (but don't use it for the last one). There's some good tips [here](https://stackoverflow.com/questions/13913014/grepping-a-huge-file-80gb-any-way-to-speed-it-up) which should help. – hnefatl Aug 08 '17 at 11:37
  • you mean you cannot use awk? – Maurice Perry Aug 08 '17 at 11:39
  • 1
    How about you post some `sample.log` and `callref.log` and expected output and I believe we might be of help to you. – James Brown Aug 08 '17 at 11:43
  • @hnefatl - it's not the query taking time, it's the while loop that is time-consuming – Nishant Singh Aug 08 '17 at 12:01
  • @Maurice Perry - I can use awk, but I am not familiar with it; if you can suggest any ideas via an example – Nishant Singh Aug 08 '17 at 12:47
  • @James Brown have edited the question as requested – Nishant Singh Aug 08 '17 at 12:48
  • If you can't post real data, we don't mind. Just post some mock-up data that works with that callref.log. We don't like to invent test data ourselves since your next question is how to convert it to your data. – James Brown Aug 08 '17 at 12:56

2 Answers


I spent a day building a test framework and testing variations of different commands, and I think you already have the fastest one.

Which leads me to think that if you want better performance you should look into a log-digesting framework like OSSEC (where your log samples came from) or perhaps Splunk. Those may be too clumsy for your needs. Alternatively, you could design and build something better suited to parsing in Java/C/Perl/awk.

Running your existing script more frequently will also help.

Good luck! If you like I can box up the work I did and post it here, but I think it's overkill.

As requested: CalFuncs.sh, a library I source in most of my scripts:

#!/bin/bash

LOGDIR="/tmp"
LOG=$LOGDIR/CalFunc.log
[ ! -d "$LOGDIR" ] && mkdir -p $(dirname $LOG)

SSH_OPTIONS="-o StrictHostKeyChecking=no -q -o ConnectTimeout=15"
SSH="ssh $SSH_OPTIONS -T"
SCP="scp $SSH_OPTIONS"
SI=$(basename $0)

Log() {
    echo "`date` [$SI] $@" >> $LOG
}

Run() {
    Log "Running '$@' in '`pwd`'"
    "$@" 2>&1 | tee -a $LOG    # quoting "$@" preserves arguments containing spaces
}

RunHide() {
    Log "Running '$@' in '`pwd`'"
    "$@" >> $LOG 2>&1
}

PrintAndLog() {
    Log "$@"
    echo "$@"
}

ErrorAndLog() {
    Log "[ERROR] $@ "
    echo "$@" >&2
}

showMilliseconds(){
  date +%s   # note: despite the name, this returns whole seconds
}

runMethodForDuration(){
  local startT=$(showMilliseconds)
  $1
  local endT=$(showMilliseconds)
  local totalT=$((endT-startT))
  PrintAndLog "that took $totalT seconds to run $1"
  echo $totalT
}
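
The library is meant to be sourced; a minimal usage example (myTask is just a placeholder name):

#!/bin/bash
source "./CalFuncs.sh"

myTask(){ sleep 2; }           # any function you want timed
runMethodForDuration myTask    # logs and prints the elapsed seconds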

genCallRefLog.sh - generates a fictitious callref.log whose size depends on the argument:

#!/bin/bash
#Script to make N sequential lines of callref.log; this should suffice for a POC
if [ -z "$1" ] ; then
  echo "genCallRefLog.sh requires an integer of the number of lines to pump out of callref.log"
  exit 1
fi
file="callref.log"
[ -f "$file" ] && rm -f "$file"  # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of callref.log"
for ((  a=i ;  a < j;  a++  ))
do
  printf 'C-%08x\n' "$a" >> $file
done

genSampleLog.sh - generates a fictitious sample.log whose size depends on the argument:

#!/bin/bash
#Script to make N sequential lines of sample.log; this should suffice for a POC
if [ -z "$1" ] ; then
  echo "genSampleLog.sh requires an integer of the number of lines to pump out of sample.log"
  exit 1
fi
file="sample.log"
[ -f "$file" ] && rm -f "$file"  # del file if exists
i=0 #put start num in here
j="$1" #put end num in here
echo "building $j lines of sample.log"
for ((  a=i ;  a < j;  a++  ))
do
  printf 'Dec 16 18:02:04 asterisk1 asterisk[31774]: NOTICE[31787]: C-%08x got hangup request, cause 31\n' "$a" >> $file
done

And finally, the actual test script I used. Often I would comment out the generator scripts, as they only need to run when the log size changes. I also typically ran only one testing function at a time and recorded the results.

test.sh

#!/bin/bash
source "./CalFuncs.sh"

targetLogFile="cause_temp.log"
Log "Starting"

checkTargetFileSize(){
  expectedS="$1"
  hasS=$(wc -l < "$targetLogFile")
  if [ "$expectedS" != "$hasS" ] ; then
    ErrorAndLog "Got $hasS but expected $expectedS, when inspecting $targetLogFile"
    exit 244
  fi
}

standard(){
  iter=0
  while read ref
  do
    cat sample.log | grep "$ref" | grep 'got hangup request, cause 31' | grep -o 'C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]' >> $targetLogFile
  done < callref.log
}

subStandardVarient(){
  iter=0
  while read ref
  do
    cat sample.log | grep 'got hangup request, cause 31' | grep -o "$ref"  >> $targetLogFile
  done < callref.log
}

newFunction(){
  grep -f callref.log sample.log | grep 'got hangup request, cause 31'  >> $targetLogFile
}

newFunction4(){
  grep 'got hangup request, cause 31' sample.log | grep -of 'callref.log'>> $targetLogFile
}

newFunction5(){
  #splitting grep
  grep 'got hangup request, cause 31' sample.log > /tmp/somefile
  grep -of 'callref.log' /tmp/somefile >> $targetLogFile
}

newFunction2(){
  iter=0

  while read ref
  do
    ((iter++))
    echo "$ref" | grep 'got hangup request, cause 31' | grep -of 'callref.log' >> $targetLogFile
  done < sample.log
}

newFunction3(){
  iter=0
  pat=""
  while read ref
  do
    if [[ "$pat." != "." ]] ; then
      pat="$pat|"
    fi
    pat="$pat$ref"
  done < callref.log
  # Log "Have pattern $pat"
  while read ref
  do
    ((iter++))
    echo "$ref" | grep 'got hangup request, cause 31' | grep -oP "$pat" >> $targetLogFile
  done < sample.log
  #grep: regular expression is too large
}

[ -f "$targetLogFile" ] && rm -f "$targetLogFile"

numLines="100000"
Log "testing algorithms with $numLines in each log file."

setupCallRef(){
  ./genCallRefLog.sh $numLines
}

setupSampleLog(){
  ./genSampleLog.sh $numLines
}

setupCallRef
setupSampleLog

runMethodForDuration standard > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration subStandardVarient > /dev/null
checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction > /dev/null
checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction2 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction3 > /dev/null
# checkTargetFileSize "$numLines"
# [ -f "$targetLogFile" ] && rm -f "$targetLogFile"
# runMethodForDuration newFunction4 > /dev/null
# checkTargetFileSize "$numLines"
[ -f "$targetLogFile" ] && rm -f "$targetLogFile"
runMethodForDuration newFunction5 > /dev/null
checkTargetFileSize "$numLines"

The above shows that the existing method was always faster than anything I came up with. I think someone took care to optimize it.
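
One variant I did not benchmark that might be worth adding to the harness: passing -F to the second grep so the callrefs are treated as fixed strings rather than as tens of thousands of regexes, which lets grep use a multi-string matcher (a sketch in the same shape as newFunction5; newFunction6 is a hypothetical name):

newFunction6(){
  # -F: treat each callref.log line as a fixed string, not a regex
  grep 'got hangup request, cause 31' sample.log > /tmp/somefile
  grep -oFf callref.log /tmp/somefile >> $targetLogFile
}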

Calvin Taylor
  • thanks for the suggestion, but I am already extracting the past hour's logs from the main log file before processing, using sed to match the timestamps. callref.log consists of call identifier values and every hour it will have about 50-90 thousand of them. Also, my main concern is to make the while loop run faster, e.g. load all the values of callref.log into an array and search for all of them simultaneously in a single pass of sample.log – Nishant Singh Aug 08 '17 at 13:24
  • So your search is dynamic? eg your callref.log has changing values in it? – Calvin Taylor Aug 09 '17 at 15:47
  • Yes, the values in callref.log changes every hour – Nishant Singh Aug 10 '17 at 07:13
  • I thank you for your effort, it means a lot to me. Please do share your findings, there might be something valuable to me in them. Also, I did try reducing the script from an hourly cron job to 30 minutes, but it's not a permanent solution; I am now trying to put all the logs in Elasticsearch and query it from there, which will eliminate the hourly cron jobs, and to use the Kibana dashboard hoping to see the action in real time. – Nishant Singh Aug 22 '17 at 07:11
  • I didn't rerun anything but updated with the scripts I used. I'm sending elastic search a resume. =) thanks – Calvin Taylor Aug 22 '17 at 07:41

Since you could not produce adequate sample logs to test against even when requested, I whipped up some test material myself:

$ cat callref.log
a
b
$ cat sample.log
a 1
b 2
c 1

Using awk:

$ awk 'NR==FNR {             # hash callrefs
    a[$1]
    next
}
{                            # check callrefs from sample records and output when match
    for(l in a)
        if($0 ~ l && $0 ~ 1) # 1 is the static string you look for along a callref
            print l
}' callref.log sample.log
a 1
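
Note that the for(l in a) loop compares every callref against every sample record, which gets expensive with 50-90 thousand callrefs. Against the real log format (where the id appears as a C-xxxxxxxx token), one hash lookup per record would avoid that inner loop; a sketch under that assumption:

$ awk 'NR==FNR {a[$1]; next}                 # hash callrefs as above
/got hangup request, cause 31/ {             # filter on the static string first
    # extract the C-xxxxxxxx token and do a constant-time hash lookup
    if (match($0, /C-[0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z][0-9a-z]/)) {
        id = substr($0, RSTART, RLENGTH)
        if (id in a)
            print id
    }
}' callref.log sample.log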

HTH

James Brown