
I have a while loop that reads a mail log file and puts it into an array so I can search through the array and match up/search for a flow. Unfortunately, the while loop is taking a long time to get through the file. It is a very large file, but there must be a faster way of doing this.

cat /home/maillog |grep "Nov 13" |grep "from=<xxxx@xxxx.com>" |awk '{print $6}' > /home/output_1 

while read line; do awk -v line="$line" '$6 ~ line { print $0 }' /home/maillog >> /home/output_2 ; done < /home/output_1

Any ideas? Thanks in advance.

  • I think you can do all of that with awk alone. awk reads each line of the file and lets you filter, save variables, etc. I recently used awk on a file with more than 20k rows and it was super fast. That was my first awk script, and I found this site very useful: https://www.tutorialspoint.com/awk/ – Daniel Rodríguez Nov 19 '18 at 08:58
  • You're starting a new instance of `awk` for each line; that must be slow. – choroba Nov 19 '18 at 09:00
  • Do you really need the `/home/output_1` file? Avoiding the disk usage between the two calls can help improve performance. – Ôrel Nov 19 '18 at 09:19
  • Yes, I do. I need the file `/home/output_1` because it gives the ID of every sent mail. – Julián Díaz Nov 19 '18 at 09:25
  • You read `/home/maillog` too many times, once per line of `output_1`; that is why it is so slow. Rework it so the file is read only once or twice. – Ôrel Nov 19 '18 at 09:36
  • Can you share an example of input and output? – Ôrel Nov 19 '18 at 09:43
  • But I need to read line by line; I don't know how to read the file once and process every line as a variable. – Julián Díaz Nov 19 '18 at 09:44
  • Every mail sent has a unique ID (A1F43200021D), e.g.: `Nov 13 05:19:01 smtp postfix/local[26754]: A1F43200021D: to=, relay=local, delay=0.06, delays=0.02/0/0/0.03, dsn=5.2.2, status=bounced (cannot update mailbox /var/mail/root for user root. error writing message: File too large)`. I need to extract that ID (A1F43200021D:) to a single file. After that I need to collect all lines with that ID. – Julián Díaz Nov 19 '18 at 09:54
  • Why are you using a while-loop in the first place? You can use `grep -o` in order to print just a part of a line. I just tried your example, and `grep -o "ID ([0-9A-Z]*)"` seems to be working fine. – Dominique Nov 19 '18 at 10:34

1 Answer


Let us analyse your script and try to explain why it is slow.

Let's first start with a micro-optimization of your first line. It will not speed things up; this is merely educational.

cat /home/maillog |grep "Nov 13" |grep "from=<xxxx@xxxx.com>" |awk '{print $6}' > /home/output_1 

In this line you launch four different binaries, while in the end the work can be done by a single one. For readability you could keep this line; however, here are two main points:

  1. Useless use of cat. The program cat is mainly used to concatenate files. If you just pass it a single file, it is basically overkill, especially if you only want to pipe it to grep.

    cat file | grep ... => grep ... file
    
  2. Multiple greps in combination with awk can be written as a single awk:

    awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}'
    

So the entire line can be written as:

awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}' /home/maillog > /home/output_1
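To sanity-check the combined command, here is a small reproducible sketch. The log lines below are made-up postfix-style entries, and the path `/tmp/maillog.sample` is just an illustration:

```shell
# Made-up postfix-style sample log (hypothetical IDs and addresses).
cat > /tmp/maillog.sample <<'EOF'
Nov 13 05:19:01 smtp postfix/smtp[1234]: A1F43200021D: from=<xxxx@xxxx.com>, size=1024
Nov 13 05:19:02 smtp postfix/smtp[1234]: B2G54311132E: from=<other@other.com>, size=2048
Nov 14 06:00:00 smtp postfix/smtp[1234]: C3H65422243F: from=<xxxx@xxxx.com>, size=4096
EOF

# One awk call replaces cat | grep | grep | awk:
# only lines matching both patterns reach the print, and $6 is the queue ID.
awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}' /tmp/maillog.sample
# prints: A1F43200021D:
```

Only the first line matches both the date and the sender, so a single ID is printed.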

The second part is where things get slow:

while read line; do 
   awk -v line="$line" '$6 ~ line { print $0 }' /home/maillog >> /home/output_2 ;
done < /home/output_1

Why is this slow? For every line you read from /home/output_1, you load the program awk into memory, open the file /home/maillog, process every line of it, and close /home/maillog again. At the same time, for every line you process, you open /home/output_2, move the file pointer to the end of the file, write to it, and close the file again.

The whole program can actually be done with a single awk:

awk '(NR==FNR){ if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]; next } ($6 in a)' /home/maillog /home/maillog > /home/output2

The file is passed twice. During the first pass (NR==FNR), we only collect the IDs of the matching messages in the array a; the next ensures nothing is printed yet. During the second pass, every line whose ID is in a is printed. Note the first block must do nothing but build the array: if the second condition were also evaluated on the first pass, any line sharing an ID with an earlier match would be printed twice.
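Here is a minimal, self-contained sketch of this two-pass idea; the sample lines are made-up postfix-style entries (hypothetical IDs, addresses, and paths):

```shell
# Made-up sample: two log lines per message ID, postfix-style.
cat > /tmp/maillog.sample <<'EOF'
Nov 13 05:19:01 smtp postfix/smtp[26754]: A1F43200021D: from=<xxxx@xxxx.com>, size=1024
Nov 13 05:19:02 smtp postfix/local[26754]: A1F43200021D: to=<root>, status=bounced
Nov 13 05:20:00 smtp postfix/smtp[26754]: B2F43200021E: from=<other@other.com>, size=2048
Nov 13 05:20:01 smtp postfix/local[26754]: B2F43200021E: to=<root>, status=sent
EOF

# Pass 1 (NR==FNR): remember the queue IDs ($6) of matching messages.
# Pass 2: print every line whose queue ID was remembered.
awk '(NR==FNR){ if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]; next }
     ($6 in a)' /tmp/maillog.sample /tmp/maillog.sample
# prints both A1F43200021D: lines (the from= line and the to= line)
```

The entire flow of message A1F43200021D comes out, while message B2F43200021E (a different sender) is skipped, and the log is read only twice regardless of how many IDs match.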
kvantour