
I have a while loop that reads in an FTP log file and puts it into an array so I'll be able to search through the array and match up/search for a flow. Unfortunately, the while loop is taking forever to get through the file; it is a very large file, but there must be a faster way of doing this.

# read file into array for original search results
while read FTP_SEARCH
do
    ogl_date[count]=`echo $FTP_SEARCH | awk '{print $1, $2}'`
    ogl_time[count]=`echo $FTP_SEARCH | awk '{print $3}'`
    ogl_server[count]=`echo $FTP_SEARCH | awk '{print $4}'`
    ogl_id[count]=`echo $FTP_SEARCH | awk '{print $5}'`
    ogl_type[count]=`echo $FTP_SEARCH | awk -F'[' '{print $1}' | awk '{print $5}'`
    ogl_pid[count]=`echo $FTP_SEARCH | awk -F'[' '{print $2}' | awk -F']' '{print $1}'`
    ogl_commands[count]=`echo $FTP_SEARCH | awk '{
        for(i = 6; i <= NF; i++)
            print $i;
        }'`

    let "count += 1"

done < /tmp/ftp_search.14-12-02


Dec  1 23:59:03 sslmftp1 ftpd[4152]: USER xxxxxx  
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASS password  
Dec  1 23:59:03 sslmftp1 ftpd[4152]: FTP LOGIN FROM 172.19.x.xx [172.19.x.xx], xxxxxx  
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PWD  
Dec  1 23:59:03 sslmftp1 ftpd[4152]: CWD /test/data/872507/  
Dec  1 23:59:03 sslmftp1 ftpd[4152]: TYPE Image
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASV
Dec  1 23:59:04 sslmftp1 ftpd[4152]: NLST
Dec  1 23:59:04 sslmftp1 ftpd[4152]: FTP session closed
Dec  1 23:59:05 sslmftp1 ftpd[4683]: USER xxxxxx 
Dec  1 23:59:05 sslmftp1 ftpd[4683]: PASS password
Dec  1 23:59:05 sslmftp1 ftpd[4683]: FTP LOGIN FROM 172.19.1.24 [172.19.x.xx], xxxxxx 
Dec  1 23:59:05 sslmftp1 ftpd[4683]: PWD
Dec  1 23:59:05 sslmftp1 ftpd[4683]: CWD /test/data/944837/
Dec  1 23:59:05 sslmftp1 ftpd[4683]: TYPE Image
  • Please post an example line from ftp_search.14-12-02. The multiple calls to `awk` to parse each line are what is slowing you down. There are much better ways to parse in `bash`, but I'll need to see what a line looks like to suggest the best way. – chepner Mar 13 '14 at 16:02
  • Or, since it wouldn't need any calls to other external programs, it could all be done in one awk program (see the sketch after these comments). Sample data is required. Good luck. – shellter Mar 13 '14 at 16:06
  • Prefer perl for this kind of job! It's the basic behaviour of *Practical Extraction and Research Language*! – F. Hauri - Give Up GitHub Mar 13 '14 at 16:43
  • From the man page: "Perl officially stands for Practical Extraction and Report Language, except when it doesn't." – chepner Mar 13 '14 at 16:52
  • @chepner Of course, today Perl is used in many different ways! But its initial goal was exactly this kind of job. – F. Hauri - Give Up GitHub Mar 13 '14 at 17:04
  • For a more general question which is also a common FAQ, see https://stackoverflow.com/questions/13762625/bash-while-read-loop-extremely-slow-compared-to-cat-why – tripleee Sep 08 '21 at 19:04

1 Answer

  • You don't need to keep an iterator to add to arrays. You can simply do `array+=(item)` (not `array+=item`).
  • Getting the columns in the input is as simple as using `read` with multiple target variables. As a bonus, the last variable gets the Nth word and all subsequent words. See `help read`.

This saves a ton of forks, but I haven't tested how fast it is.

ogl_date=()
[...]
ogl_commands=()

while read -r date1 date2 time server id type pid commands
do
    ogl_date+=("$date1 $date2")
    [...]
    ogl_commands+=("$commands")
done < /tmp/ftp_search.14-12-02
  • That worked great, but it's still quite slow; it took several minutes to go through everything. Any other ideas I could try? I greatly appreciate your help! – cycloxr Mar 13 '14 at 16:35
  • What are you trying to do? Is there no way you can exclude most of the file before detailed processing? – l0b0 Mar 13 '14 at 16:40
  • @user2208986 did you try this as is, or did you add in all your custom logic before benchmarking it? The major performance killer in shell scripts is forking, which happens when using `\`..\``, `$(..)`, calling external commands and piping. This example has none of those, but yours had a lot. – that other guy Mar 13 '14 at 16:43
  • Yes, it worked as is. I'm unsure how I could make it smaller; I already did a grep to minimize it. Basically, I have an ftp log file with the above data, and I want to show the entire flow by searching username or IP. So I figured I'd read the data into arrays, search for the criteria, and then match that process id with the others so I'd get the entire flow. – cycloxr Mar 13 '14 at 16:49
  • 1
    `bash` isn't designed for this type of data processing. Using a general purpose programming language to do this would probably be much faster. – chepner Mar 13 '14 at 16:53
  • +1 @chepner. Python, Ruby or even Perl are much better at general purpose file munging. Two of them even have sane syntax ;) – l0b0 Mar 13 '14 at 16:55
  • So if I were to do this in Perl, any ideas on how I would attack it? – cycloxr Mar 13 '14 at 18:28