
OK, so I wrote a bash script to show the entire flow of an FTP connection by searching on username or IP address. I had it read the data into an array, search for the criteria, and then match that process ID against other lines so I would get the entire flow.

The performance, however, was extremely slow, and at the suggestion of others in the Experts Exchange community I decided to give it a try in Perl. I'm attempting to learn as much as I can but still have a long way to go. I want to search for the criteria, take the process ID of that line, and then read all the lines matching that process ID into an array, so I'm basically getting the entire flow of the FTP connection.

I'm assuming I would read each line in from the file, do a pattern match on it, and if it matches the IP address I'm searching for, copy that line to an array. After I've read those lines into the array, I'll go back and grab the process ID from each of them, do another search on the file, put all the lines matching the process ID into a new array, and then print that array out.
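
Roughly, the two-pass approach I have in mind looks like the sketch below (untested; the log-file name and the exact regex are just placeholders based on my data):

#!/usr/bin/perl
use strict;
use warnings;

my $ip   = '192.168.0.2';   # the address I'm searching for
my $file = 'ftp.log';       # placeholder log-file name

# First pass: collect the PIDs of sessions involving $ip
open my $fh, '<', $file or die "Couldn't open $file: $!";
my %want;
while (my $line = <$fh>) {
    $want{$1} = 1 if $line =~ /ftpd\[(\d+)\]:.*\Q$ip\E/;
}

# Second pass: pull every line belonging to one of those PIDs
seek $fh, 0, 0;
my @flow;
while (my $line = <$fh>) {
    push @flow, $line if $line =~ /ftpd\[\s*(\d+)\]/ and $want{$1};
}
close $fh;

print @flow;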

I have the following code, which matches up lines of a file based on whether they contain a pattern from an array.

The array @pids contains data like the following, but with several hundred more entries:

4682
4690
4692
4693
4696
5320

If the line I'm reading in has this number in it, I push it to a new array. Once it gets to the end of the file, it goes back to the beginning and works on matching the next element of @pids. I then print the new array out to a file.

Unfortunately, the loop is taking forever. Is there any way I can speed this up? I'm assuming it's slow because I keep going through the file over and over again, but I'm not sure how else I should do it.

seek INPUT, 0, 0;

my @flow;
my $count = 0;
my $pid_count = 0;

foreach my $mPID(@pids){
    while(my $line = <INPUT>){
        if ($line =~ /$mPID/){
            push @flow, $line;
        }
    }
    push @flow, "###############\n";
    seek INPUT, 0, 0;
}

open (OUTPUT, '>'.$output) or die "Couldn't open $output for writing.\n";
print OUTPUT @flow;
close (OUTPUT);

Here's an example of the data coming from the log file:

Dec  1 23:59:03 ftp1 ftpd[4152]: PASV
Dec  1 23:59:04 ftp1 ftpd[4152]: NLST
Dec  1 23:59:04 ftp1 ftpd[4152]: FTP session closed
Dec  1 23:59:05 ftp1 ftpd[4682]: USER test1
Dec  1 23:59:05 ftp1 ftpd[4682]: PASS password
Dec  1 23:59:08 ftp1 ftpd[4682]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prd

Here's an example of the data I'm getting the PIDs from (all the lines matching the IP address):

Dec  1 23:59:08 ftp1 ftpd[4682]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prd
Dec  1 23:59:10 ftp1 ftpd[4690]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prod1
Dec  1 23:59:10 ftp1 ftpd[4692]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prod
Dec  1 23:59:11 ftp1 ftpd[4693]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  1 23:59:14 ftp1 ftpd[4696]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  1 23:59:40 ftp1 ftpd[5320]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  1 23:59:47 ftp1 ftpd[5325]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prd
Dec  1 23:59:48 ftp1 ftpd[5328]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prod1
Dec  1 23:59:49 ftp1 ftpd[5329]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prod
Dec  1 23:59:49 ftp1 ftpd[5330]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  2 00:00:09 ftp1 ftpd[9876]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  2 00:00:25 ftp1 ftpd[12830]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], test1
Dec  2 00:00:25 ftp1 ftpd[12832]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prd
Dec  2 00:00:27 ftp1 ftpd[12850]: FTP LOGIN FROM 192.168.0.2 [192.168.0.2], prod1

Thanks!

cycloxr
  • Why not loop through all of the patterns for every line? Right now you read each line of the file `scalar @pids` times. IO is typically the bottleneck for these kinds of problems; if you can minimize it, you should. – Hunter McMillen Apr 03 '14 at 16:23
  • I'm looking to have it sorted which is why I have the @pid loop on the outside. – cycloxr Apr 03 '14 at 16:28
  • This looks like an [XY problem](http://www.perlmonks.org/?node_id=542341). You've described *how* you're trying to solve your problem, but not *what* you're trying to solve. I suspect there is a better way to tackle this, but without more details, it's hard to say. Please tell us what exactly you're trying to achieve and we may be able to recommend some better approaches. – ThisSuitIsBlackNot Apr 03 '14 at 16:28
  • @user2208986 If you use a hashref whose keys are pids and value is an arrayref of lines like mob shows below, you can still maintain order. – Hunter McMillen Apr 03 '14 at 16:30
  • (You asked a previous question, since deleted, describing the actual problem you were trying to solve; based on your description of the problem in that question, there was no need for an array or for multiple passes through the file, which is why I made my last comment) – ThisSuitIsBlackNot Apr 03 '14 at 16:31
  • @ThisSuitIsBlackNot I have added the problem I'm trying to solve from that deleted post you saw and commented on. I'm simply learning the best I can. Everything I've received has been very helpful thus far, so thank you. – cycloxr Apr 03 '14 at 16:51
  • Based on your edit and my memory of your deleted question: If you're only trying to trace a single session, there's no need to store the PIDs for every session in your entire log file. Nor do you need to compare every line in the file with every PID. That is very resource intensive and completely unnecessary. Am I correct that you're only trying to trace a single session? – ThisSuitIsBlackNot Apr 03 '14 at 16:59
  • I'm trying to trace all sessions for a given IP which could have several sessions – cycloxr Apr 03 '14 at 17:00
  • In that case, you still don't need to store all PIDs for all sessions in the log file; you only need the PIDs for the sessions corresponding to that IP. You can do this in a single pass through the file. Can you add the lines from the logfile that include the IP address into your question? – ThisSuitIsBlackNot Apr 03 '14 at 17:05
  • @ThisSuitIsBlackNot I have added the lines from the logfile that include the IP address in question. I had grep'd these lines out of the original file that includes the entire flow to just get the PIDs. But even with this IP grep'd out, I still have 7860 lines that match that IP address. – cycloxr Apr 03 '14 at 17:12

4 Answers


Anytime you have a loop within a loop, that's likely a performance problem. Let's say you have 1000 PIDs and 1 million lines of log file. Looping over every line in the file for every PID is 1000 * 1 million, which is 1 BILLION DOLLARS!!! Err... iterations.

Right now you're checking each line to see if it contains each PID. If you were doing this by hand, you wouldn't do that. You'd scan the line for things that look like PIDs and see if they're on your list. PIDs are easy to identify (they're integers), so let's do that. We can start simple and just match numbers.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

# Some test PIDs
my @pids = (
    12,
    1123,
    1234
);

# Put the PIDs into a hash.  Each line which matches will be stored.
my %pids = map { $_ => [] } @pids;

# Loop through the lines
while(my $line = <DATA>) {
    # Look for a PID
    if(my($pid) = $line =~ m{ \[ \s*(\d+) \]: }x) {
        # Push it into the appropriate PID slot if it's on our list
        push @{$pids{$pid}}, $line if $pids{$pid};
    }
}

# Output the PIDs which have matching lines
for my $pid (keys %pids) {
    my $lines = $pids{$pid};
    next if !@$lines;

    print "PID: $pid\n";
    print @$lines;
    print "##################\n";

}


# Some test lines
__DATA__
Dec  1 23:59:03 ftp1 ftpd[  12]: PASV
Dec  1 23:59:04 ftp1 ftpd[1123]: NLST
Dec  1 23:59:04 ftp1 ftpd[3114]: FTP session closed
Dec  1 23:59:05 ftp1 ftpd[9999]: USER test1
Dec  1 23:59:05 ftp1 ftpd[ 123]: PASS password

Now you only have to loop through the file once. Since the PID list is small (the max PID is typically in the tens of thousands, and even a million isn't that big), storing each line which matches is unlikely to take up a lot of memory, so it's OK to store all the matching lines. If output order doesn't matter, you can avoid storing the lines and print them as they match, like grep.

while(my $line = <DATA>) {
    # Look for a PID
    if(my($pid) = $line =~ m{ \[ \s*(\d+) \]: }x) {
        print $line if $pids{$pid};
    }
}

A note about the PID matching. In your original example you're just looking to see if the PID is anywhere in the line: $line =~ /$mPID/. This is a problem. PID 123 will match ftpd[1234]. PID 59 will match on 23:59:04.

By looking for entire numbers and then seeing if they're in the list, that saves us from the former. ftpd[1234] won't match PID 123. But it doesn't save us from accidentally matching the date or other numbers in the comments. Based on the sample lines you gave, I used the more restrictive $line =~ m{ \[ \s*(\d+) \]: }x to look for PIDs in the right spot.

You'll have to look at the data to determine if you can get away with this. If not, you can at least match all the numbers in the line with my @pids = $line =~ m{ (\d+) }gx.
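
To see the difference concretely, here's a minimal sketch with a made-up log line:

use strict;
use warnings;

my $line = 'Dec  1 23:59:04 ftp1 ftpd[1234]: NLST';

# Loose match: PID 123 falsely matches inside 1234
print "loose match\n"    if $line =~ /123/;

# Restrictive match: only the number in ftpd[...]: counts, so 123 does not match
print "anchored match\n" if $line =~ m{ \[ \s*(123) \]: }x;

Only the first print fires.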

Schwern
  • How would I go about modifying this so I only get the lines that are related to the specific IP address. I was thinking matching it by time frame would be the way to go but am having difficulty filtering them out. – cycloxr Apr 08 '14 at 14:57
  • @cycloxr You should ask that as a new question. – Schwern Apr 08 '14 at 18:53

You should build a regular expression out of your array, like this

my $pids = join '|', @pids;
$pids = qr/$pids/;

then you only have to make a single comparison for each line of the input file.

open my $out_fh, '>', $output or die qq{Couldn't open "$output" for writing: $!\n};

while (my $line = <$in_fh>) {
  print $out_fh $line if $line =~ $pids;
}

close $out_fh;

Note also that you should use lexical filehandles with meaningful names, and the three-parameter form of open.

If you require your output sorted in order of the PID values then there is a little more work to do, but it is quite possible.


Update

If you require your output separated into groups for each PID then you will have to store your output in a hash before printing it, like this

my $pids = join '|', @pids;
$pids = qr/($pids)/;

my %output;

while (my $line = <$in_fh>) {
  push @{ $output{$1} }, $line if $line =~ $pids;
}

open my $out_fh, '>', $output or die qq{Couldn't open "$output" for writing: $!\n};

for my $pid (@pids) {
  next unless my $lines = $output{$pid};
  print $out_fh $_ for @$lines;
  print $out_fh "###############\n";
}

close $out_fh;

Note that neither of these solutions has been tested beyond compilation, as it is a significant amount of work to create a set of test data.


Update 2

This program uses the new data from your updated question.

use strict;
use warnings;

my $outfile = 'result.txt';

my @pids = qw/ 4682 4690 4692 4693 4696 5320 /;
my $pids = join '|', @pids;
$pids = qr/\b($pids)\b/;

open my $in_fh, '<', 'logfile.txt' or die $!;

my %output;
while (my $line = <$in_fh>) {
  push @{ $output{$1} }, $line if $line =~ $pids;
}


open my $out_fh, '>', $outfile or die qq{Couldn't open "$outfile" for writing: $!\n};

for my $pid (@pids) {
  next unless my $lines = $output{$pid};
  print $out_fh $_ for @$lines;
  print $out_fh "###############\n";
}

close $out_fh;

Output:

Dec  1 23:59:05 ftp1 ftpd[4682]: USER test1
Dec  1 23:59:05 ftp1 ftpd[4682]: PASS password
###############
Borodin
  • The pids regex should have parens and be anchored, as in `qr/\b($pids)\b/` so that '123' doesn't match a pid of '4123'. – Andy Lester Apr 03 '14 at 16:37
  • @AndyLester: Yes, probably. I'm recreating the behaviour of the OP's code. Not sure whether that's a good idea. – Borodin Apr 03 '14 at 16:41
  • @AndyLester: Having seen the OP's input file I'm agreeing with you. Fixed. – Borodin Apr 03 '14 at 17:01

My understanding of your problem is that you have:

  • An FTP log
  • An IP address

and you want to trace the session or sessions initiated from that IP. You can do this in a single pass through your log file like this (I constructed some sample data based on another question you asked previously):

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;
use Regexp::Common qw(net);

my $ip = '192.0.2.0';

my (%pid, %session);
while (<DATA>) {
    chomp;

    if (/ftpd\[(\d+)\]:\s+(?:USER|PASS)/) {
        push @{ $session{$1} }, $_;
    }
    elsif (/ftpd\[(\d+)\]:\s+FTP LOGIN FROM ($RE{net}{IPv4})/) {
        if ($2 eq $ip) {
            $pid{$1} = 1;
            push @{ $session{$1} }, $_;
        }
        else {
            delete $session{$1};
        }
    }
    elsif (/ftpd\[(\d+)\]:/) {
        push @{ $session{$1} }, $_ if exists $pid{$1};
    }
}

print Dumper \%session;

__DATA__
Dec  1 23:59:03 sslmftp1 ftpd[4152]: USER xxxxxx
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASS password
Dec  1 23:59:03 sslmftp1 ftpd[4152]: FTP LOGIN FROM 192.0.2.0 [192.0.2.0], xxxxxx
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PWD
Dec  1 23:59:03 sslmftp1 ftpd[4152]: CWD /test/data/872507/
Dec  1 23:59:03 sslmftp1 ftpd[4152]: TYPE Image
Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASV
Dec  1 23:59:04 sslmftp1 ftpd[4152]: NLST
Dec  1 23:59:04 sslmftp1 ftpd[4152]: FTP session closed
Dec  1 23:59:05 sslmftp1 ftpd[4683]: USER xxxxxx
Dec  1 23:59:05 sslmftp1 ftpd[4683]: PASS password
Dec  1 23:59:05 sslmftp1 ftpd[4683]: FTP LOGIN FROM 192.0.2.1 [192.0.2.1], xxxxxx
Dec  1 23:59:05 sslmftp1 ftpd[4683]: PWD
Dec  1 23:59:05 sslmftp1 ftpd[4683]: CWD /test/data/944837/
Dec  1 23:59:05 sslmftp1 ftpd[4683]: TYPE Image
Dec  1 23:59:06 sslmftp1 ftpd[4925]: USER xxxxxx
Dec  1 23:59:06 sslmftp1 ftpd[4925]: PASS password
Dec  1 23:59:06 sslmftp1 ftpd[4925]: FTP LOGIN FROM 192.0.2.0 [192.0.2.0], xxxxxx
Dec  1 23:59:07 sslmftp1 ftpd[4925]: PWD
Dec  1 23:59:08 sslmftp1 ftpd[4925]: CWD /test/data/944837/
Dec  1 23:59:09 sslmftp1 ftpd[4925]: TYPE Image

Output:

$VAR1 = {
          '4152' => [
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: USER xxxxxx  ',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASS password  ',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: FTP LOGIN FROM 192.0.2.0 [192.0.2.0], xxxxxx  ',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: PWD  ',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: CWD /test/data/872507/  ',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: TYPE Image',
                      'Dec  1 23:59:03 sslmftp1 ftpd[4152]: PASV',
                      'Dec  1 23:59:04 sslmftp1 ftpd[4152]: NLST',
                      'Dec  1 23:59:04 sslmftp1 ftpd[4152]: FTP session closed'
                    ],
          '4925' => [
                      'Dec  1 23:59:06 sslmftp1 ftpd[4925]: USER xxxxxx ',
                      'Dec  1 23:59:06 sslmftp1 ftpd[4925]: PASS password',
                      'Dec  1 23:59:06 sslmftp1 ftpd[4925]: FTP LOGIN FROM 192.0.2.0 [192.0.2.0], xxxxxx ',
                      'Dec  1 23:59:07 sslmftp1 ftpd[4925]: PWD',
                      'Dec  1 23:59:08 sslmftp1 ftpd[4925]: CWD /test/data/944837/',
                      'Dec  1 23:59:09 sslmftp1 ftpd[4925]: TYPE Image'
                    ]
        };

You now have the lines for each session initiated by $ip in a hash, with session PIDs as keys. I simply printed it with Data::Dumper, but you can manipulate the hash however you like.
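
For example, a minimal sketch to reproduce the grouped output of your original script from this hash (the lines were chomped above, so the newlines go back in here):

for my $pid (sort { $a <=> $b } keys %session) {
    print "$_\n" for @{ $session{$pid} };
    print "###############\n";
}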

ThisSuitIsBlackNot

Reading through a file is slower than looping through an array. If the input isn't too large, you should load it into an array and loop through that array instead:

my @input = <INPUT>;
foreach my $mPID (@pids) {
    foreach my $line (@input) {
        ...

If the input is too large, then maybe you could reverse the order of the loops, so that you're still only reading through the file one time:

my %flow;
while (my $line = <INPUT>) {
    foreach my $mPID (@pids) {
        if ($line =~ /$mPID/) {
            push @{$flow{$mPID}}, $line;
        }
    }
}

open (OUTPUT, '>'.$output) or die "Couldn't open $output for writing.\n";
foreach my $mPID (@pids) {
    if ($flow{$mPID}) {
        print OUTPUT @{$flow{$mPID}}, "################\n";
    }
}
close (OUTPUT);
mob
  • The file I'm reading in is 440932 lines long, would that be too big to read into an array? – cycloxr Apr 03 '14 at 16:30
  • `print OUTPUT $_ for @{$flow{$mPid}}, "################\n"` is a very simple optimisation – Borodin Apr 03 '14 at 16:39
  • I'm afraid that `print for OUTPUT @{$flow{$mPid}}, "################\n"` doesn't work, and using foreach would be slower than the original. – mpapec Apr 03 '14 at 16:42
  • @mpapec: Sorry, fixed – Borodin Apr 03 '14 at 16:43
  • @mpapec: Are you being serious here? You don't seem to be very constructive. I am talking about economy of memory. There is no reason to push what may be a huge array onto the stack. The time economy of making only a single procedure call will be swamped by the time taken to write to disk. – Borodin Apr 03 '14 at 16:49
  • The keywords are `may be`. – mpapec Apr 03 '14 at 17:25
  • @Borodin You're making a lot of assumptions about how Perl works, some of them wrong. `print for @array` is likely to be slower than `print @array` because the former is N op calls while the latter is 1. Because Perl ops are highly optimized C code, any time you can reduce the number of operations in a Perl script is probably going to be a win. Arguments are not copied into functions, they are aliased, so there's no memory copy done. Of course, benchmark and profile first. – Schwern Apr 03 '14 at 17:39
  • They have subtly different behaviors because of `$/`. Run it with `-l` to see what I mean. I whipped up a [benchmark](https://gist.github.com/schwern/9959154) to double-check because, with a few dozen layers between you and the hardware, even if you think you know how a computer is behaving internally, you're probably wrong. – Schwern Apr 03 '14 at 17:45
  • @Schwern I can confirm your benchmark, adding only that the list form always performs at least 30% better. – mpapec Apr 03 '14 at 17:59
  • @Schwern: I am clear about how Perl parameter passing works, and I explained that I was concerned with memory rather than speed. (A saving of a tenth of a second is pointless when it happens only once at the end of the program.) In your *gist*, the array is just 1 million scalar values, which is the same as the contents of the stack when they are all passed at once. (A scalar value isn't any smaller if it happens to be an alias.) That means 20MB are being pushed onto the stack for the call. It is also deceptive to write to `/dev/null`, which is bound to behave very much faster than a disk file – Borodin Apr 03 '14 at 18:42
  • @Schwern: I am surprised by the size of the difference in performance (which I calculate at around 20% when writing to a disk) but for datasets like `my @list = map sprintf('%07d', $_), 1 .. 1_000_000` I get another 20% performance benefit using `print {$fh} join '', @list`. If speed is the holy grail then this should be considered, at the cost of even more memory usage. – Borodin Apr 03 '14 at 18:47
  • @Borodin 20MB is not being pushed onto the stack, one AV pointer is. You can [verify this roughly using ps](https://gist.github.com/schwern/9962275). Note the resident memory size is roughly the same for all instances. I can't account for why the `join` trick is faster, it shouldn't be, I'd call it an opportunity for optimization inside `print`. – Schwern Apr 03 '14 at 20:29
  • @Borodin Ah, `print @list` has to do the extra work of joining on `$,`. I'll bet `join` has a special case for empty string. There's an opportunity for optimizing `print` when `$,` is empty. `print @list` is probably also generating more IO calls to fit in the IO buffer. [Updated the benchmark](https://gist.github.com/schwern/9959154). Anyhow, none of this micro-optimization should be the concern of the OP. – Schwern Apr 03 '14 at 20:45
  • @Schwern: *"20MB is not being pushed onto the stack, one AV pointer is"*. No, Perl builds `@_` anew for each call, and it is a separate array from anything named in the call. It has to, otherwise how would it pass the parameters to `print @list, 42`? *"Anyhow, none of this micro-optimization should be the concern of the OP"* I completely agree :) – Borodin Apr 03 '14 at 21:21
  • @Borodin [I looked into it with Devel::Peek](https://gist.github.com/schwern/9964160) and you're right... at least for user functions. It does create a new AV (I'd have thought it wouldn't for `print @list` as opposed to `print @list, "foo"`) but all the SVs are aliases. However, [`foo(@list)` uses 8 megs more resident memory than `print @list` or `print @list, "foo"`](https://gist.github.com/schwern/9964225) which suggests `print` is doing something different than a user function. – Schwern Apr 03 '14 at 22:34
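
For anyone who wants to reproduce the comparison above, here is a minimal sketch along the lines of Schwern's benchmark gist; the list size is arbitrary, and as Borodin notes, /dev/null behaves much faster than a real disk file:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary test data
my @list = map { sprintf "%07d\n", $_ } 1 .. 100_000;

# Swap in a real disk path to measure actual file IO
open my $fh, '>', '/dev/null' or die $!;

cmpthese(-2, {
    list    => sub { print {$fh} @list },
    foreach => sub { print {$fh} $_ for @list },
    join    => sub { print {$fh} join '', @list },
});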