Regex not matching if order of items in file changes

Question

This is my first attempt at Perl so I know this code is ugly. Some of its from not knowing what I'm doing and some of its from troubleshooting various issues. What I'm trying to do is search a file(samplefile.txt) for various information (the 9 parse_updates functions) and it works fine unless the order changes. For example if one samplefile has Certificate Bundle before Botnet Definitions then it will fail to find the Certificate Bundle information. I expected each function to start searching the samplefile "fresh" but that doesn't seem to be the case and I can't figure out why. Not including a samplefile as the code post is already long enough and I think the problem is in my function logic.


use strict;
use warnings;
use diagnostics;
use File::Slurp;
my @autoupdate;
my $autoupdate;

my $av_regex;
my @av_updates;

my $avdev_regex;
my @avdef_updates;

my $ipsatt_regex;
my @ipsatt_updates;

my $attdef_regex;
my @attdef_updates;


my $ipsmal_regex;
my @ipsmal_updates;

my $flowav_regex;
my @flowav_updates;

my $botnet_regex;
my @botnet_updates;

my $appdef_regex;
my @appdef_updates;

my $ipgeo_regex;
my @ipgeo_updates;

my $certbun_regex;
my @certbun_updates;

my $str1;
my $str2;
my $str3;
my $str4;
my $str5;
my $str6;
my $str7;
my $str8;
my $str9;


 
parse_updates1(); #AV Engine
parse_updates2(); #Virus Defs
parse_updates3(); #IPS Attack Engine
parse_updates4(); #Attack Defs
parse_updates5(); #IPS Mal URL DB
parse_updates6(); #Flow virus Defs
parse_updates7(); #Botnet Defs
parse_updates8(); #IP Geo DB
parse_updates9(); #Cert Bundle


sub parse_updates1{
print "\nTHIS IS AV Engine Section!!\n\n";
read_file('samplefile.txt', buf_ref => \$str1);


my $av_regex =qr/(AV Engine)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;
if ( $str1 =~ /$av_regex/g ) {
  #putting each regex group into the array
  push @av_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds
  chomp @av_updates;

     print "$_\n" for @av_updates;

}
else {
  print "\n\nGot Nothing!\n\n";
  @av_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @av_updates;

}
}
sub parse_updates2{
read_file('samplefile.txt', buf_ref => \$str2);

print "\nTHIS IS Virus Definitions Section!!\n\n";

my $avdef_regex =qr/(Application Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str2 =~ /$avdef_regex/g ) {
  #putting each regex group into the array
  push @avdef_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @avdef_updates;
 
print "$_\n" for @avdef_updates;

}
else {
  print "\n\nGot Nothing!\n\n";
  @avdef_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @avdef_updates;


}
}
sub parse_updates3{
read_file('samplefile.txt', buf_ref => \$str3);

 print "\nTHIS IS IPS Attack Engine Section!!\n\n";

my $ipsatt_regex =qr/(IPS Attack Engine)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str3 =~ /$ipsatt_regex/g ) {
  #putting each regex group into the array
  push @ipsatt_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipsatt_updates;
 

print "$_\n" for @ipsatt_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipsatt_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipsatt_updates;

}
}
sub parse_updates4{
read_file('samplefile.txt', buf_ref => \$str4);

 print "\nTHIS IS Attack Definitions Section!!\n\n";

my $attdef_regex =qr/(Attack Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str4 =~ /$attdef_regex/g ) {
  #putting each regex group into the array
  push @attdef_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @attdef_updates;
 

print "$_\n" for @attdef_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @attdef_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @attdef_updates;

}
}
sub parse_updates5{
read_file('samplefile.txt', buf_ref => \$str5);

print "\nTHIS IS IPS Malicious URL Database Section!!\n\n";

my $ipsmal_regex =qr/(IPS Malicious URL Database)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str5 =~ /$ipsmal_regex/g ) {
  #putting each regex group into the array
  push @ipsmal_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipsmal_updates;
 

print "$_\n" for @ipsmal_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipsatt_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipsatt_updates;

}
}
sub parse_updates6{
read_file('samplefile.txt', buf_ref => \$str6);

print "\nTHIS IS Flow-Based Virus Definitions Section!!\n\n";

my $flowav_regex =qr/(Flow-based Virus Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str6 =~ /$flowav_regex/g ) {
  #putting each regex group into the array
  push @flowav_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @flowav_updates;
 

print "$_\n" for @flowav_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @flowav_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @flowav_updates;

}
}
sub parse_updates7{
read_file('samplefile.txt', buf_ref => \$str7);

print "\nTHIS IS Botnet Definitions Section!!\n\n";

my $botnet_regex =qr/(Botnet Definitions)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str7 =~ /$botnet_regex/g ) {
  #putting each regex group into the array
  push @botnet_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @botnet_updates;
 

print "$_\n" for @botnet_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @botnet_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @botnet_updates;

}
}
sub parse_updates8{
read_file('samplefile.txt', buf_ref => \$str8);


print "\nTHIS IS IP geography DB Section!!\n\n";

my $ipgeo_regex =qr/(IP Geography DB)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str8 =~ /$ipgeo_regex/g ) {
  #putting each regex group into the array
  push @ipgeo_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @ipgeo_updates;
 

print "$_\n" for @ipgeo_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @ipgeo_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @ipgeo_updates;

}
}
sub parse_updates9{
read_file('samplefile.txt', buf_ref => \$str9);


print "\nTHIS IS Certificate Bundle Section!!\n\n";

my $certbun_regex =qr/(Certificate Bundle)(.*\n)*?(Version:)(.*\n)*?(Contract Expiry Date:)(.*\n)*?(Last Updated using )(.*\n)*?(Last Update Attempt: )(.*\n)*?(Result: )(.*\n).*/p;

if ( $str9 =~ /$certbun_regex/g ) {
  #putting each regex group into the array
  push @certbun_updates, $1, $2, $3 ,$4, $5, $6, $7, $8, $9, $10, $11, $12;
  #Removing new linefeeds 
  chomp @certbun_updates;
 

print "$_\n" for @certbun_updates;
}
else {
  print "\n\nGot Nothing!\n\n";
  @certbun_updates = qw(notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound notfound);
    print "$_\n" for @certbun_updates;

}


# End of sub parse_updates
}

`if ( /.../g )` makes no sense. Drop the `g`? Otherwise, way too much too look at. Please narrow down the problem — ikegami, Jul 25 '20 at 05:23
Way way too much for a reasonable question. And needlessly so: all your functions do the same, exactly the same (as far as I can tell by scanning quickly through all that.) So you need an array with patterns, one function. There's a bunch of other specifics but first -- there is not even a sample of data, and any of the regex could, be the problem. I suggest to rework the question. — zdim, Jul 25 '20 at 05:41
I closevote for "needs debugging details". Please do not misunderstand as "needs more code". On the contrary, please provide a [mre]. — Yunnosch, Jul 25 '20 at 06:17
To be clear, that non-sense `g` is likely the cause of your problems. But I only looked at your code for 5 seconds. — ikegami, Jul 25 '20 at 14:17
As i've hinted in my answer I too find it likely that `if (/.../g)` may be _the_ cause for trouble (not blaimiing you!). I've now added a section to the end of my answer that details how that may come about. Can't say anything definitive without seeing some data though. — zdim, Jul 25 '20 at 18:05

zdim · Accepted Answer · 2020-07-31T05:31:03.690

Even though the question can't be answered definitively without seeing some data, I'd like to offer a re-write of that program first. That may also solve the problem(s).

There is no reason for all those functions; they all do exactly the same. There is no need for the sea of variables either; hashes are good for collections of named things. I keep at least some of the original choices, like the overall flow, use of File::Slurp, etc.

use warnings;
use strict;
use feature 'say';    

use Data::Dump qw(dd);
use File::Slurp;

my $fname = shift // die "Usage: $0 file\n";   #/

my %update = (
    av => { 
        re => qr/pattern-for-av/,
        name => q(AV Engine Section),
    },
    avdev => { 
        re => qr/pattern-for-avdev/, 
        name => q(Virus Definitions Section),
    },
    # ...
);

my $file_content = read_file($fname);

foreach my $code (sort keys %update) {
    say "This is $update{$code}{name}";
    my $captures = parse_update( $file_content, $update{$code}{re} );
    $update{$code}{captures} = $captures;
}    
dd \%update;

sub parse_update {
    my ($file_content, $re) = @_;

    my @captures = $file_content =~ /$re/;  
    if (not @captures) {
        say "Got nohting!";
        @captures = ( 'notfound' ) x 12;  # apparently exactly 12
    }
    else { chomp @captures }

    say for @captures;  

    return \@captures;
}

The regex patterns and section names are all in the hash %update, and results (captures) are then added. This choice of data organization is a bit arbitrary, as I don't know the context.

The file is opened once and all its content copied to the sub repeatedly. Please adjust if/as needed. There are other ways to make that data available to the sub, if the file is huge for example.

That if (/.../g), used in the question and seen occasionally, makes no sense and can easily be very wrong -- and can also cause problems of the kind described in the question.^† The /g modifier serves complex needs when used in scalar context, not meant for a lone if statement.

The conditions for successful match (and thus capturing) are taken from the question. The code in the sub can be organized in a number of other ways, from more compact to more elaborate.

Note that the sub uses nothing directly from the higher scope; all it needs is explicitly passed to it, and its result is returned. This is very important, so to avoid coupling of code components meant to be distinct (here the sub and its callers); they could reside even across different compilation units.

This rewrite may well have scooped up error(s) and solved the problem; or it may have not. If we can get to see a sample of data then a more pointed troubleshooting may be possible.

The code above has been tested, with a made-up file and suitable regex patterns.

^† While I'd need to see some data to establish what causes the reported behavior, a good candidate is the unsuspecting use of if (/.../g). That modifier makes the regex remember the position where it matched and the next time that regex is invoked on the same string it starts looking for the match from the position in the string of that previous match.

A simple example

use warnings; use strict; use feature 'say';

my $s = q(one simple string); 

if ($s =~ /(\w+)/g) { say $1 }; 
if ($s =~ /(\w+)/g) { say $1 }; 
say pos($s);

which prints

one
simple
10

where the last line is the position in the string at that point as tracked by regex; right after the second match. (The pos function is great for seeing some of what goes on in a regex operation.) So when invoked again after a match the regex continues from where it left off, by courtesy of /g modifier; without it the string is scanned from the beginning in new invocations.

Another example, with a single expression executed repeatedly

use warnings; use strict; use feature 'say';

my $s = q(one two);

sub func { say $1 if $_[0] =~ /(\w+)/g };  # /g is of consequence!

for (1..4) { func($s) }

This prints

one
two

and it's done; there's no more. That's because the engine got past the word two in the second match, and so there is nothing to match in the next iterations of the for loop.

See this post and this post for more about the above examples and their context.

This second example in particular is eerily similar to what is given in the question.

Some of the above behavior can be modified by anchors and other modifiers, and the /g is of course of great use -- but one need be aware of what it does.

Regex not matching if order of items in file changes

1 Answers1