My question is the same as "How do I re.search or re.match on a whole file without reading it all into memory?", but using Perl instead of Python.

Question: I want to run a regular expression on an entire file without reading the whole file into memory at once, as I may be working with rather large files in the future. Is there a way to do this? Thanks!

Clarification: I cannot simply match line-by-line because a match can span multiple lines.

Why am I using Perl instead of Python? I've run into enough issues with Python's regex support that I need to switch to Perl. I would install https://pypi.python.org/pypi/regex, but I can't: my workplace, understandably, doesn't allow write access to its Python installation directory, and I would prefer to avoid a slow back-and-forth email chain with IT to get it installed, and/or deal with further permission issues :)

EDIT: Example patterns I'm looking for

assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;

assign signal1[15:0] = {input1[7:0], input2[7:0]};

assign signal2[34:0] = { 4'b0,
                         subsig0[3:0],
                         subsig1,
                         subsig2,
                         subsig3[18:2],
                         subsig4[5:0]
                       };

I'm looking for patterns like the above, i.e. a variable assignment up to the terminating semicolon. The regex should match any of the above, as I don't know in advance whether a given statement spans multiple lines. Perhaps something similar to /assign\s+\S+\s+=\s+[^;]*;/, i.e. everything up to the first semicolon.
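
As a quick sanity check on a pattern of this shape: a negated character class such as [^;] also matches newlines, so no /m or /s modifier is needed for the match to span lines. The sketch below runs on a small in-memory string rather than a large file; the `*` quantifier, `\S+` for the signal name (so that names like signal1[15:0] match), and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Two of the example statements, held in a string for the check.
my $text = <<'END';
assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;
assign signal1[15:0] = {input1[7:0], input2[7:0]};
END

# [^;]* consumes everything, newlines included, up to the first semicolon.
my @statements = $text =~ /(assign\s+\S+\s*=\s*[^;]*;)/g;

print scalar(@statements), " statements found\n";
```

This only shows that the pattern itself handles multi-line statements; it does not address the memory concern, since the whole input is in one string here.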

EDIT2: From the given answers (thanks!) it appears that decomposing the pattern into start, middle, & end sections might be the best strategy, e.g. using the range operator as suggested by some.

mgoblue92
  • So you would like to keep only the part that you match against in memory? You could then try to progressively read lines until you get a complete match, and discard lines from the beginning if there is no match – Håkon Hægland Mar 22 '17 at 16:34
  • Wait, so your company is okay with you switching your development work to a completely different language, but they won't let you install a module??? – ThisSuitIsBlackNot Mar 22 '17 at 16:34
  • Also, have you heard of [virtualenv](http://docs.python-guide.org/en/latest/dev/virtualenvs/)? You can install all the modules you want in your home directory, you don't need root access. – ThisSuitIsBlackNot Mar 22 '17 at 16:36
  • @ThisSuitIsBlackNot I suppose they're afraid someone could unintentionally take down python for the whole company? ;) – mgoblue92 Mar 22 '17 at 16:39
  • @HåkonHægland In general I would like to avoid reading the whole file into a string/into memory. I could read line by line, but it would be much harder to look for the multi-line pattern – mgoblue92 Mar 22 '17 at 16:41
  • It would help if you showed the pattern and an example of the data you want to match. – ThisSuitIsBlackNot Mar 22 '17 at 16:43
  • @ThisSuitIsBlackNot good suggestion, I updated the post – mgoblue92 Mar 22 '17 at 16:59
  • I don't know your data, but perhaps you can assemble lines and test for a match each time, then clear that once a match is found? Like in [this answer](http://stackoverflow.com/questions/42866626/perl-read-a-large-file-for-use-with-multi-line-regex/42866745#42866745). – zdim Mar 22 '17 at 17:15

4 Answers

You can use the range operator to match everything between two patterns while reading line-by-line:

use strict;
use warnings 'all';

while (<DATA>) {
    print if /^assign / .. /;/;
}

__DATA__
foo
assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;
bar
assign signal1[15:0] = {input1[7:0], input2[7:0]};
baz
assign signal2[34:0] = { 4'b0,
                         subsig0[3:0],
                         subsig1,
                         subsig2,
                         subsig3[18:2],
                         subsig4[5:0]
                       };
qux

Output:

assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;
assign signal1[15:0] = {input1[7:0], input2[7:0]};
assign signal2[34:0] = { 4'b0,
                         subsig0[3:0],
                         subsig1,
                         subsig2,
                         subsig3[18:2],
                         subsig4[5:0]
                       };
ThisSuitIsBlackNot

You can set the input record separator $/ to a semicolon (;) and read record by record. Each record will then hold one statement, including the trailing semicolon. Then matching becomes trivial.
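
A minimal sketch of this idea, assuming the sample statements from the question. The in-memory filehandle, the variable names, and the exact regex are illustrative assumptions; a real script would open the actual file instead.

```perl
use strict;
use warnings;

# Sample input held in memory so the sketch is self-contained.
my $text = <<'END';
foo
assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;
bar
assign signal1[15:0] = {input1[7:0], input2[7:0]};
END

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

my @statements;
{
    local $/ = ';';    # a "line" now ends at each semicolon
    while (my $record = <$fh>) {
        # Text before "assign" (here "foo" and "bar") is part of the
        # record, so capture only the statement itself.
        push @statements, $1 if $record =~ /(assign\s+\S+\s*=\s*[^;]*;)/;
    }
}
close $fh;

print "$_\n---\n" for @statements;
```

With this approach memory use is bounded by the longest stretch of text between two semicolons. A semicolon inside a comment or string would end a record early, so the sketch assumes the input contains none of those.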

simbabque

I can imagine two solutions (without thinking heavily, so maybe I am wrong):

a) Assume a maximum match length, say 1024 characters. 1) Read in twice as many (2048) characters. 2) Try to match. 3) Seek forward by 1024 characters. Repeat from step 1.

b) Use a starting and ending pattern that match in a single line. The part in between can be tested later on. Perl's flip-flop operator can be used in this scenario.
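
Option a) could be sketched roughly as follows. This is a variant, not the exact read-and-seek procedure described: instead of seeking, it keeps a sliding buffer of at most twice the assumed maximum match length and drops text that has already been matched, which avoids reporting a match twice in the overlap. The helper name scan_in_chunks, the sample data, and the small chunk size are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: scan $fh in chunks of $max characters, assuming
# no single match is longer than $max.
sub scan_in_chunks {
    my ($fh, $re, $max) = @_;
    my ($buf, @found) = ('');
    while (read($fh, my $chunk, $max)) {
        $buf .= $chunk;              # buffer holds at most 2 * $max characters
        my $end = 0;
        while ($buf =~ /($re)/g) {
            push @found, $1;
            $end = $+[0];            # offset just past this match
        }
        substr($buf, 0, $end, '') if $end;          # drop text already matched
        substr($buf, 0, length($buf) - $max, '')    # keep only the last $max characters
            if length($buf) > $max;
    }
    return @found;
}

# Usage on an in-memory handle, with a deliberately small chunk size.
my $text = <<'END';
foo
assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;
bar
assign signal1[15:0] = {input1[7:0], input2[7:0]};
END

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @found = scan_in_chunks($fh, qr/assign\s+\S+\s*=\s*[^;]*;/, 64);
close $fh;

print "$_\n---\n" for @found;
```

The sketch relies on the pattern ending in a fixed terminator (the semicolon), so a match found in a partial buffer is the same one that would be found in the whole file. Statements longer than the chosen maximum are silently missed.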

Edit: Since the question got updated, solution b) seems to be a good one.

The starting pattern would be the assignment, and the ending pattern would be the semicolon. Everything in between can be concatenated and later tested for validity.

Example:

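use strict;
use warnings;

my $assignment = "";
while (<>) {
    # The flip-flop returns a sequence number while inside the range; the
    # value ends in "E0" on the line that closes it (the one with the ";").
    # \S+ also accepts names with a bit range such as signal1[15:0].
    if (my $seq = /assign\s+\S+\s*=/ .. /;/) {
        $assignment .= $_;
        next unless $seq =~ /E0$/;    # statement not complete yet

        if ($assignment =~ /full regex/) {    # placeholder for the real pattern
            # do something with the match
        }
        $assignment = "";    # start fresh for the next statement
    }
}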
Matthias

Here is an example using progressive matching with a pre-match pattern:

use feature qw(say);
use strict;
use warnings;

my $pre_match = qr{assign\s+\S+\s+=\s+};
my $regex = qr{($pre_match[^;]+;)};

my $line = "";
my $found_start = 0;
while( <DATA> ) {
    if ( !$found_start && /$pre_match/ ) {
        $line = "";
        $found_start = 1;
    }
    if ( $found_start ) {
        $line .= $_;
        if ( $line =~ /$regex/ ) {
            say "Got match: '$1'";
            $found_start = 0;
            $_ = substr $line, $+[0];
            redo;
        }
    }
}

__DATA__
assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;

assign signal1[15:0] = {input1[7:0], input2[7:0]};

assign signal2[34:0] = { 4'b0,
                         subsig0[3:0],
                         subsig1,
                         subsig2,
                         subsig3[18:2],
                         subsig4[5:0]
                       };

Output:

Got match: 'assign signal0 = (cond1) ? val1 :
                 (cond2) ? val2 :
                           val3;'
Got match: 'assign signal1[15:0] = {input1[7:0], input2[7:0]};'
Got match: 'assign signal2[34:0] = { 4'b0,
                         subsig0[3:0],
                         subsig1,
                         subsig2,
                         subsig3[18:2],
                         subsig4[5:0]
                       };'
Håkon Hægland