How to extract lines between a starting and ending regular expression in Perl

Question

I want to use Perl to loop through a file (or an array), start processing elements when one regular expression matches, and stop processing when another regular expression matches.

One way to do it is to use a variable as a flag (set to 1 when the starting regex matches, and to 0 when the ending regex matches).

For example, the following works but is awfully ugly!

use strict;

my @file = (
    "<title>List of widgets</title>\n",
    "<widgets>\n",
    "   <button>widget001.xml</button>\n",
    "   <textArea>widget002.xml</textArea>\n",
    "   <menu>widget002.xml</menu>\n",
    "</widgets>\n",
    "<footer>\n",
    "   This is the footer\n",
    "</footer>\n",
);

my $in_list_widgets = 0;
for my $line (@file) {
    if ($line=~m%<widgets%) {
        $in_list_widgets = 1;
    } elsif ($line=~m%</widgets>%) {
        $in_list_widgets = 0;
    } else {
        if ($in_list_widgets == 1) {
            &process_line($line);
        } else {
            #Do nothing
        }
    }
}

sub process_line {
    my $line = shift;
    print $line;
}

What would be a more elegant way to do it and still get the same result?

<button>widget001.xml</button>
<textArea>widget002.xml</textArea>
<menu>widget002.xml</menu>

Thanks

Possible duplicate of [How can I grab multiple lines after a matching line in Perl?](http://stackoverflow.com/questions/1040657/how-can-i-grab-multiple-lines-after-a-matching-line-in-perl) — pilcrow, Jun 24 '16 at 13:46
This looks like XML. Is it XML? Because if so - an XML parser can do it quite trivially. — Sobrique, Jul 18 '16 at 14:32

score 1 · Accepted Answer · answered Jul 18 '16 at 14:38

On the off chance this is XML (and it looks like it is), I would suggest an XML parser.

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> parse ( \*DATA );
$twig -> set_pretty_print('indented');

$_ -> print for map { $twig -> findnodes("//$_",0) } qw ( button textArea menu );

__DATA__
<root>
  <title>List of widgets</title>
  <widgets>
    <button>widget001.xml</button>
    <textArea>widget002.xml</textArea>
    <menu>widget002.xml</menu>
  </widgets>
  <footer>
   This is the footer
</footer>
</root>

Outputs:

<button>widget001.xml</button>
<textArea>widget002.xml</textArea>
<menu>widget002.xml</menu>

Or for the sake of clarity:

my $twig = XML::Twig -> new -> parsefile('your_file'); 
foreach my $widgets ( $twig -> root -> children('widgets') ) {
   foreach my $child ( $widgets -> children ) { 
      $child -> print;
      print "\n";
   }
}

answered Jul 18 '16 at 14:38

Sobrique


Nice code. An XML parser does the trick nicely indeed. Actually, I posted this question to remember and share the flip-flop operator for a more generic case (not only XML). Thanks again for your time. – Jean-Francois T. Jul 19 '16 at 03:51
The range operator is useful, but a horrible choice for XML, because of formatting, nested nodes and context. – Sobrique Jul 19 '16 at 05:34
True. Unless you are 200% sure there are no nested nodes (e.g. no `` in ``). – Jean-Francois T. Jul 19 '16 at 07:16
Even then. The point of XML is that it is a data transfer language. The spec is strict on what it does or doesn't allow. Someone generating XML will be following that spec. Which means if the person processing the XML isn't, then one day it might break mysteriously because of an upstream data change in an otherwise perfectly valid way, according to that spec. That's really bad design. – Sobrique Jul 19 '16 at 07:19
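To make the formatting concern in these comments concrete, here is a minimal sketch. It assumes the line-based range approach discussed in this thread, wrapped in a hypothetical helper named lines_between_widgets (not from the original posts), and feeds it the same XML twice: once with one element per line, and once serialized on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: the line-based range approach from this thread,
# collecting the lines found between <widgets> and </widgets>.
sub lines_between_widgets {
    my @lines = @_;
    my @inside;
    for my $line (@lines) {
        if ($line =~ m%<widgets% .. $line =~ m%</widgets>%) {
            # Skip the delimiter lines themselves.
            push @inside, $line if $line !~ m%</?widgets%;
        }
    }
    return @inside;
}

# One element per line, as in the question.
my @multi_line = (
    "<widgets>\n",
    "   <button>widget001.xml</button>\n",
    "   <textArea>widget002.xml</textArea>\n",
    "   <menu>widget002.xml</menu>\n",
    "</widgets>\n",
);

# The same XML on a single line: still valid XML, but the only line
# contains both tags, so it is excluded as a delimiter line.
my @single_line = (
    "<widgets><button>widget001.xml</button>"
  . "<textArea>widget002.xml</textArea>"
  . "<menu>widget002.xml</menu></widgets>\n",
);

printf "multi-line:  %d line(s)\n", scalar lines_between_widgets(@multi_line);
printf "single-line: %d line(s)\n", scalar lines_between_widgets(@single_line);
```

The first call should report 3 lines and the second 0, even though both inputs describe the same document. A parser-based approach does not depend on where the line breaks fall.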

score 0 · Answer 2 · answered Jun 24 '16 at 10:54

You could use the range operator in scalar context, with the syntax <match_regex_1> .. <match_regex_2>, like this:

use strict;

my @file = (
    "<title>List of widgets</title>\n",
    "<widgets>\n",
    "   <button>widget001.xml</button>\n",
    "   <textArea>widget002.xml</textArea>\n",
    "   <menu>widget002.xml</menu>\n",
    "</widgets>\n",
    "<footer>\n",
    "   This is the footer\n",
    "</footer>\n",
);

for my $line (@file) {
    if ($line=~m%<widgets% .. $line=~m%</widgets>%) {
        &process_line($line) if ($line!~m%<(widgets|/widgets>)%);
    } else {
        #Do nothing
    }
}

sub process_line {
    my $line = shift;
    print $line;
}

Some explanations:

if ($line=~m%<widgets% .. $line=~m%</widgets>%): the condition becomes true on the line where the first regex matches and stays true up to and including the line where the second regex matches, so the block runs for every line in that range.
&process_line($line) if ($line!~m%<(widgets|/widgets>)%);: without the if ($line!~m%..., the delimiter lines <widgets> and </widgets> would be processed as well.
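As a possible variation on the second point, the delimiter lines can also be skipped without a second regex. In scalar context the range operator returns a sequence number (1, 2, 3, ...) while the range is active, and the last number has the string "E0" appended. The sketch below relies on that documented behavior; extract_between is a hypothetical helper name introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): return the lines
# strictly between the line matching $start and the line matching $end.
sub extract_between {
    my ($start, $end, @lines) = @_;
    my @inside;
    for my $line (@lines) {
        # Scalar-context range: "" outside the range, otherwise a
        # sequence number; the final one looks like "5E0".
        my $seq = ($line =~ $start) .. ($line =~ $end);
        next unless $seq;           # outside the range
        next if $seq == 1;          # opening delimiter line
        next if $seq =~ /E0\z/;     # closing delimiter line
        push @inside, $line;
    }
    return @inside;
}

my @file = (
    "<title>List of widgets</title>\n",
    "<widgets>\n",
    "   <button>widget001.xml</button>\n",
    "   <textArea>widget002.xml</textArea>\n",
    "   <menu>widget002.xml</menu>\n",
    "</widgets>\n",
    "<footer>\n",
    "   This is the footer\n",
    "</footer>\n",
);

# Prints the three indented widget lines only.
print extract_between(qr/<widgets\b/, qr{</widgets>}, @file);
```

One caveat: each .. operator keeps its own state between calls to the subroutine that contains it, so if the input ends before the closing regex matches, the next call would start out inside the range.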

Hope it can help.

It's called the _flip flop operator_ (in case you want to google for it or something). — PerlDuck, Jun 24 '16 at 13:46
