How do I grab an unknown number of captures from a pattern?

Question

Let's say I have a pattern:

<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2

As you can see, the pattern could have multiple values (in this case pin has 2 sets of pin names). The amount is unknown.

How would I parse this? Here is what I have so far, but it is not helpful as it does not take into account if the pattern has more than 2 sets of pins.

my $pattern = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";

if ( $pattern =~ m#\<cell\> (\w*=\w*) \<pin\> (\w*=\w*) \<type\> (\w*=\w*)#) {

my $cell_name = $1;
my $pin_name = $2;
my $type_name = $3;
}

As you can see, this will only work if there is exactly one set of pin names. However, I want it to adjust to an unknown number of sets of pin names. I think I would have to build an array or hash, but I am not really sure what the best way is to grab these values, given that the number of pin sets is unknown.

I would like to be able to store the cell_name,pin_name and type_name as an array or hash with the sets of values.

brian d foy · Answer 1 · 2020-07-08T02:24:52.723

Your problem is a bit trickier than "Why do I get the first capture group only?", but some of those ideas may help. The trick is to stop thinking about doing everything in a single pattern.

If that's really your input, I'd be tempted to match groups of things around an =. Matching in list context, such as assigning to a hash, returns the list of matches:

use Data::Dumper;

my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my %values = $input =~ m/ (\S+) = (\S+) /gx;

print Dumper( \%values );

The things before the = become keys and the things after become the values:

$VAR1 = {
          'pin1' => 'pin2',
          'type1' => 'type2',
          'cell1' => 'cell2',
          'pin3' => 'pin4'
        };
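To make the list-context behavior concrete, here is a small sketch (the names `@flat` and `%values` are just for illustration) showing that the match returns one flat list of captures, which the hash assignment then pairs up. One caveat, assuming your real data could repeat a left-hand name: a later pair would overwrite an earlier one in the hash.

```perl
use strict;
use warnings;

my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";

# In list context, /g returns every capture from every match as one flat list
my @flat = $input =~ m/ (\S+) = (\S+) /gx;

# Assigning that list to a hash pairs the items up: key, value, key, value, ...
my %values = @flat;

print scalar(@flat), " captures, ", scalar( keys %values ), " keys\n";
print "$_ => $values{$_}\n" for sort keys %values;
```

If repeated names matter, keeping `@flat` and walking it two items at a time avoids losing anything.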

But life probably isn't that easy. The example names probably don't really have pin, cell, and so on.

There's another thing I like to do, though, because I miss having all that fun with sscan. You can walk a string by matching part of it at a time, then on the next match, start where you left off. Here's the whole thing first:

use v5.10;

use Data::Dumper;

my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";

my %hash;
while( 1 ) {
    state $type;

    if( $input =~ /\G < (.*?) > \s* /xgc ) {
        $type = $1;
        }
    elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
        $hash{$type}{$1}{$2}++;
        }
    else { last }
    }

print Dumper( \%hash );

And the data structure, which really doesn't matter and can be anything that you like:

$VAR1 = {
          'type' => {
                      'type1' => {
                                   'type2' => 1
                                 }
                    },
          'pin' => {
                     'pin1' => {
                                 'pin2' => 1
                               },
                     'pin3' => {
                                 'pin4' => 1
                               }
                   },
          'cell' => {
                      'cell1' => {
                                   'cell2' => 1
                                 }
                    }
        };

But let's talk about this for a moment. First, all of the matches are in scalar context since they are in the conditional parts of the if-elsif-else branches. That means they only make the next match.

However, I've anchored the start of each pattern with \G. This makes the pattern match at the beginning of the string or the position where the previous successful match left off when I use the /g flag in scalar context.
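A stripped-down sketch of just that behavior, using a made-up string (`$text` and `@positions` are illustrative names, not part of the program above): each scalar-context match with /g starts at the position where the previous one stopped.

```perl
use strict;
use warnings;

my $text = "aa bb cc";
my @positions;

# Each pass matches one word (plus trailing whitespace) starting exactly at \G
while ( $text =~ /\G (\w+) \s* /xg ) {
    push @positions, pos($text);
    print "Matched '$1', pos is now ", pos($text), "\n";
}
```

This should report positions 3, 6, and 8 for the three words.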

But, I want to try several patterns, so some of them are going to fail. That's where the /c flag comes in. It doesn't reset the match position on failure. That means the \G anchor won't reset on an unsuccessful match. So, I can try a pattern, and if that doesn't work, start at the same position with the next one.
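Here is a minimal sketch of the difference, again with a made-up string and illustrative variable names. The same pair of matches runs twice, first with /g alone and then with /gc.

```perl
use strict;
use warnings;

my $text = "abc 123";

# Without /c: a failed match resets the position
$text =~ /\G [a-z]+ \s* /xg;       # succeeds, position moves to 4
my $before = pos($text);
$text =~ /\G [a-z]+ /xg;           # fails at "123"
my $without_c = pos($text);        # undef, so the next match starts over at 0

# With /c: a failed match leaves the position alone
$text =~ /\G [a-z]+ \s* /xgc;      # succeeds from the start, position is 4 again
$text =~ /\G [a-z]+ /xgc;          # fails, but the position stays put
my $with_c = pos($text);

print "after success: $before\n";
print "after failure without /c: ", ( defined $without_c ? $without_c : 'undef' ), "\n";
print "after failure with /c: $with_c\n";
```

Without /c the second branch of an if-elsif chain would start from the beginning of the string again, which is why the loop needs it.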

So, when I encounter something in angle brackets, I remember that type. Until I match another thing in angle brackets, that's the type of thing I'm matching. Now when I match (\S+) = (\S+), I can assign the matches to the right type.

To watch this happen, you can output the remembered string position. Each scalar maintains its own cursor and pos(VAR) returns that position:

use v5.10;

use Data::Dumper;

my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";

my %hash;
while( 1 ) {
    state $type;

    say "Starting matches at " . ( pos($input) // 0 );
    if( $input =~ /\G < (.*?) > \s* /xgc ) {
        $type = $1;
        say "Matched <$type>, left off at " . pos($input);
        }
    elsif( $input =~ /\G (\S+) = (\S+) \s* /xgc ) {
        $hash{$type}{$1}{$2}++;
        say "Matched <$1|$2>, left off at " . pos($input);
        }
    else {
        say "Nothing left to do, left off at " . pos($input);
        last;
        }
    }

print Dumper( \%hash );

Before the Dumper output, you now see the global matches in scalar context walk the string:

Starting matches at 0
Matched <cell>, left off at 7
Starting matches at 7
Matched <cell1|cell2>, left off at 19
Starting matches at 19
Matched <pin>, left off at 25
Starting matches at 25
Matched <pin1|pin2>, left off at 35
Starting matches at 35
Matched <pin3|pin4>, left off at 45
Starting matches at 45
Matched <type>, left off at 52
Starting matches at 52
Matched <type1|type2>, left off at 63
Starting matches at 63
Nothing left to do, left off at 63

Finally, as a bonus, here's a recursive descent grammar that does it. It's certainly overkill for what you've provided, but does better in more tricky situations. I won't explain it other than to say it produces the same data structure:

use v5.26;  # the indented heredoc (<<~) below needs Perl 5.26 or later

use Parse::RecDescent;
use Data::Dumper;

my $grammar = <<~'HERE';
    startrule: context_pairlist(s)
    context_pairlist: context /\s*/ pair(s)
    context: '<' /[^>]+/ '>'
        { $::context = $item[2] }
    pair: /[A-Za-z0-9]+/ '=' /[A-Za-z0-9]+/
        { main::build_hash( $::context, @item[1,3] ) }
    HERE

my $parser = Parse::RecDescent->new( $grammar );

my %hash;
sub build_hash {
    my( $context, $name, $value ) = @_;
    $hash{$context}{$name}{$value}++;
    }

my $input = "<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
$parser->startrule( $input );

say Dumper( \%hash );

score 2 · Answer 2 · answered Jul 14 '20 at 11:17

You have space-separated tokens. Some tokens indicate a new scope and some tokens indicate values being set in that scope. I find it most straightforward to just run through the list of tokens in this case:

#!/usr/bin/env perl

use feature 'say';
use strict;
use warnings;

my $s = q{<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2};

my (%h, $k);
while ($s =~ /(\S+)/g) {
    my ($x, $y)= split /=/, $1;

    if (defined $y) {
        push $h{$k}->@*, {key => $x, value => $y};
        next;
    }

    $h{$k = $x} = [];
}

use Data::Dumper;
print Dumper \%h;

Note that this method considers everything that is not an assignment to be a scope marker.
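If that is too permissive for your input, one possible tightening is sketched below. It is a variation, not the code above: it accepts only tokens of the form `<name>` as scope markers and dies on anything it does not recognize. The error message and the decision to die are assumptions; you may prefer to warn and skip.

```perl
use strict;
use warnings;

my $s = q{<cell> cell1=cell2 <pin> pin1=pin2 pin3=pin4 <type> type1=type2};

my ( %h, $k );
for my $token ( split ' ', $s ) {
    if ( $token =~ /\A<[^<>]+>\z/ ) {
        # Only <...> tokens open a new scope
        $h{ $k = $token } = [];
    }
    elsif ( defined $k and $token =~ /\A([^=]+)=(.*)\z/s ) {
        # An assignment is only valid once some scope is open
        push @{ $h{$k} }, { key => $1, value => $2 };
    }
    else {
        die "Unexpected token '$token'\n";
    }
}

printf "%s has %d assignment(s)\n", $_, scalar @{ $h{$_} } for sort keys %h;
```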

The resulting data structure is suitable for feeding into something else. Using {key => $key, value => $value} instead of {$key => $value} 1) allows straightforward handling of assignments in a scope upstream; and 2) allows the same identifier to be assigned multiple times in a scope (giving you an opportunity to detect this if so desired):

$VAR1 = {
          '<cell>' => [
                        {
                          'value' => 'cell2',
                          'key' => 'cell1'
                        }
                      ],
          '<pin>' => [
                       {
                         'value' => 'pin2',
                         'key' => 'pin1'
                       },
                       {
                         'value' => 'pin4',
                         'key' => 'pin3'
                       }
                     ],
          '<type>' => [
                        {
                          'key' => 'type1',
                          'value' => 'type2'
                        }
                      ]
        };
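The duplicate detection mentioned above might look something like this. The data here is hypothetical: it has the same shape as the structure shown, but with `pin1` deliberately assigned twice so there is something to find.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape as above, with pin1 assigned twice
my %h = (
    '<pin>' => [
        { key => 'pin1', value => 'pin2' },
        { key => 'pin3', value => 'pin4' },
        { key => 'pin1', value => 'pin9' },
    ],
);

my @duplicates;
for my $scope ( sort keys %h ) {
    my %seen;    # keys already assigned within this scope
    for my $pair ( @{ $h{$scope} } ) {
        push @duplicates, "$scope $pair->{key}" if $seen{ $pair->{key} }++;
    }
}

print "Assigned more than once: $_\n" for @duplicates;
```

With a plain {$key => $value} hash the second assignment would silently replace the first, so there would be nothing left to check.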

score 1 · Answer 3 · answered Jul 12 '20 at 23:19

Another approach is to split the $pattern into an array where each tag starts a new row. This makes it easier to extract the relevant data as this example shows:

#!/usr/bin/perl

use strict;
use warnings;

my $pattern = "<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2";
my (%cell, %pin, %type);

print "Original pattern =$pattern\n";

# Put each tag at the start of its own line, then split on those newlines
(my $pattern_split = $pattern) =~ s/</\n</g;
my @array = split(/\n/, $pattern_split);

# Extract relevant data (NOTE: finetune regex here) and store them in appropriate hashes indexed by $cnum (cellphone number)
my $cnum;
for (@array) {
  /<cell>\s*\w+=(\w+)/ && do { $cnum = $1; $cell{$cnum} = $cnum };
  /<pin>\s*(.+?)\s*$/ && do { $pin{$cnum} = $1 };
  /<type>\s*\w+=(\w+)/ && do { $type{$cnum} = $1 };
}

my $cn = "1234567890";
print "Result: Cellnumber '$cell{$cn}' has pin_list='$pin{$cn}' and type='$type{$cn}'\n";

Prints:

Original pattern =<cell> cell1=1234567890 <pin> pin1=pin2 pin3=pin4 <type> type1=type2
Result: Cellnumber '1234567890' has pin_list='pin1=pin2 pin3=pin4' and type='type2'
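The pin list comes back as one string, so the individual pairs still need to be pulled apart. A minimal sketch of one way to do that, assuming the names and values are plain word characters as in the example (`%pins` is an illustrative name):

```perl
use strict;
use warnings;

# The string stored in $pin{$cnum} by the code above
my $pin_list = 'pin1=pin2 pin3=pin4';

# A list-context /g match returns all captures, which pair up as a hash
my %pins = $pin_list =~ /(\w+)=(\w+)/g;

print "$_ => $pins{$_}\n" for sort keys %pins;
```

This works for any number of pairs in the list, which is the part the original single-pattern attempt could not handle.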
