In Perl, how can I use one regex grouping to capture more than one occurrence that matches it, into several array elements?

For example, for a string:

var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello

to process this with code:

$string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @array = $string =~ <regular expression here>

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

I would like to see as output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello

What would I use as a regex?

The commonality between things I want to match here is an assignment string pattern, so something like:

my @array = $string =~ m/(\w+=[\w\"\,\s]+)*/;

Where the * is meant to match repeated occurrences of the group (strictly speaking, * means zero or more occurrences).

(I discounted using a split() as some matches contain spaces within themselves (i.e. var3...) and would therefore not give desired results.)

With the above regex, I only get:

0: var1=100 var2

Is this possible in a regex, or is additional code required?

I have already looked at existing answers when searching for "perl regex multiple group", but they did not give me enough clues.

therobyouknow
  • 6
    TLDR, but +1 for doing your homework diligently. – DVK Aug 11 '10 at 15:38
  • BTW, I think that your problem is NOT multiple groups but the matching quotes. Which CAN be handled in Perl RegEx but very very carefully – DVK Aug 11 '10 at 15:40
  • 6
    http://ideone.com/Qvm2u – Alan Moore Aug 11 '10 at 21:33
  • @Alan: That is a great regex! – dawg Aug 12 '10 at 01:06
  • @Alan: works for me too! How very modest of you to post it as a comment rather than an answer though. I might have accepted it as the answer if it had been posted as an answer! Thank you very much though. Your answer is probably the simplest (i.e. neatest) solution, in that it doesn't require supporting code such as a looping construct. – therobyouknow Aug 12 '10 at 10:28
  • P.S. I've not heard of ideone.com before, so @Alan, thanks for introducing me to this as well as your solution. – therobyouknow Aug 12 '10 at 10:37
  • 1
    Having filled in the gaps in your code, I still wasn't sure what part of it your question was about. Being a bit rushed as well, I just posted the link and bailed. Was it the way all the matches are accumulated in the array that you were trying to understand? – Alan Moore Aug 12 '10 at 10:56
  • Yes it was. But also to be able to treat 'internal' spaces i.e. bound by a pair of " as part of a matching pattern and not as a separator i.e. 'external' space that resides between matching patterns. Yours and other solutions do this. Thanks again. – therobyouknow Aug 12 '10 at 11:59
  • +1 just another upvote (to your last comment, can't upvote you twice on your solution comment) to show appreciation for your answer. Yours is probably the simplest of the pure regex solutions here. – therobyouknow Aug 12 '10 at 22:58

9 Answers

48
my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) {
        print "<$1> => <$2>\n";
}

Prints:

<var1> => <100>
<var2> => <90>
<var5> => <hello>
<var3> => <"a, b, c">
<var7> => <test>
<var3> => <hello>

Explanation:

Last piece first: the g flag at the end means that you can apply the regex to the string multiple times. The second time it will continue matching where the last match ended in the string.

Now for the regex: (?:^|\s+) matches either the beginning of the string or a run of one or more whitespace characters. This is needed so that when the regex is applied again, it skips the spaces between the key/value pairs. The ?: means that the content of the parentheses is not captured as a group (we don't need the spaces, only the key and value). \S+ matches the variable name. Then we skip any amount of whitespace around the equals sign. Finally, ("[^"]*"|\S*) matches either a pair of double quotes with any number of non-quote characters in between, or any number of non-space characters for the value. Note that the quote matching is fragile and won't handle escaped quotes properly: for example, "\"quoted\"" would be captured as just "\".
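
As a rough illustration of the "continue where the last match ended" behaviour described above, the following sketch records pos() after each match. The shortened input string and the @pairs array are assumptions made for this example, not part of the original answer.

```perl
use strict;
use warnings;

# Shortened, hypothetical input so the offsets are easy to follow.
my $string = 'var1=100 var3="a, b, c" var7=test';

my @pairs;
while ( $string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g ) {
    # pos() reports where this match ended; the next iteration resumes there.
    push @pairs, [ $1, $2, pos($string) ];
}

# One line per match: key, value, and the offset at which the match ended.
printf "%-4s => %-11s (match ended at offset %d)\n", @$_ for @pairs;
```

Each pass through the loop is a fresh evaluation of the match in scalar context, which is why the offsets keep increasing until the match finally fails.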

EDIT:

Since you really want to get the whole assignment, and not the single keys/values, here's a one-liner that extracts those:

my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;
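
To show how the one-liner plugs into the loop from the question, here is a minimal, self-contained sketch. It also contrasts the two contexts: in list context m//g returns every match at once, while in scalar context it yields one match per evaluation, which is why the earlier version uses while. The $count variable is only there for illustration.

```perl
use strict;
use warnings;

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

# List context: every match of the single capture group is returned at once.
my @list = $string =~ /(?:^|\s+)((?:\S+)\s*=\s*(?:"[^"]*"|\S*))/g;

# Scalar context: one match per evaluation, so a while loop drives it.
my $count = 0;
$count++ while $string =~ /(?:^|\s+)\S+\s*=\s*(?:"[^"]*"|\S*)/g;

print "$_: $list[$_]\n" for 0 .. $#list;
print "scalar-context loop saw $count matches\n";
```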
jkramer
  • 1
    The OP said one regex group was desired, and this captures into 2 regex groups... – dawg Aug 12 '10 at 01:19
  • Right, my fault. You can fix this by adding more parens around the key/value part of the regex. – jkramer Aug 12 '10 at 07:35
  • So you could do: http://ideone.com/7EQgz :- my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello"; my @array = (); while($string =~ /(?:^|\s+)(\S+)\s*=\s*("[^"]*"|\S*)/g) { push( @array, $1."=".$2 ); } for ( my $i = 0; $i < scalar( @array ); $i++ ) { print $i.": ".$array[$i]."\n"; } – therobyouknow Aug 12 '10 at 09:50
  • Or, http://ideone.com/otgyc -- which puts an extra set of brackets around the whole expression: my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello"; while($string =~ /((?:^|\s+)(\S+\s*=\s*"[^"]*"|\S*))/g) { print "<$1>\n"; } – therobyouknow Aug 12 '10 at 09:59
  • +1 to your answer as it does work. I'll have a look at all the other answers and come back soon to select an accepted answer. Thanks very much for your input and for explaining the regex too! – therobyouknow Aug 12 '10 at 10:00
  • In your second comment, I'd not include the leading `(?:^|\s+)` in the surrounding parens, as you're probably not interested in the spaces. Try this: while($string =~ /(?:^|\s+)((\S+)\s*=\s*("[^"]*"|\S*))/g) { push @list, $1; } – jkramer Aug 12 '10 at 10:15
  • 1
    Updated the post with a one-liner that extracts the complete var=value assignments. – jkramer Aug 12 '10 at 10:19
  • +1 for the update, and for it being a one-liner and not requiring supporting code. Thank you. On further thought, splitting them as var, value pairs actually suits my requirement - you read my mind there. But a regex to find the combined pattern is instructive - perhaps if I want to apply the solution to other kinds of patterns for future problems (that I don't know about yet) - so your update facilitates this. – therobyouknow Aug 12 '10 at 12:58
  • Accepted answer. Thank you very much to everyone else - for every other solution that works I've, at least, +1 to your answer and probably +1 your comments. I've accepted @jkramer's answer because it is a pure regex solution (the oneliner at least) - the original requirement. Being a pure regex means it could likely be used in other regex capable languages as well, it's portable. Being a regex means it is transparent and has a fine level of granularity of extensibility/adjustment. @jkramer also explained their answer. But for the others - don't take this as a criticism at all... – therobyouknow Aug 12 '10 at 22:45
  • ...as everyone else provided some really excellent answers too - broadening my horizons & knowledge about how it could be done - particularly instead of purely using regex's: modules & routines. Also the approaches such as parsing. So, thank you to everyone. Many answers make this a rich & broad response to the problem. @jkramer was 1 of first to reply & has got many votes, so following crowd wisdom as well. But I realise everyone has busy schedules so again this is no criticism about replying early. A late, considered, response is very valuable too, as @Sinan Ünür and @gbacon demonstrate. – therobyouknow Aug 12 '10 at 22:51
  • Credit to @Alan Moore too for providing an answer as a comment and for introducing ideone.com! – therobyouknow Aug 12 '10 at 22:51
  • Why is it `while` and not `for` or `foreach` ? It looks strange coming from other languages – BeniBela May 19 '16 at 23:20
11

With regular expressions, use a technique that I like to call tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch).

In this case, you know that a single assignment matches

\b\w+=.+

and you have many of these repeated in $string. Remember that \b means word boundary:

A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.

The values in the assignments can be a little tricky to describe with a regular expression, but you also know that each value will terminate with whitespace—although not necessarily the first whitespace encountered!—followed by either another assignment or end-of-string.

To avoid repeating the assignment pattern, compile it once with qr// and reuse it in your pattern along with a look-ahead assertion (?=...) to stretch the match just far enough to capture the entire value while also preventing it from spilling into the next variable name.

Matching against your pattern in list context with m//g gives the following behavior:

The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

The pattern $assignment uses non-greedy .+? to cut off the value as soon as the look-ahead sees another assignment or end-of-line. Remember that the match returns the substrings from all capturing subpatterns, so the look-ahead's alternation uses non-capturing (?:...). Because the complete pattern then contains no capturing parentheses at all, each match is returned whole, as if there were implicit capturing parentheses around the entire pattern.

#! /usr/bin/perl

use warnings;
use strict;

my $string = <<'EOF';
var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello
EOF

my $assignment = qr/\b\w+ = .+?/x;
my @array = $string =~ /$assignment (?= \s+ (?: $ | $assignment))/gx;

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
  print $i.": ".$array[$i]."\n";
}

Output:

0: var1=100
1: var2=90
2: var5=hello
3: var3="a, b, c"
4: var7=test
5: var3=hello
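
One consequence of tack-and-stretch is that values do not need quotes to contain spaces, since the look-ahead decides where each value stops. The sketch below tries this on a made-up input (the name/city/zip string is purely hypothetical) and uses \z instead of $ so that no trailing newline is needed.

```perl
use strict;
use warnings;

# Hypothetical input: unquoted values that contain spaces.
my $string = 'name=John Smith city=New York zip=10001';

# Tack: a word followed by '='. Stretch: as little as possible after it.
my $assignment = qr/\b\w+=.+?/;

# Stop each value just before the next assignment or at the end of the string.
my @pairs = $string =~ /$assignment(?=\s*\z|\s+$assignment)/g;

print "$_\n" for @pairs;
```

With this input the three elements are name=John Smith, city=New York and zip=10001. A value that itself contains something shaped like word= would be cut short, so this is only suitable when that cannot occur.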
Greg Bacon
  • 1
    thanks for your contribution. Tried your solution, it works for me too -thanks! +1. Also thanks for suggesting your systematic approach/technique to regex building: "tack-and-stretch: anchor on features you know will be there (tack) and then grab what's between (stretch)." I'll read your answer more deeply when I've more time and feedback later. – therobyouknow Aug 12 '10 at 13:47
  • @Rob I'm glad it helps. Enjoy! – Greg Bacon Aug 12 '10 at 15:06
  • +1 That is a really great explanation of how you approached this problem. – dawg Aug 13 '10 at 00:20
8

I'm not saying this is what you should do, but what you're trying to do is write a Grammar. Now your example is very simple for a Grammar, but Damian Conway's module Regexp::Grammars is really great at this. If you have to grow this at all, you'll find it will make your life much easier. I use it quite a bit here - it is kind of perl6-ish.

use Regexp::Grammars;
use Data::Dumper;
use strict;
use warnings;

my $parser = qr{
    <[pair]>+
    <rule: pair>     <key>=(?:"<list>"|<value=literal>)
    <token: key>     var\d+
    <rule: list>     <[MATCH=literal]> ** (,)
    <token: literal> \S+

}xms;

q[var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello] =~ $parser;
die Dumper {%/};

Output:

$VAR1 = {
          '' => 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello',
          'pair' => [
                      {
                        '' => 'var1=100',
                        'value' => '100',
                        'key' => 'var1'
                      },
                      {
                        '' => 'var2=90',
                        'value' => '90',
                        'key' => 'var2'
                      },
                      {
                        '' => 'var5=hello',
                        'value' => 'hello',
                        'key' => 'var5'
                      },
                      {
                        '' => 'var3="a, b, c"',
                        'key' => 'var3',
                        'list' => [
                                    'a',
                                    'b',
                                    'c'
                                  ]
                      },
                      {
                        '' => 'var7=test',
                        'value' => 'test',
                        'key' => 'var7'
                      },
                      {
                        '' => 'var3=hello',
                        'value' => 'hello',
                        'key' => 'var3'
                      }
                    ]
        };
G. Cito
Evan Carroll
  • 2
    +1 because I like the idea of the grammar concept (having studied them to an extent in Computer Science) though I haven't tried this answer. I like the grammar concept because this approach could be applied to solve even more complex problems, particularly in parsing code/data from a legacy obsolete language, for migration into a new language or data driven system/database -- which was actually the reason for my original question (though I didn't mention it at the time.) – therobyouknow Aug 12 '10 at 10:22
  • 1
    I'd welcome you to check out this module. Too often Regexes blur into Grammar -- and if you're going to write a Grammar with a Regex (not a bad idea) then this module is really dead on. Check out [my application of it to parse the `COPY` command in my psql shell](http://github.com/EvanCarroll/pgperlshell/blob/master/bdshell). – Evan Carroll Aug 12 '10 at 15:01
5

A bit over the top maybe, but an excuse for me to look into http://p3rl.org/Parse::RecDescent. How about making a parser?

#!/usr/bin/perl

use strict;
use warnings;

use Parse::RecDescent;

use Regexp::Common;

my $grammar = <<'_EOGRAMMAR_'
INTEGER: /[-+]?\d+/
STRING: /\S+/
QSTRING: /$Regexp::Common::RE{quoted}/

VARIABLE: /var\d+/
VALUE: ( QSTRING | STRING | INTEGER )

assignment: VARIABLE "=" VALUE /[\s]*/ { print "$item{VARIABLE} => $item{VALUE}\n"; }

startrule: assignment(s)
_EOGRAMMAR_
;

$Parse::RecDescent::skip = '';
my $parser = Parse::RecDescent->new($grammar);

my $code = q{var1=100 var2=90 var5=hello var3="a, b, c" var7=test var8=" haha \" heh " var3=hello};
$parser->startrule($code);

yields:

var1 => 100
var2 => 90
var5 => hello
var3 => "a, b, c"
var7 => test
var8 => " haha \" heh "
var3 => hello

PS. Note the double var3, if you want the latter assignment to overwrite the first one you can use a hash to store the values, and then use them later.

PPS. My first thought was to split on '=' but that would fail if a string contained '=' and since regexps are almost always bad for parsing, well I ended up trying it out and it works.

Edit: Added support for escaped quotes inside quoted strings.

nicomen
  • thanks for your answer. I'll need to install Parse module on my particular system to try it out though. I would therefore favour a solution without this dependency. – therobyouknow Aug 12 '10 at 10:04
3

I've recently had to parse x509 certificates "Subject" lines. They had similar form to the one you have provided:

echo 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu' | \
  perl -wne 'my @a = m/(\w+\=.+?)(?=(?:, \w+\=|$))/g; print "$_\n" foreach @a;'

C=HU
L=Budapest
O=Microsec Ltd.
CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu

Short description of the regex:

(\w+\=.+?) - capture words followed by '=' and any subsequent symbols in non greedy mode
(?=(?:, \w+\=|$)) - which are followed by either another , KEY=val or end of line.

The interesting part of the regex used are:

  • .+? - Non greedy mode
  • (?:pattern) - Non capturing mode
  • (?=pattern) zero-width positive look-ahead assertion
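
For readers who prefer a script to a shell one-liner, the same idea might be written as below. This is a sketch rather than the exact code above: it uses the /x flag so each part can carry a comment, writes the literal space as [ ], and swaps $ for \z. The variable names $subject and @fields are made up for the example.

```perl
use strict;
use warnings;

my $subject = 'Subject: C=HU, L=Budapest, O=Microsec Ltd., CN=Microsec e-Szigno Root CA 2009/emailAddress=info@e-szigno.hu';

my @fields = $subject =~ m{
    ( \w+ = .+? )            # key, equals sign, then as little of the value as possible
    (?= , [ ] \w+ = | \z )   # look ahead for a comma, a space and the next key, or the end
}gx;

print "$_\n" for @fields;
```

Because the look-ahead is zero-width, the ", " separators are never consumed, and only the single capturing group ends up in @fields.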
Delian Krustev
2

This one will provide you also common escaping in double-quotes as for example var3="a, \"b, c".

@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g;

In action:

echo 'var1=100 var2=90 var42="foo\"bar\\" var5=hello var3="a, b, c" var7=test var3=hello' |
perl -nle '@a = /(\w+=(?:\w+|"(?:[^\\"]*(?:\\.[^\\"]*)*)*"))/g; $,=","; print @a'
var1=100,var2=90,var42="foo\"bar\\",var5=hello,var3="a, b, c",var7=test,var3=hello
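
The pattern is dense, so here is a commented variant written with /x. Note that this is a slight simplification, not the exact regex above: it drops the outermost (?:...)* around the quoted body, which should match the same strings while avoiding a nested quantifier. The $pair and $line names and the shortened input are assumptions for the example.

```perl
use strict;
use warnings;

my $pair = qr{
    \w+ =                     # key and equals sign
    (?:
        \w+                   # bare word value
      |
        " [^\\"]*             # opening quote, then ordinary characters
        (?: \\. [^\\"]* )*    # an escaped character, then more ordinary ones
        "                     # closing quote
    )
}x;

# A single-quoted heredoc keeps every backslash exactly as typed.
my $line = <<'EOF';
var1=100 var42="foo\"bar\\" var3="a, b, c" var7=test
EOF
chomp $line;

my @pairs = $line =~ /($pair)/g;
print "$_\n" for @pairs;
```

The heredoc is used only to avoid Perl's own backslash processing in string literals, which would otherwise obscure what the regex actually sees.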
Hynek -Pichi- Vychodil
2
#!/usr/bin/perl

use strict; use warnings;

use Text::ParseWords;
use YAML;

my $string =
    "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my @parts = shellwords $string;
print Dump \@parts;

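# Turn each "key=value" token into a one-pair hash; the limit of 2 keeps any '=' inside the value.
@parts = map { +{ split /=/, $_, 2 } } @parts;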

print Dump \@parts;
Sinan Ünür
  • 1
    I think this is better done with `Text::ParseWords` rather than `Text::Shellwords`. `Text::ParseWords` has similar functionality but is part of the Perl core. – dawg Aug 12 '10 at 01:32
  • 1
    @drewk Thanks for the reminder. I always confuse the two. I'll update the example to use `Text::ParseWords`. – Sinan Ünür Aug 12 '10 at 01:58
  • Works fine for me. See output further on in this comment. This depends on a module - I was lucky on my machine that this is present but for some Perl modules this is not always guaranteed on every distribution/platform. Here's the output: --- - var1=100 - var2=90 - var5=hello - 'var3=a, b, c' - var7=test - var3=hello --- - var1: 100 - var2: 90 - var5: hello - var3: 'a, b, c' - var7: test - var3: hello – therobyouknow Aug 12 '10 at 10:36
  • 1
    @Rob: I think that `Text::ParseWords` has been part of the core distribution since 5.00. The shellwords functionality is very useful and prior to 5.00 many used a shell eval to get that even with the security risk. Don't need to do that anymore since 5.00. – dawg Aug 12 '10 at 16:27
  • 1
    @Rob: Ask yourself which one is more maintainable: A complicated pattern, a custom parser or a core module dependency. – Sinan Ünür Aug 12 '10 at 16:43
  • +1 for your answer. Thank you. Should have given it earlier. Good to know that the module has been in since 5.00 - that is a long time ago. So we should be safe with that. – therobyouknow Aug 12 '10 at 19:02
  • +1 re: your last comment about maintainability. Using a pure regex solution does have the advantage in portability across different languages for re-use. Though, yes my question was specifically for Perl. But pure regex answers should help developers in other languages too. In fact my 'homework' searching SO for an existing solution included looking at solutions in other languages to port across, though didn't find any that did exactly what I wanted hence asking the question here. – therobyouknow Aug 12 '10 at 22:55
  • @Rob: The source code for `Text::ParseWords` is available. http://cpansearch.perl.org/src/CHORNY/Text-ParseWords-3.27/ParseWords.pm See the pattern in `parse_line`. Isn't it good that someone else did that once so others can use it many, many, many times? – Sinan Ünür Aug 12 '10 at 23:08
  • I agree - reuse is a good thing. I'll check out the link as I'm very interested in parsers at the moment in general, for other similar problems where I'm migrating legacy code into a cleaner more modern data driven solution. So thank you very much @Sinan Ünür for your solution! – therobyouknow Aug 13 '10 at 08:38
2

You asked for a RegEx solution or other code. Here is a (mostly) non regex solution using only core modules. The only regex is \s+ to determine the delimiter; in this case one or more spaces.

use strict; use warnings;
use Text::ParseWords;
my $string="var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";  

my @array = quotewords('\s+', 0, $string);

for ( my $i = 0; $i < scalar( @array ); $i++ )
{
    print $i.": ".$array[$i]."\n";
}

Or you can execute the code HERE

The output is:

0: var1=100
1: var2=90
2: var5=hello
3: var3=a, b, c
4: var7=test
5: var3=hello
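
Note that with a false second argument the quotes around "a, b, c" are stripped, unlike the output requested in the question. As a sketch of one way to keep them, the $keep argument can be set to a true value; the split into key/value pairs afterwards is an extra step added here for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

my $string = 'var1=100 var2=90 var5=hello var3="a, b, c" var7=test var3=hello';

# A true second argument ($keep) leaves the quotes in each token.
my @tokens = quotewords('\s+', 1, $string);

# Split each token on the first '=' only, so the value may itself contain '='.
my @pairs = map { [ split /=/, $_, 2 ] } @tokens;

printf "%d: %s => %s\n", $_, @{ $pairs[$_] } for 0 .. $#pairs;
```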

If you really want a regex solution, Alan Moore's comment linking to his code on IDEone is the gas!

dawg
0

It is possible to do this with regexes, however it's fragile.

my $string = "var1=100 var2=90 var5=hello var3=\"a, b, c\" var7=test var3=hello";

my $regexp = qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/x;
my @matches = $string =~ /$regexp/g;
szbalint
  • Might need to add something missing or correct something here, as I get an error message when I run it: http://ideone.com/4bR1b and also on my own machine too. – therobyouknow Aug 12 '10 at 10:24
  • Bareword found where operator expected at ./regex_solution.pl line 8, near "qr/( (?:\w+=[\w\,]+) | ( syntax error at ./regex_solution.pl line 8, near "qr/( (?:\w+=[\w\,]+) | (?:\w+=\"[^\"]*\") )/xg" Execution of ./regex_solution.pl aborted due to compilation errors. – therobyouknow Aug 12 '10 at 10:24