5

I would like to create a regex for the following.

I have some text like the following:

field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating

Basically I'm trying to create a regex that would get all text from the start of the first "field =" to the start of the second "field = ". It has to skip past the field text used in the function call.

I currently have the following:

my @overall = ($string =~ m/field\s*=.*?/gis);

However, this just obtains the text "field = ". Without the "?", it grabs everything from the first instance all the way to the very last one.

I also tried:

my @overall = ($string =~ m/field\s*=.*field\s*=/gis);

However, that will then only get me every other block, since the match consumes the second "field =" rather than leaving it for the next match. Any suggestions?

G. Cito
Coco
  • This sort of problem is hard for a regular expression. I would suggest instead writing a grammar using [Parse::RecDescent](https://metacpan.org/pod/Parse::RecDescent) or [Regexp::Grammars](https://metacpan.org/pod/Regexp::Grammars). – Schwern Oct 26 '15 at 21:30
  • You can only do [something like this with a tempered greedy token](https://regex101.com/r/fC8bS2/1). Skipping is not allowed, unless you want to use two look-aheads and capture some discontinuous chunks of text. – Wiktor Stribiżew Oct 26 '15 at 21:35
  • This is easily done with regex. The problem is your examples demonstrate a complex language. Without details, it would seem more appropriate for a parser of that particular language. –  Oct 26 '15 at 21:42

5 Answers

5

The easiest way I can see to do this is to split $string on the /^\s*field\s*=/ expression. Since we want to keep the 'field =' portion of the text in each chunk, we can break on a look-ahead:

foreach ( split /(?=^\s*field\s*=)/ms, $string ) {
    say "\$_=[\n$_]";
}

Thus, it breaks at the start of every line where 'field' is the first non-whitespace text, followed by any amount of whitespace and then an '='.

The output is:

$_=[
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";
]
$_=[

]
$_=[
field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
]
$_=[

]
$_=[
field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating
]

Almost what we want. But it leaves an artifact: the blank line between the chunks we do want shows up as its own element. I'm not sure how to stop split from producing it, so we'll just filter out the all-whitespace strings:

foreach ( grep { m/\S/ } split /(?=^\s*field\s*=)/ms, $string ) {
    say "\$_=[\n$_]";
}

And then it yields:

$_=[
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";
]
$_=[
field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
]
$_=[
field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";

.... keeps repeating
]

Which you can work with.
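
For reference, here is the whole approach as a tiny self-contained script (a sketch of mine, not part of the original answer; it slurps a trimmed copy of the question's sample text from __DATA__):

use strict;
use warnings;
use feature 'say';

my $string = do { local $/; <DATA> };   # slurp the sample text

# Split at each line that starts (after optional whitespace) with "field =",
# keeping the delimiter via the look-ahead, then drop all-whitespace chunks.
foreach ( grep { m/\S/ } split /(?=^\s*field\s*=)/ms, $string ) {
    say "\$_=[\n$_]";
}

__DATA__
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";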

Axeman
  • This is the best of the regex-based answers. – Schwern Oct 30 '15 at 06:31
  • ++ When ancient and humble built in functions (like [`split()`](http://perldoc.perl.org/functions/split.html) - and many others) take a regular expression as an argument, you get massive power in **one line**. You *could* have put it all in one line ... right? :-) – G. Cito Oct 30 '15 at 14:13
  • @G.Cito, Well, I consider the `grep { m/\S/ } split /(?=^\s*field\s*=)/ms, $string` the part that does it. And that's all on one line. The for loop just displays it. And I have to say, I don't like the suffix for loop quite as much as the structured one, and I rarely use `map` in a void context. But yeah: `say "\$_=[\n$_]" for grep { m/\S/ } split /(?=^\s*field\s*=)/ms, $string` is a valid Perl one-liner. – Axeman Oct 30 '15 at 15:44
4

The quick and dirty way is to define a regex which mostly matches the field assignment, then use that in another regex to match what's between them.

my $field_assignment_re = qr{^\s* field \s* = \s* [^;]+ ;}msx;

$code =~ /$field_assignment_re (.*?) $field_assignment_re/msx;
print $1;

The downside of this approach is that it might match inside quoted strings and the like.
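
As a variation on the same quick-and-dirty idea (a sketch of mine, not battle-tested), you can grab every block in one pass by capturing from each field assignment up to, but not including, the next one; each chunk after the first may keep a stray leading newline:

my @blocks = $code =~ /($field_assignment_re .*?) (?= $field_assignment_re | \z)/gmsx;
print "$_\n---\n" for @blocks;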


You can sort of parse code with regular expressions, but parsing it correctly is beyond normal regular expressions. This is because of the large number of balanced delimiters (i.e. parens and braces) and escapes (e.g. "foo \"bar\""). To get it right you need to write a grammar.

Perl 5.10 added recursive descent matching to make writing grammars possible. It also added named capture groups to keep track of all those rules. Now you can write a recursive grammar with Perl 5.10 regexes.
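
To give a feel for those two 5.10 features on their own, here is a small sketch of mine (not part of the original answer): a named, recursive pattern that matches balanced parentheses, something a non-recursive regex cannot do:

use v5.10;

# (?&parens) recurses into the named group; ++ is a possessive
# quantifier that avoids pathological backtracking.
my $balanced = qr{
    (?<parens>
        \(
            (?: [^()]++ | (?&parens) )*
        \)
    )
}x;

say "matched: $+{parens}"
    if 'funcCall(.., field, (nested), ...);' =~ $balanced;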

It's still kinda clunky, so Regexp::Grammars adds some enhancements to make writing regex grammars much easier.

Writing a grammar is about starting at some point and filling in the rules. Your program is a bunch of Statements. What's a Statement? An Assignment, or a FunctionCall followed by a ;. What's an Assignment? Variable = Expression. What are Variable and Expression? And so on...

use strict;
use warnings;
use v5.10;

use Regexp::Grammars;

my $parser = qr{
  <[Statement]>*

  <rule: Variable>      \w+
  <rule: FunctionName>  \w+
  <rule: Escape>        \\ .
  <rule: Unknown>       .+?
  <rule: String>        \" (?: <Escape> | [^\"] )* \"
  <rule: Ignore>        \.\.\.?
  <rule: Expression>    <Variable> | <String> | <Ignore>
  <rule: Assignment>    <Variable> = <Expression>
  <rule: Statement>     (?: <Assignment> | <FunctionCall> | <Unknown> ); | <Ignore>
  <rule: FunctionArguments>     <[Expression]> (?: , <[Expression]> )*
  <rule: FunctionCall>  <FunctionName> \( <FunctionArguments>? \)
}x;

my $code = <<'END';
field = "test \" string";
alkjflkj;
type =  INT;
funcCall(.., field, "escaped paren \)", ...);
...
text = "desc";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";
END

$code =~ $parser;

This is far more robust than a regex. The inclusion of:

<rule: Escape>        \\ .
<rule: String>        \" (?: <Escape> | [^\"] )* \"

handles otherwise tricky edge cases like:

funcCall( "\"escaped paren \)\"" );

It all winds up in %/. Here's the first part.
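
The dump below can be produced with Data::Dumper (my assumption about how this output was generated):

use Data::Dumper;
print Dumper \%/;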

$VAR1 = {
          'Statement' => [
                           {
                             'Assignment' => {
                                               'Variable' => 'field',
                                               'Expression' => {
                                                                 'String' => '"test string"',
                                                                 '' => '"test string"'
                                                               },
                                               '' => 'field = "test string"'
                                             },
                             '' => 'field = "test string";'
                           },
          ...

Then you can loop through the Statement array looking for Assignments where the Variable matches field.

my $seen_field_assignment = 0;
for my $statement (@{$/{Statement}}) {
    # Check if we saw 'field = ...'
    my $variable = ($statement->{Assignment}{Variable} || '');
    $seen_field_assignment++ if $variable eq 'field';

    # Bail out if we saw the second field assignment
    last if $seen_field_assignment > 1;

    # Print if we saw a field assignment
    print $statement->{''} if $seen_field_assignment;
}

This might seem like a lot of work, but it's worth learning how to write grammars. There are a lot of problems which can be half-solved with regexes but fully solved with a simple grammar. In the long run, the regex will get more and more complicated and never quite cover all the edge cases, while a grammar is easier to understand and can be made complete.

The downside of this approach is that your grammar might not be complete and it might trip up, though the Unknown rule will take care of most of that.

Alan Moore
Schwern
  • ++ excellent post. Thanks! What do you think of the PEG approach and the [`Pegex`](https://metacpan.org/pod/distribution/Pegex/lib/Pegex.pod) implementation of it on CPAN compared to `Regexp::Grammars`? My post is just a brief introduction but I wonder if Perl6's use of grammars is going to influence perl5 development with regard to the use of grammars to enhance the power of regexps. – G. Cito Oct 30 '15 at 05:41
  • @G.Cito I don't have an opinion about which grammar library to use. I don't have much experience with them; this is my first time using Regexp::Grammars and the first I've seen of Pegex. I do think Perl 6 means we'll see more use of grammars in Perl 5 (it's already happened), and also as more people become aware of Perl 5's named and recursive patterns. But I haven't been keeping up on current events. – Schwern Oct 30 '15 at 06:29
1

For overall "whipupitude" regarding your sample data I think passing a pattern to split is going to be the easiest. But, as @Schwern points out, when things get more complex using a grammar helps.

For fun I created an example script that parses your data using a parsing expression grammar built with Pegex. Both Regexp::Grammars and Regexp::Common have the advantage of widespread use and familiarity when it comes to quickly constructing a grammar. There's a low barrier to entry if you already know perl and need a simple but super-powered version of regular expressions for your project. The Pegex approach tries to make it easy to construct and use grammars with perl. With Pegex you build a parsing expression grammar out of regular expressions:

"Pegex... gets it name by combining Parsing Expression Grammars (PEG), with Regular Expessions (Regex). That's actually what Pegex does." (from the POD).

Below is a standalone script that parses a simplified version of your data using a Pegex grammar.


First the script defines $grammar "inline" as a multi-line string and uses it to ->parse() the sample data, which it reads from the <DATA> handle. Normally the parsing grammar and the data would reside in separate files. The grammar's "atoms" and regular expressions are compiled by the pegex function into a "tree" (a hash of regular expressions) that is used to parse the data. The parse() method returns a data structure that can be used by perl. Adding use DDP and p $ast to the script can help you see what structures (AoH, HoH, etc.) are being returned by your grammar.

#!/usr/bin/env perl
use v5.22;
use experimental qw/ refaliasing postderef / ;
use Pegex;

my $data = do { local $/; <DATA> } ;

my $grammar = q[
%grammar thing
%version 0.0.1

things: +thing*
thing: (+field +type +text)+ % end 

value: / <DOUBLE> (<ANY>*) <DOUBLE> /
equals: / <SPACE> <EQUAL>  <SPACE> /
end: / BLANK* EOL / 

field: 'field' <equals> <value> <SEMI> <EOL>
type:  'type' <equals> /\b(INT|FLOAT)\b/ <SEMI> <EOL>
func:  / ('funcCall' LPAREN <ANY>* RPAREN ) / <SEMI> <EOL> .( <DOT>3 <EOL>)*
text:  'text' <equals> <value> <SEMI> <EOL>    
];

my $ast = pegex($grammar, 'Pegex::Tree')->parse($data);

for \my @things ( $ast->[0]->{thing}->@* ) {
  for \my %thing ( @things ) { 
    say $thing{"text"}[0] if $thing{"text"}[0] ; 
    say $thing{"func"}[0] if $thing{"func"}[0] ; 
  }
}

At the very end of the script a __DATA__ section holds the content of the file to parse:

__DATA__
field = "test string 0";
type = INT;
funcCall(.., field, ...);
...
text = "desc 1";

field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";

field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";    

You could, of course, just as easily read the data from a file handle or STDIN in the classic perl fashion or, for example, using IO::All, where we could do:

use IO::All; 
my $infile < io shift ; # read from STDIN
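
For comparison, the "classic perl fashion" slurp mentioned above might look like this (a sketch of mine; it reads the file named on the command line):

my $infile = do {
    local $/;                                        # slurp mode
    open my $fh, '<', shift @ARGV or die "open: $!";
    <$fh>;
};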

You can install Pegex from CPAN and then download and play with the gist to get a feel for how Pegex works.

With Perl6 we are getting a powerful and easy-to-use "grammar engine" that builds on Perl's strengths in handling regular expressions. If grammars start to get used in a wider range of projects, these developments are bound to feed back into perl5 and lead to even more powerful features.

The PEG part of Pegex and its cross-language development allow grammars to be exchanged between different programming language communities (Ruby, JavaScript). Pegex can be used in fairly simple scenarios, and fits nicely into more complex modules that require parsing capabilities. The Pegex API allows for easy creation of a rule-derived set of functions that can be defined in a "receiver class". With a receiver class you can build sophisticated methods for working with your parsed data that allow you to "munge while you parse", and even modify the grammar on the fly(!). More examples of working grammars that can be re-purposed and improved, and a growing selection of modules that use Pegex, will help it become more useful and powerful.

Perhaps the simplest approach to trying out the Pegex framework is Pegex::Regex, which allows you to use grammars as conveniently as regexps, storing the results of your parse in %/. The author of Pegex calls Pegex::Regex the "gateway drug" to parsing expression grammars and notes it is "a clone of Damian Conway's Regexp::Grammars module API" (covered by @Schwern in his answer to this question).

It's easy to get hooked.

G. Cito
0

This is hard for a regex. Fortunately, that isn't the only tool in your box.

It looks like you have a blank line between each record. If so, you can do this easily by setting $/ to "\n\n". Then you can read your file with a while loop, and on each iteration $_ will be set to the chunk you are trying to handle.

Failing that, you could set it to "field =", or perhaps even just use split.
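
A minimal sketch of the $/ approach (the filename records.txt is my invention, and I'm assuming the literal blank-line separator shown in the question):

use strict;
use warnings;

local $/ = "\n\n";    # treat a blank line as the record separator

open my $fh, '<', 'records.txt' or die "Can't open records.txt: $!";
while ( my $record = <$fh> ) {
    chomp $record;                   # strips the trailing "\n\n"
    next unless $record =~ /\S/;     # skip empty chunks
    print "record:\n$record\n\n";
}
close $fh;

Paragraph mode proper is $/ = "" (or the -00 switch mentioned in the comments below), which also swallows runs of blank lines.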

Sobrique
  • `$/` is clever, but it is limited to strings. The text looks like some sort of programming language, so the spaces between `field` and `=` are likely to change. – Schwern Oct 26 '15 at 22:38
  • I was working on the sample data given being blank line delimited. You may be right, but given what we have to work with, that may also be overkill. – Sobrique Oct 26 '15 at 22:57
  • Of course! After all perl started out being an improvement on `awk` :-) Hmm but `perl -MData::Dumper -ne '$/="\n\n" ; push @arr, [$_] ;}{ print Dumper @arr' data.txt` gives me one extra split though (the first `field` line ends up on its own). – G. Cito Oct 30 '15 at 14:48
  • @Schwern interestingly [perlvar](http://perldoc.perl.org/perlvar.html) says: "As of 5.19.9 setting `$/` to any other form of reference will throw a fatal exception. This is in preparation for supporting new ways to set `$/` in the future." Would be neat to be able to pass a regexp as a field separator - I see a whole new generation of obfuscated one-liners :-) – G. Cito Oct 30 '15 at 14:53
  • @Sobrique oops! Forgot I needed to set `$/` in a `BEGIN{}` block and then `chomp` to get it to work properly: `perl -MDDP -ne 'BEGIN { $/="\n\n";} chomp; push @arr, $_ ;}{ p @arr' data.txt` – G. Cito Oct 30 '15 at 15:02
  • That would be `perl -MData::Dumper -ne 'BEGIN { $RS="\n\n";} chomp; push @arr, [$_] ;}{ print Dumper @arr' data.txt` to stick with core modules. As well, [`perlrun`](http://perldoc.perl.org/perlrun.html) shows how to set `$/` with the `-0` (zero) switch: `perl -MData::Dumper -00 -ne 'chomp; push @arr, [$_] ;}{ print Dumper @arr' data.txt`. Being able to *easily* create parsing grammars is a pretty new thing, but `$/` and the `-0` switch date from perl4 or earlier, so answers to this question have covered a wide swath of history - just like perl. – G. Cito Oct 30 '15 at 15:24
0

This is trivial with awk:

$ awk -v RS= 'NR==1' file
field = "test string";
type =  INT;
funcCall(.., field, ...);
...
text = "desc";

Use paragraph mode (RS=) and print the first record (NR==1).

karakfa
  • This relies on each `field = ...` happening to have a blank line in front of it. The text is some sort of programming language; this assumption will not hold. – Schwern Oct 26 '15 at 22:37
  • `perl -00 -nE 'chomp; push @_, $_ ;}{ say $_[0]' file` for the purpose of archaeological comparison :-) – G. Cito Oct 30 '15 at 15:57
  • `perl6 -e 'say slurp.split("\n\n")[0]' file` ... archaeology and evolutionary adaptation. – G. Cito Oct 30 '15 at 19:16
  • This is more readable. A couple more evolutions later it may resemble awk. – karakfa Oct 30 '15 at 19:29
  • The awk is only readable if you know awk. The only way I can read awk is to pipe it into `a2p` and read the Perl 5 output. If I started programming on Unix systems instead of DOS/Windows I may have learned awk and sed before Perl, but now that I know Perl 5&6 it doesn't seem worth it as everything you can do easily with awk and sed you can do easily with Perl (especially since it copied a few ideas from them). – Brad Gilbert Nov 01 '15 at 16:49
  • That's right, but what I also meant is that the code is a direct translation of the description: use paragraph mode (`-v RS=`), print the first record (`NR==1`). Once you know what those mean, it's trivial to solve certain programming tasks with `awk`. – karakfa Nov 01 '15 at 23:39