The quick and dirty way is to define a regex which mostly matches the field assignment, then use that in another regex to match what's between them.
my $field_assignment_re = qr{^\s* field \s* = \s* [^;]+ ;}msx;
$code =~ /$field_assignment_re (.*?) $field_assignment_re/msx;
print $1;
The downside of this approach is it might match quoted strings and the like.
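To make that downside concrete, here is a minimal sketch using only core Perl. The input is hypothetical (it is not from the question): it contains a multi-line string literal whose second line happens to look like a field assignment, and the quick regex stops there instead of at the real one.

```perl
use strict;
use warnings;

# Same pattern as above: matches a line of the form "field = ...;"
my $field_assignment_re = qr{^\s* field \s* = \s* [^;]+ ;}msx;

# Hypothetical input: the string assigned to "text" spans two lines, and
# its second line merely looks like a field assignment.
my $code = <<'END';
field = "first";
text = "
field = not code; just text inside a string";
type = INT;
field = "second";
END

# Capture whatever sits between the first two things that look like
# field assignments.
my ($between) = $code =~ /$field_assignment_re (.*?) $field_assignment_re/msx;

# The capture ends inside the string literal, so "type = INT;" is lost.
print $between;
```

The regex has no notion of being inside a string, so it cannot tell the fake assignment from the real one.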
You can sort of parse code with regular expressions, but parsing it correctly is beyond normal regular expressions. This is because of the large number of balanced delimiters (e.g. parens and braces) and escapes (e.g. "<foo \"bar\"">"). To get it right you need to write a grammar.
Perl 5.10 added recursive descent matching to make writing grammars possible. It also added named capture groups to keep track of all those rules. Now you can write a recursive grammar with Perl 5.10 regexes.
It's still kinda clunky, so Regexp::Grammars adds some enhancements to make writing regex grammars much easier.
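For a feel of what the raw Perl 5.10 features look like without Regexp::Grammars, here is a small sketch that matches balanced parentheses using a named capture group and recursion into it. The pattern name `group` and the sample text are made up for illustration, and the pattern deliberately ignores quoted strings.

```perl
use strict;
use warnings;
use v5.10;

# A parenthesised group contains plain characters or further
# parenthesised groups. (?<group>...) names the capture and (?&group)
# recurses into it.
my $balanced_re = qr{
    (?<group>
        \(
        (?: [^()]++        # a run of non-paren characters
          | (?&group)      # or a nested, balanced group
        )*
        \)
    )
}x;

# Hypothetical input with nested parens and one stray closing paren.
my $text = 'funcCall(a, (b, (c)), d) trailing)';

my $group;
$group = $+{group} if $text =~ $balanced_re;
say $group // 'no match';
```

With one rule this is manageable, but a full grammar written this way quickly becomes hard to read, which is the gap Regexp::Grammars fills.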
Writing a grammar is about starting at some point and filling in the rules. Your program is a bunch of Statements. What's a Statement? An Assignment, or a FunctionCall followed by a ;. What's an Assignment? Variable = Expression. What are Variable and Expression? And so on...
use strict;
use warnings;
use v5.10;
use Regexp::Grammars;
my $parser = qr{
<[Statement]>*
<rule: Variable> \w+
<rule: FunctionName> \w+
<rule: Escape> \\ .
<rule: Unknown> .+?
<rule: String> \" (?: <Escape> | [^\"] )* \"
<rule: Ignore> \.\.\.?
<rule: Expression> <Variable> | <String> | <Ignore>
<rule: Assignment> <Variable> = <Expression>
<rule: Statement> (?: <Assignment> | <FunctionCall> | <Unknown> ); | <Ignore>
<rule: FunctionArguments> <[Expression]> (?: , <[Expression]> )*
<rule: FunctionCall> <FunctionName> \( <FunctionArguments>? \)
}x;
my $code = <<'END';
field = "test \" string";
alkjflkj;
type = INT;
funcCall(.., field, "escaped paren \)", ...);
...
text = "desc";
field = "test string 1";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 2";
field = "test string 2";
type = FLOAT;
funcCall(.., field, ...);
...
text = "desc 3";
END
$code =~ $parser;
This is far more robust than the quick and dirty regex. The inclusion of:
<rule: Escape> \\ .
<rule: String> \" (?: <Escape> | [^\"] )* \"
handles otherwise tricky edge cases like:
funcCall( "\"escaped paren \)\"" );
It all winds up in %/. Here's the first part.
$VAR1 = {
'Statement' => [
{
'Assignment' => {
'Variable' => 'field',
'Expression' => {
'String' => '"test string"',
'' => '"test string"'
},
'' => 'field = "test string"'
},
'' => 'field = "test string";'
},
...
Then you can loop through the Statement array looking for Assignments where the Variable matches field.
my $seen_field_assignment = 0;
for my $statement (@{$/{Statement}}) {
# Check if we saw 'field = ...'
my $variable = ($statement->{Assignment}{Variable} || '');
$seen_field_assignment++ if $variable eq 'field';
# Bail out if we saw the second field assignment
last if $seen_field_assignment > 1;
# Print if we saw a field assignment
print $statement->{''} if $seen_field_assignment;
}
This might seem like a lot of work, but it's worth learning how to write grammars. There are a lot of problems that can be half-solved with regexes but fully solved with a simple grammar. In the long run, the regex will get more and more complicated and never quite cover all the edge cases, while a grammar is easier to understand and can be made perfect.
The downside of this approach is that your grammar might not be complete and might trip up on input it doesn't cover, though the Unknown rule will take care of most of that.