Matching double char delimited strings with regular expressions

Question

Suppose you want to match text that is delimited by double characters like so:

a = <<
Hello
World!
>>

The regular expression /<<(.*)>>/ would seem to do it, but unfortunately when these can be repeated greedy matching gets too much:

a = <<
Hello
World!
>>

b = <<
Goodbye
World!
>>

The previous regexp will capture

Hello
World!
>>

b = <<
Goodbye
World!

The obvious answer is to make the regexp non-greedy: /<<(.*?)>>/

Unfortunately this has extreme performance problems for long strings (in Perl at least). If the delimiters were single characters, then we could use a character class (everything but the character) to solve the greedy problem.

Any ideas on a regular expression to make this match without the performance penalty?

NB: I have to use Perl and this has to be a regular expression because of the larger system it's embedded in.

Thanks.

Phil - could you please elaborate on "has to be a regular expression because of the larger system it's embedded in" part? This seems like a job better suited to a parser (such as Text::Balanced) than a RegEx. — DVK, Jul 12 '10 at 20:37
Try looking at the second answer to this question: http://stackoverflow.com/questions/2975950/is-it-better-to-use-a-non-greedy-qualifier-or-a-lookahead — Josiah, Jul 12 '10 at 20:37
Your assertion that making a regex non-greedy "has extreme performance problems" seemed counter-intuitive to me. I would have thought a non-greedy should be faster since it can stop at the first match whereas the greedy version must carry on and then potentially backtrack. So I did some tests with various strings lengths and regexes and never saw the performance problem you refer to. Are you using an anchored match in your regex? — Grant McLean, Jul 12 '10 at 21:12
DVK, it's a small part of a Parse::RecDescent based parser. I need to preserve the white space, etc. inside the delimiters. Specifically, I define HTML: /<<.*?>>/ as a production and then use it in various places. — Phil Windley, Jul 12 '10 at 21:43
Are you trying to invent your own template format? There are already **many** excellent template modules on CPAN: I would recommend you use one. — Sinan Ünür, Jul 12 '10 at 22:01
@Josiah: second in which order? You should link directly (there is a "link" link at each answer). — Svante, Jul 12 '10 at 22:07
My apologies. Here's the permalink: http://stackoverflow.com/questions/2975950/is-it-better-to-use-a-non-greedy-qualifier-or-a-lookahead/2976353#2976353 — Josiah, Jul 12 '10 at 22:21
Thanks for the many answers to this question. Avoiding the obvious non-greedy construction and generalized lookahead with a specific alternation 'ala Hobb's fix of Drewk's answer or Moore's answer (if you're using > Perl 5.10 seem to work the best. — Phil Windley, Jul 13 '10 at 20:00
Either hobbs or Alan Moore answers fit your stated problem and it is considered good form to choose one as the answer... — dawg, Jul 14 '10 at 01:26

hobbs · Answer 1 · 2010-07-12T23:21:34.877

4

Expanding drewk's answer so it actually works:

/<<((?:(?>[^>]+)|>(?!>))*)>>/

Match "<<", then a sequence of 0 or more chunks which are either any number of non-">" characters, or a single ">" not followed by another ">", then finally ">>".

edited Jul 12 '10 at 23:21

answered Jul 12 '10 at 22:00

hobbs

223,387
19
210
288

+1: Yes that works, but almost the same number of probes as `<<(.*?)>>`, no? At least on my regex sim... – dawg Jul 12 '10 at 22:19
If that's the case, then I'm pretty sure they're both as efficient as it gets, because this is really a minimal representation of the state machine for matching `<<`..`>>`-delimited stuff. – hobbs Jul 12 '10 at 23:13
@drewk I made a change that should reduce the amount of backtracking that the pattern is allowed to make -- does that help any? – hobbs Jul 12 '10 at 23:22

score 3 · Answer 2 · answered Jul 13 '10 at 08:12

Are you using Perl 5.10? Try this:

/<<([^>]*+(?:>(?!>)[^>]*+)*+)>>/

Like the regex @hobbs posted, this one performs lookahead only after it finds a > (as opposed to the non-greedy quantifier, which effectively does a lookahead at every position). But this one uses Friedl's "unrolled loop" technique, which should be slightly faster than the alternation approach. Also, all quantifiers are possessive, so it doesn't bother saving the state information that would make backtracking possible.

+1: Sweet regex! It handles ever case I can think of and it as fast as `/<<(.*)>>/` Should be the answer IMHO. — dawg, Jul 13 '10 at 15:45

score 2 · Answer 3 · answered Jul 12 '10 at 21:14

2

Using a negated character class in this case will work:

/<<([^>]*)>>/ is the same probe count as /<<(.*)>>/ so should be just as fast with less backtracking as /<<(.*?)>>/

I do agree with DVK however; is a regex the only way?

answered Jul 12 '10 at 21:14

dawg

98,345
23
131
206

Except that > can occur inside the <<...>> delimiters: a = << hey >> – Phil Windley Jul 12 '10 at 21:46
Ahmm, that was not specified in your post. Are your parsing HTML with a regex? Please look at http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 for some discussion on why this is not a great idea. – dawg Jul 12 '10 at 22:24
No, not trying to parse HTML. I understand the dangers there. Just trying to consume everything between the delimiters (including white space). Sorry not to have specified the need for > inside the delimiters. – Phil Windley Jul 12 '10 at 22:41

score 1 · Answer 4 · answered Jul 12 '10 at 20:42

1

Please see if the performance of a dedicated parser (such as Text::Balanced) would be acceptable in this case. It's not regex, but without more details on your "NB" poststcriptum it sounds like you might have an XY problem when looking for a regex-only solution.

If you absolutely must use a regex, please look at using a look-ahead functionality - it may improve the speed.

answered Jul 12 '10 at 20:42

DVK

126,886
32
213
327

When you say "look-ahead functionality" can you elaborate? I think that the capturing group would have the greediness quantifier and the lookahead does not fix this, no? ie, any variant of positive lookahead, look behind, will still need to limit the scope of the match. `/<<(.*)(?=>>)/` will still have the same problem of matching to the final delimiter. The only easy solution I see is a negative character class on the first character of the closing delimiter. `<<([^>]*)>>` is as efficient as `<<(.*)>>` in terms of total probe count and gives the right answer. – dawg Jul 12 '10 at 21:48
Text::Balanced is cool but I don't think it really fits this problem, both because OP didn't ask for nesting and because Text::Balanced is really meant to deal with single-char delimiters and constructs like backslashing. – hobbs Jul 12 '10 at 22:06
BTW, I tried lookahead and it works, but with identical performance to the non-greedy regexp specification. – Phil Windley Jul 12 '10 at 22:42

score 1 · Answer 5 · answered Jul 12 '10 at 22:43

Say you have a simple grammar

my $p = Parse::RecDescent->new(<<'EOGrammar');
  program: assignment(s)

  assignment: id '=' '<<' angle_text '>>'
              { $return = [ $item{id}, $item{angle_text} ] }

  angle_text: <skip:undef> / ( [^>] | >(?!>) )* /x

  id: /\w+/
EOGrammar

and a source text of

a = <<
Hello

World!

>>

b = <<


Goodbye
World!
>>

When you process the result with

for (@{ $p->program($text) }) {
  my($name,$what) = @$_;
  print "$name: [[[$what]]]\n";
}

you'll see output of

a: [[[
Hello

World!

]]]
b: [[[


Goodbye
World!
]]]

Matching double char delimited strings with regular expressions

5 Answers5