I have a string that contains sequences delimited by multi-character delimiters: << and >>. I need a regular expression that returns only the innermost sequences, i.e. those that contain no nested << ... >> pair. I have tried lookaheads, but they don't work the way I expect them to.

Here is a test string:

'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>'

It should return:

but match this
this too
and <also> this

As the third result shows, I can't just use /<<[^>]+>>/, because the content may contain a single < or > character, just not two in a row.
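To make the failure concrete, here is a minimal sketch that runs the naive pattern over the test string. The capture group and the variable names `$string` and `@naive` are added here for illustration only and are not part of the original attempt.

```perl
use strict;
use warnings;

my $string = 'do not match this <<but match this>> not this '
           . '<<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';

# [^>]+ happily runs across a nested "<<" and gives up on any lone ">".
my @naive = $string =~ /<<([^>]+)>>/g;

print "[$_]\n" for @naive;
# [but match this]
# [BUT NOT THIS <<this too]
```

The second result swallows the start of the outer sequence, and the third expected result is lost entirely because of the lone > after "also".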

I'm fresh out of trial-and-error. Seems to me this shouldn't be this complicated.

Svante
amphetamachine
  • Theoretically, this kind of problem (stack-ish, depending on some intermediate state, etc.) needs a more expressive grammar than a regular one. – miku Aug 09 '11 at 02:20
  • Regular expressions can only parse regular grammars. This is not a regular grammar. – cdhowie Aug 09 '11 at 02:21
  • Perl's regular expressions aren't at all regular, and can parse this just fine. – brian d foy Aug 09 '11 at 09:14
  • @miku, @cdhowie: Since he wants the inner brackets and not the outer ones, there actually is a regular grammar for it. `/<<(?:[^<>]+|<[^<]|>[^>])*>>/` – ikegami Aug 09 '11 at 12:43
  • @ikegami: it's more complicated than that; that doesn't match all of `<<<>>` – ysth Aug 09 '11 at 15:40
  • @ysth, Indeed, though still possible to do using a regular grammar. Just a lot more wordy. – ikegami Aug 09 '11 at 16:39
  • hmm, does it work if you just end `*?>>` ? no, that still leaves the > alternation potentially getting the < from a << incorrectly – ysth Aug 09 '11 at 17:59

3 Answers

@matches = $string =~ /(<<(?:(?!<<|>>).)*>>)/g;

(?:(?!PAT).)* is to patterns as [^CHAR]* is to characters.
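As a rough sketch of that idiom applied to the test string from the question: the variant below moves the capture group inside the delimiters so that only the content is returned. That placement, and the names `$string` and `@inner`, are choices made for this illustration rather than part of the one-liner above.

```perl
use strict;
use warnings;

my $string = 'do not match this <<but match this>> not this '
           . '<<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';

# (?:(?!<<|>>).)* consumes one character at a time, but only while
# neither delimiter starts at the current position.
my @inner = $string =~ /<<((?:(?!<<|>>).)*)>>/g;

print "$_\n" for @inner;
# but match this
# this too
# and <also> this
```

Note that . does not match a newline by default, so input where a sequence spans several lines would need the /s modifier.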

ikegami

$string = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
@matches = $string =~ /(<<(?:[^<>]+|<(?!<)|>(?!>))*>>)/g;
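For readers who find the one-line pattern dense, here is the same idea spelled out with the /x modifier. This is only a sketch: the name `$innermost` is hypothetical, and the capture group has been moved inside the delimiters so that just the content comes back.

```perl
use strict;
use warnings;

my $string = 'do not match this <<but match this>> not this '
           . '<<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';

my $innermost = qr{
    <<                  # opening delimiter
    (                   # capture only the content
        (?:
            [^<>]+      # a run of ordinary characters
          | < (?!<)     # a lone "<" that does not begin "<<"
          | > (?!>)     # a lone ">" that does not begin ">>"
        )*
    )
    >>                  # closing delimiter
}x;

my @inner = $string =~ /$innermost/g;

print "$_\n" for @inner;
# but match this
# this too
# and <also> this
```

Matching whole runs with [^<>]+ means the lookaheads are only evaluated at angle brackets instead of at every character.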
ysth

Here's a way to use split for the job:

my $str = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
my @a = split /(?=<<)/, $str;
@a = map { split /(?<=>>)/, $_ } @a;

my @match = grep { /^<<.*?>>$/ } @a;

This keeps the << and >> delimiters in the results. If you want them removed, just do:

@match = map { s/^<<//; s/>>$//; $_ } @match;
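One thing to be aware of: because map aliases $_ to each element, the s/// calls above also modify the elements of the input list in place. If that matters, a possible alternative (assuming Perl 5.14 or later, where the /r modifier is available) is sketched below. The sample list and the name `@inner` are made up for the illustration.

```perl
use strict;
use warnings;

# Sample input: what the grep step above yields for the test string.
my @match = ('<<but match this>>', '<<this too>>', '<<and <also> this>>');

# With /r, s/// returns the modified copy and leaves the original untouched.
my @inner = map { s/^<<|>>$//gr } @match;

print "$_\n" for @inner;
# but match this
# this too
# and <also> this
```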
TLP