
I'm using the regex <@(.+?)@> to match patterns such as:

<@set:template default.spt @>

It works fine, but I've run into situations where I needed to nest the pattern, such as this:

<@set:template <@get:oldtemplate @> @>

Instead of getting the parent pair (<@ and @>) I get the following:

<@set:template <@get:oldtemplate @>

I don't want it to get the child one; I just want the outermost parent in all nested situations. How do I fix my regex so that it will do this for me? I figure I could do it if I knew how to require that every <@ inside the parent has a matching @>, but I have no idea how to enforce that.

Casimir et Hippolyte
Freesnöw

2 Answers


What you describe is a "non-regular language". It cannot be parsed with a regexp.

OK, if you are willing to put a limit on the nesting level, technically you can do it with a regexp. But it will be ugly.

Here is how to parse your tags for a few increasing maximum nesting depths, provided you can forbid @ characters inside your tags:

no nesting: <@[^@]+@>
up to 1:    <@[^@]+(<@[^@]+@>)?[^@]*@>
up to 2:    <@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>
up to 3:    <@[^@]+(<@[^@]+(<@[^@]+(<@[^@]+@>)?[^@]*@>)?[^@]*@>)?[^@]*@>
...
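
The patterns in the list above follow a mechanical rule: each level wraps the previous pattern as an optional nested tag. As a minimal sketch (the function name `bounded_nesting_pattern` is made up for illustration, and it keeps the assumption that no @ appears inside a tag), the pattern for any fixed depth could be generated in Python like this:

```python
import re

def bounded_nesting_pattern(max_depth):
    """Build the pattern for tags nested at most max_depth levels deep.

    Hypothetical helper: it assumes, as the list above does, that a lone
    '@' never appears inside a tag.
    """
    pattern = r"<@[^@]+@>"  # depth 0: no nested tag allowed
    for _ in range(max_depth):
        # Wrap the previous level as one optional nested tag.
        pattern = r"<@[^@]+(" + pattern + r")?[^@]*@>"
    return pattern

text = "<@set:template <@get:oldtemplate @> @>"
print(re.search(bounded_nesting_pattern(0), text).group(0))  # <@get:oldtemplate @>
print(re.search(bounded_nesting_pattern(1), text).group(0) == text)  # True
```

Note that the depth-0 pattern only finds the inner tag, while the depth-1 pattern finds the whole outer tag. Input nested deeper than `max_depth` will not match at its outermost level.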

If you cannot forbid lone @'s in your tags, you will have to replace every instance of [^@] with something like this: (?:[^<@]|<[^@]|@[^>]).

Just think about that, and then think about extending your regex to handle nesting up to 10 levels deep.

Here, I will do it for you:

<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[
^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<
[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@[^>])+(<@(?:[^<@]|<[^@]|@
[^>])+(<@(?:[^<@]|<[^@]|@[^>])+@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>]
)*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@
>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>)?
(?:[^<@]|<[^@]|@[^>])*@>)?(?:[^<@]|<[^@]|@[^>])*@>

What I hope my answer shows is that regexps are not the right tool for parsing a nested language. A traditional lexer (tokenizer) and parser combination will do a much better job, be significantly faster, and will handle arbitrarily deep nesting.
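
To give a rough idea of the alternative, here is a minimal sketch in Python. It is not a full lexer and parser: a small regex only finds the `<@` and `@>` delimiters, and a depth counter does the matching. The name `outermost_tags` is hypothetical, and silently dropping unclosed tags and stray `@>` is an assumption you may want to change.

```python
import re

# The regex only locates the two delimiters; it does not try to pair them.
TOKEN = re.compile(r"<@|@>")

def outermost_tags(text):
    """Return every top-level <@ ... @> span, whatever its nesting depth."""
    spans = []
    depth = 0
    start = 0
    for m in TOKEN.finditer(text):
        if m.group() == "<@":
            if depth == 0:
                start = m.start()  # remember where the outermost tag opens
            depth += 1
        elif depth > 0:  # a closing @> that has a matching opener
            depth -= 1
            if depth == 0:
                spans.append(text[start:m.end()])
        # A stray @> outside any tag is ignored (assumption).
    return spans

print(outermost_tags("a <@set:template <@get:oldtemplate @> @> b <@x@>"))
# ['<@set:template <@get:oldtemplate @> @>', '<@x@>']
```

Because the counter has no upper bound, this handles any nesting depth in a single pass over the input, and a lone @ or > inside a tag needs no special treatment.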

Tobia
  • It is possible to allow `@` and `>` while not consuming the end tag with `(?:(?!@>).)*`. Gotta love the end result. – nhahtdh May 16 '13 at 20:56

I don't think you can do this with a regular expression; see the answer to this question, which asks a similar thing. Regexes aren't sufficiently powerful to deal with arbitrary levels of nesting. If you will only ever have 2 levels of nesting, it should be possible, but regexes may not be the best tool for the job.

Community
codebox