0

I need help with this regexp.

Using

/\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block\}/U

I get {block:Posts}abcdef{/block} from this:

<div>
 {block:Posts [a=1, b=2]}
  abcdef
 {/block}
</div>

But if my text is like this:

<div>
 {block:Posts [a=1, b=2]}
  {block:Text}
   abcdef
  {/block}
 {/block}
</div>

I get {block:Posts}{block:Text}abcdef{/block}, because the match ends at the first {/block} found in the text.

A simple way to avoid this is to close the block with {/block:Posts}, but how can I do that when the block name in the opening tag can be any of (Posts|Photos|Videos)? If I open the block with Photos, it must be closed with {/block:Photos}.

Using /\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block\:(Posts|Photos|Videos)\}/U of course doesn't help, because the two groups can match different names.

Can anyone please help me?

Thanks!!

P.S.
Is it possible, by modifying the regex above, to get the optional parameters a and b as an array?

AldoB

2 Answers

1

There might be a better overall solution to your problem, but you can use a backreference in this case, since (Posts|Photos|Videos) is already a capture group:

\{\/block:\1\}
Felix Kling
  • Is it possible to get the optional parameters `[a=1, b=2]` as an array? array('a'=>1, 'b'=>2) – AldoB Nov 05 '11 at 11:04
  • So your language is PHP I assume? I think you have to extract that text and process it later on. – Felix Kling Nov 05 '11 at 11:07
  • Yes it is. So you recommend to process it later? Assuming I get the params in plain text like this `[a=1, b=2]`, would you use an explode (or stuff like that) or another regex? – AldoB Nov 05 '11 at 11:13
  • ah.. you wrote "There might be an overall better solution".. can you suggest that? – AldoB Nov 05 '11 at 11:22
  • No, otherwise I would have ;) I was just thinking about whether your expression could be simplified or whether there is a library which lets you define your own template language (and parse it). – Felix Kling Nov 05 '11 at 11:24
  • I've just realized that if my code is `{block:Posts}PLAINTEXT{/block:Posts}` the regex works fine, but if there's HTML inside the block `{block:Posts}
    some text inside HTML tag
    {/block:Posts}` the regex doesn't work anymore...! How can I fix that?
    – AldoB Nov 07 '11 at 17:09
  • You should probably change this part `(\s?[^\"]+\s?)` into `(.*?)` or `([^{]*)`. The `[^\"]` in your original expression excludes double quotes, so it cannot match HTML attributes. – Felix Kling Nov 07 '11 at 17:15
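
The parameter question from the comments above (turning `[a=1, b=2]` into an array) can be handled as a second pass over the captured text. Below is a minimal sketch in Python rather than the asker's PHP, purely for illustration; `PARAM` and `parse_params` are hypothetical names, and it assumes simple `key=value` pairs separated by commas with no nesting or quoting. In PHP the same idea would presumably use `preg_match_all` or `explode`.

```python
import re

# One key=value pair: a word, '=', then everything up to the next ',' or ']'.
PARAM = re.compile(r'(\w+)\s*=\s*([^,\]]+)')


def parse_params(raw):
    """Turn text such as ' [a=1, b=2]' into {'a': '1', 'b': '2'}.

    `raw` is the text captured by the optional parameter group; it may be
    None or empty when the block has no parameters. Values stay strings.
    """
    if not raw:
        return {}
    return {key: value.strip() for key, value in PARAM.findall(raw)}


params = parse_params(' [a=1, b=2]')
```

The values come back as strings; converting them to numbers is left to the caller, since the thread does not say which types the parameters may have.
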
1

You can do this using a backreference:

\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}(\s?[^\"]+\s?)\{\/block:\1\}

Note the added backreference \1 at the end. The backreference will match whatever was matched by the first group, i.e. the first pair of parentheses, in our case (Posts|Photos|Videos).
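
A minimal sketch of the backreference at work, written in Python rather than the asker's PHP purely for illustration. Python's `re` has no `/U` modifier, so laziness is spelled out with `+?`; the name `BLOCK` and the sample text are assumptions made for this example.

```python
import re

# Same structure as the pattern above; \1 forces the closing tag to repeat
# whichever name the first group matched.
BLOCK = re.compile(
    r'\{block:(Posts|Photos|Videos)(\s\[.*?\])?\}'
    r'(\s?[^"]+?\s?)'
    r'\{/block:\1\}'
)

text = (
    "<div>\n"
    " {block:Posts [a=1, b=2]}\n"
    "  {block:Text}\n"
    "   abcdef\n"
    "  {/block:Text}\n"
    " {/block:Posts}\n"
    "</div>"
)

match = BLOCK.search(text)
name = match.group(1)    # block name from the opening tag
params = match.group(2)  # raw parameter text, or None if absent
body = match.group(3)    # everything up to the matching {/block:Posts}

# A closing tag with a different name is rejected.
mismatch = BLOCK.search("{block:Photos}x{/block:Videos}")
```

Here the inner `{/block:Text}` no longer ends the match, because `\1` only accepts `Posts` after `{/block:`.
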

Note however that in general regular expressions are too limited to parse languages like HTML, as explained by this post. Languages which require counting of opening entities (like brackets or tags) and then matching the exact number of closing entities can't be expressed using regular expressions. Another example of a language that isn't regular for this reason is the language of arithmetic expressions with parentheses, or a language composed of strings of the form aa...abb...b with the same number of a and b. The general proof of this fact uses the Pumping Lemma.

Note also that regular expressions as used in software tools are usually a bit more powerful than bare mathematical regular expressions, because these tools provide a number of additions beyond the basic operations of union, concatenation and Kleene star. Backreferences themselves constitute a major enhancement of regular expressions and allow one to express languages that are not considered regular in the mathematical sense. This is why your problem has a solution at all. Counting of opening and closing entities is still impossible, though.
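
The counting limitation can be seen with a small experiment. This is a sketch in Python (not the asker's PHP), with a simplified pattern that uses `(.*?)` for the body as suggested in the comments; the names `LAZY`, `GREEDY` and the sample strings are assumptions for illustration. When blocks of the same name are nested, the lazy form stops too early, and when they sit side by side, the greedy form runs too far.

```python
import re

# Simplified patterns: name, body, closing tag tied to the name by \1.
LAZY = re.compile(r'\{block:(Posts|Photos|Videos)\}(.*?)\{/block:\1\}', re.DOTALL)
GREEDY = re.compile(r'\{block:(Posts|Photos|Videos)\}(.*)\{/block:\1\}', re.DOTALL)

nested = "{block:Posts}{block:Posts}x{/block:Posts}{/block:Posts}"
siblings = "{block:Posts}a{/block:Posts}{block:Posts}b{/block:Posts}"

# Lazy matching pairs the outer opening tag with the inner closing tag.
lazy_on_nested = LAZY.search(nested).group(0)

# Greedy matching swallows both sibling blocks as if they were one.
greedy_on_siblings = GREEDY.search(siblings).group(2)
```

Neither quantifier keeps track of how many blocks are currently open, which is the part a backreference cannot supply.
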

Adam Zalcman
  • What I'm trying to build is a Tumblr-like theme parser that lets users render certain content (such as news, photos, videos, ...) the way they like, using {block}. Reading the post and the link you suggested really taught me a lot, and of course I know that writing a fast and reliable code parser is not easy at all. At the moment regex is the best I can manage, but I'll work to improve that! Can you suggest some documentation that would help me do that? Thanks for your help! – AldoB Nov 05 '11 at 12:23
  • If you want to parse a simple language with opening and closing entities (here: {block:...} and {/block}) you may find that a simple *recursive descent parser* is a good solution. These are more powerful than regular expressions, but still very easy to program (example code is in this article: http://en.wikipedia.org/wiki/Recursive_descent_parser). If your needs go beyond that you may want to use a compiler-compiler like yacc or bison to generate a parser for you, but most of them will require you to come up with a formal grammar for the language you need to parse. – Adam Zalcman Nov 05 '11 at 13:10
  • Also, you can implement a recursive descent parser in any language of your choice, while compiler-compilers impose some limitations. Before you decide, you should however ensure that RDP is powerful enough to parse what you need. – Adam Zalcman Nov 05 '11 at 13:13
  • I've just realized that if my code is `{block:Posts}PLAINTEXT{/block:Posts}` the regex works fine, but if there's HTML inside the block `{block:Posts}
    some text inside HTML tag
    {/block:Posts}` the regex doesn't work anymore...! How can I fix that?
    – AldoB Nov 07 '11 at 17:14
  • Your regexp disallows double-quotes (") between {block...} and {/block...}: (\s?[^\"]+\s?). – Adam Zalcman Nov 07 '11 at 19:50
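
The recursive descent approach suggested in the comments above could look roughly like the following. This is a minimal sketch in Python (not the asker's PHP), under several assumptions: block names are plain words, parameters are kept as raw text, a closing tag may be either `{/block}` or `{/block:Name}`, and the names `TAG`, `parse` and `_parse_nodes` are hypothetical. A regex is used only to find the next tag; the nesting itself is tracked by recursion.

```python
import re

# Finds the next opening tag (groups 1 and 2) or closing tag (group 3).
TAG = re.compile(r'\{block:(\w+)(?:\s*\[(.*?)\])?\}|\{/block(?::(\w+))?\}')


def parse(text):
    """Return a list of nodes: plain strings and dicts describing blocks."""
    nodes, _ = _parse_nodes(text, 0, None)
    return nodes


def _parse_nodes(text, pos, open_name):
    """Parse from `pos` until the closing tag of `open_name`.

    At the top level `open_name` is None and parsing runs to the end of
    the input. Returns (nodes, position just after the consumed input).
    """
    nodes = []
    while True:
        m = TAG.search(text, pos)
        if m is None:
            if open_name is not None:
                raise ValueError(f"unclosed block {open_name!r}")
            if pos < len(text):
                nodes.append(text[pos:])
            return nodes, len(text)
        if m.start() > pos:
            # Literal text (HTML included) before the tag is kept as-is.
            nodes.append(text[pos:m.start()])
        if m.group(0).startswith("{/"):
            close_name = m.group(3)
            if open_name is None:
                raise ValueError("closing tag without an open block")
            if close_name is not None and close_name != open_name:
                raise ValueError(
                    f"expected {{/block:{open_name}}}, got {m.group(0)}"
                )
            return nodes, m.end()
        # Opening tag: parse its children recursively, then continue.
        name, raw_params = m.group(1), m.group(2)
        children, pos = _parse_nodes(text, m.end(), name)
        nodes.append({"name": name, "params": raw_params, "children": children})


tree = parse(
    '{block:Posts [a=1, b=2]}<p class="x">{block:Text}abcdef{/block}</p>{/block}'
)
```

Because text between tags is copied verbatim, double quotes and other HTML inside a block need no special handling, and each `{/block}` closes the innermost open block.
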