
Regex: http://regex101.com/r/uO6jQ3/1

String:

[quote]something [quote]something else[/quote] some text here[/quote]

What the current regex matches:

$matches[6][0]: [quote]something [quote]something else[/quote]

What it should match:

$matches[6][0]: [quote]something [quote]something else[/quote] some text here[/quote]

$matches[6][1]: [quote]something else[/quote]

Wellenbrecher
  • Do not parse HTML/BBCode/XML with regex. It's forbidden. – hsz Aug 20 '14 at 18:21
  • I'm sorry, regex doesn't work that way. – Unihedron Aug 20 '14 at 18:22
  • @hsz: not forbidden, but definitely "unable to handle all possible bbcode markup combinations". – Marc B Aug 20 '14 at 18:25
  • Be careful trying to parse nested tags with regex. It can summon [unholy ponies](http://stackoverflow.com/a/1732454/2370483) – Machavity Aug 20 '14 at 18:28
  • Even if you __do__ write a regex, it would not give you such desired output! As the nesting is __dynamic__, matches will not come out as groups, but instead as a single-liner. – Unihedron Aug 20 '14 at 18:59
  • What are you trying to do, recursively parse nested `[quote][/quote]` taking off the content along the way, building a hash tree or? Checking for errors at the same time? Sure that can be done with regex. –  Aug 20 '14 at 20:36

3 Answers


To match a nested structure, you need a recursive pattern. For example:

$data = '[quote]something [quote]something else[/quote] some text here[/quote]';

$pattern = '~\[quote](?>[^][]+|(?R))*\[/quote]~';

if (preg_match_all($pattern, $data, $m))
    print_r($m);

Pattern details:

~           # pattern delimiter: do not choose the slash here
\[quote]    # literal opening tag
(?>         # open an atomic group: possible content between tags
    [^][]+  # all that is not a square bracket
  |         # OR
    (?R)    # recurse the whole pattern
)*          # close the atomic group, repeat zero or more times
\[/quote]   # literal closing tag
~
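
As a quick check against the string from the question: `preg_match_all` returns non-overlapping matches, so the recursive pattern on its own reports only the outermost quote. One possible way to also get the inner quote, assuming that is the desired output, is to wrap the pattern in a capture group inside a lookahead and recurse into that group with `(?1)`. This is a sketch, not part of the original answer, and the helper name `find_quotes` is hypothetical.

```php
<?php
// Sketch: collect every [quote]...[/quote] block, including nested ones.
// find_quotes() is a hypothetical helper name, not part of any library.
function find_quotes(string $data): array
{
    // The lookahead consumes nothing, so the engine retries at every
    // position and can also report blocks that sit inside other blocks.
    // (?1) recurses into capture group 1 instead of the whole pattern.
    $pattern = '~(?=(\[quote](?>[^][]+|(?1))*\[/quote]))~';

    if (preg_match_all($pattern, $data, $m)) {
        return $m[1];   // group 1 holds the text seen by each lookahead
    }
    return [];
}

$data = '[quote]something [quote]something else[/quote] some text here[/quote]';
print_r(find_quotes($data));
```

For the question's string this should return the full outer block first and `[quote]something else[/quote]` second.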

The pattern itself is fairly simple. If the text may contain other, unrelated tags between the "quote" tags, you only need to change the atomic group to allow them (written in extended mode). The lookahead sits before the optional slash so that `[/quote]` cannot be consumed as an ordinary tag:

(?> [^][]+ | \[ (?!/?quote\b) [^]]* ] | (?R) )*
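
To make the modified atomic group concrete, here is a minimal sketch that places it in the full pattern and enables the `x` modifier. The sample string and its `[b]` tag are made up for illustration, and the variable names are assumptions.

```php
<?php
// Sketch: the recursive pattern with an atomic group that tolerates other tags.
// The x modifier allows whitespace and # comments inside the pattern.
$pattern = '~
    \[quote]
    (?>
        [^][]+                     # plain text without brackets
      | \[ (?!/?quote\b) [^]]* ]   # any tag that is not [quote] or [/quote]
      | (?R)                       # a nested quote block
    )*
    \[/quote]
~x';

// Made-up sample input containing a non-quote tag.
$data = 'intro [quote]a [b]bold[/b] [quote]inner[/quote] z[/quote] outro';

// $result holds the outermost quote block, or null when nothing matches.
$result = preg_match($pattern, $data, $m) ? $m[0] : null;
echo $result, PHP_EOL;
```

Testing for `quote` before consuming the optional slash keeps a closing `[/quote]` from being matched by the "other tag" branch.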
Casimir et Hippolyte

If you feel very ambitious, you could run the regex in a while search loop and build up a nested array of content. Each newly matched core needs to cause a reentrant call into the parse function that executes this regex.

 # //////////////////////////////////////////////////////
 # // The General Guide to 3-Part Recursive Parsing
 # // ----------------------------------------------
 # // Part 1. CONTENT
 # // Part 2. CORE
 # // Part 3. ERRORS

 (?is)

 (?:
      (                                  # (1), Take off CONTENT
           (?&content) 
      )
   |                                   # OR
      \[quote\]                          # Start-Delimiter
      (                                  # (2), Take off The CORE
           (?&core) 
        |  
      )
      \[/quote\]                         # End-Delimiter

   |                                   # OR
      (                                  # (3), Take off Unbalanced (delimiter) ERRORS
           \[/?quote\]
      )
 )

 # ///////////////////////
 # // Subroutines
 # // ---------------

 (?(DEFINE)

      # core
      (?<core>
           (?>
                (?&content) 
             |  
                \[quote\]
                # recurse core
                (?:
                     (?= . )
                     (?&core) 
                  |  
                )
                \[/quote\]
           )+
      )

      # content 
      (?<content>
           (?>
                (?!
                     \[/?quote\]
                )
                . 
           )+
      )

 )
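
The loop described above could be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not the answer author's code: the pattern is a condensed variant of the three-part regex, and the function name `parse_quotes` and the node keys `text`, `quote` and `error` are hypothetical.

```php
<?php
// Sketch: a while loop that walks the text and recurses into each quote core.
// parse_quotes() and the text / quote / error keys are hypothetical names.
function parse_quotes(string $text): array
{
    // Condensed three-part pattern: CONTENT, CORE, ERRORS.
    $pattern = '~
        (?:
            ( (?&content) )                       # (1) content
          | \[quote\] ( (?&core)? ) \[/quote\]    # (2) core of a balanced quote
          | ( \[/?quote\] )                       # (3) unbalanced delimiter
        )
        (?(DEFINE)
            (?<core>    (?> (?&content) | \[quote\] (?&core)? \[/quote\] )+ )
            (?<content> (?> (?!\[/?quote\]) . )+ )
        )
    ~xis';

    $nodes  = [];
    $offset = 0;
    $length = strlen($text);

    // Every position starts content, a balanced quote, or a stray tag,
    // so each match begins at $offset and is never empty.
    while ($offset < $length && preg_match($pattern, $text, $m, 0, $offset)) {
        if (($m[1] ?? '') !== '') {
            $nodes[] = ['text' => $m[1]];
        } elseif (($m[3] ?? '') !== '') {
            $nodes[] = ['error' => $m[3]];
        } else {
            // Reentrant call: parse the core of this quote the same way.
            $nodes[] = ['quote' => parse_quotes($m[2] ?? '')];
        }
        $offset += strlen($m[0]);
    }
    return $nodes;
}

$data = '[quote]something [quote]something else[/quote] some text here[/quote]';
print_r(parse_quotes($data));
```

Each quote becomes a node whose children are parsed by the same function, so the nesting depth of the array follows the nesting depth of the tags.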
  • There is something strange in your pattern, a pipe is alone after `(?&core)`, probably a typo. Or a way to make `(?&core)` optional? – Casimir et Hippolyte Aug 20 '14 at 21:19
  • Not a typo, it's required that the functions match something, this forces all the alternations to be hit. So we give the core a pass here because it and only it can be empty. –  Aug 20 '14 at 21:28
  • Btw, that `(?= . )` before the core function call is a Boost regex workaround, otherwise it throws `endless recursion`. I work with Boost a lot, needless to say I have quite a few workarounds. –  Aug 20 '14 at 21:35
  • I see, but, if I'm not wrong, you don't need to prevent endless recursion with `(?=.)` *(that is here to ensure that there is at least one character)* here because neither `(?&content)` nor `(?&core)` can match an empty string. *( `(?&content)` has at least one character, and `(?&core)` has either `(?&content)` or `[quote]...[/quote]`)* – Casimir et Hippolyte Aug 20 '14 at 21:46
  • You are right with other non-Boost engines, like Perl or PHP. The Boost guy John Maddock didn't take it any farther (i.e. looking downstream for the `|`). –  Aug 20 '14 at 21:51
  • You'll notice the `(?&core)` is called directly inside `(?)`, there is the problem. At that point his math won't let him continue. –  Aug 20 '14 at 21:53
  • If I understand correctly, we therefore have to reassure the engine's pre-analysis of the pattern. – Casimir et Hippolyte Aug 20 '14 at 22:01

You would have to build a tree structure. Check out STML Parser on CodeProject.
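
As a rough idea of what building a tree could look like without a recursive regex, here is a minimal sketch that tokenizes the text and tracks open quotes on a stack. It is not taken from STML Parser; the function name `build_quote_tree` and the node keys `text` and `quote` are hypothetical.

```php
<?php
// Sketch: build a tree from [quote] tags with a tokenizer and a stack.
// build_quote_tree() and the node layout are hypothetical, for illustration.
function build_quote_tree(string $text): array
{
    // Split on the tags but keep them as tokens; drop empty pieces.
    $tokens = preg_split('~(\[/?quote\])~i', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = [[]];   // $stack[0] collects the top-level nodes

    foreach ($tokens as $token) {
        $tag = strtolower($token);
        if ($tag === '[quote]') {
            $stack[] = [];   // open a new nesting level
        } elseif ($tag === '[/quote]') {
            if (count($stack) === 1) {
                throw new InvalidArgumentException('Unexpected [/quote]');
            }
            // Close the current level and attach it to its parent.
            $children = array_pop($stack);
            $stack[count($stack) - 1][] = ['quote' => $children];
        } else {
            $stack[count($stack) - 1][] = ['text' => $token];
        }
    }

    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Missing [/quote]');
    }
    return $stack[0];
}

$data = '[quote]something [quote]something else[/quote] some text here[/quote]';
print_r(build_quote_tree($data));
```

Unlike a single regex match, this approach reports unbalanced tags explicitly by throwing an exception.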

Tawani