regex pattern extract quotes

Question

Switching the code of the debate forum on my website, I am going to change the way quotes are stored in the database. Now I need to come up with a regex to rearrange already submitted posts in my database.

Following is an example of how my current debate posts are stored in the database (with quotes inside quotes). Note: I have indented it for the sake of illustration:

Just citing a post
[quote]Text of quote #3
       [quote]Text of quote #2
              [quote]Text of quote #1
                     [name]User 1[/name]
              [/quote]
              [name]User 2[/name]
       [/quote]
       [name]User 3[/name]
[/quote]

What I would like is for the above to be rearranged to look like this:

Just citing a post
[quote:User 3]
      Text of quote #3
      [quote:User 2]
           Text of quote #2
           [quote:User 1]
                 Text of quote #1
           [/quote]  
      [/quote]   
[/quote]

Can any of you point me in the direction of how this can be done with regex? I am using PHP.

Thanks in advance, I appreciate all your help :)

Fischer

Are you planning to attach arbitrary user numbers to the existing quotes? I can see moving forward being easier, but taking existing data (without performing some kind of look-up) will be difficult. — Brad Christie, May 27 '11 at 16:30
The "User 1", "User 2", "User 3" is just for illustration purposes. In practice it will be the users username. Something like: [quote:Alex].. :) — fischer, May 27 '11 at 16:33
Have you checked the bbcode extension? http://hk.php.net/manual/en/book.bbcode.php — kennytm, May 27 '11 at 16:33
Understood. But if you're looking to re-format existing (stored) text, what flags `[code]` in a post as being `[code:foobar]`, for instance? — Brad Christie, May 27 '11 at 16:34
This is really easy to do. If I give you an answer in Perl, would you be able to convert it to PHP yourself? The regexes will be exactly the same, but I don’t have PHP installed to test it with and am unexpert at PHP procedural code. — tchrist, May 27 '11 at 16:44

joelhardi · Answer 1 · 2011-05-27T18:15:16.343

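This function will do the job. It repeatedly reformats from the inner-most quotation to the outer-most:

function reformat($str) {
  // Match only a [quote] whose body holds no other unconverted [quote],
  // so each [name] is paired with its own quote, inner-most first.
  $pattern = '#\[quote\]((?:(?!\[quote\]).)*?)\[name\](.+?)\[/name\]\s*\[/quote\]#s';
  while (preg_match($pattern, $str, $matches)) {
    // $matches[1] is the quote body, $matches[2] the quoted user's name.
    $str = str_replace($matches[0],
                       '[quote:'.$matches[2].']'.$matches[1].'[/quote]',
                       $str);
  }
  return $str;
}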

In action:

$before = "Just citing a post
  [quote]Text of quote #3
    [quote]Text of quote #2
      [quote]Text of quote #1
        [name]User 1[/name]
      [/quote]
      [name]User 2[/name]
    [/quote]
    [name]User 3[/name]
  [/quote]";

echo reformat($before);

Outputs:

Just citing a post
  [quote:User 3]Text of quote #3
    [quote:User 2]Text of quote #2
      [quote:User 1]Text of quote #1
        [/quote]
      [/quote]
    [/quote]

Thanks for the tip ... fixed it! — joelhardi, May 27 '11 at 18:11

score 1 · Accepted Answer · answered May 27 '11 at 17:11

This will do it:

$input = "Just citing a post
[quote]Text of quote #3
       [quote]Text of quote #2
              [quote]Text of quote #1
                     [name]User 1[/name]
              [/quote]
              [name]User 2[/name]
       [/quote]
       [name]User 3[/name]
[/quote]";

function fix_quotes($string) {
    $regexp = '`(\s*)\[quote\]((?:[^\[]|\[(?!quote\]))*?)\[name\](.*?)\[\/name\]\s*\[\/quote\]`';
    while (preg_match($regexp, $string)) {
        $string = preg_replace_callback($regexp, function($match) {
            return $match[1] . '[quote:' . $match[3] . ']' . trim(fix_quotes($match[2])) . $match[1] . '[/quote]';
        }, $string);
    }
    return $string;
}

echo fix_quotes($input);

Results in:

Just citing a post
[quote:User 3]Text of quote #3
       [quote:User 2]Text of quote #2
              [quote:User 1]Text of quote #1
              [/quote]
       [/quote]
[/quote]

Edit: I hadn't seen that joelhardi already posted a similar solution, and his looks a bit cleaner, so I'd stick with his solution :)

I wouldn't... His has a greediness problem. A cursory look makes it seem that yours will cope with `fix_quotes($input . ' ' . $input);`. — Denis de Bernardy, May 27 '11 at 17:58
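
The check suggested in this comment can be run directly. What follows is a minimal sketch, with `fix_quotes` copied unchanged from the answer; the compact one-line `$input` is an assumption made for illustration and is not taken from the thread:

```php
<?php
// fix_quotes() as posted in the answer above.
function fix_quotes($string) {
    $regexp = '`(\s*)\[quote\]((?:[^\[]|\[(?!quote\]))*?)\[name\](.*?)\[\/name\]\s*\[\/quote\]`';
    while (preg_match($regexp, $string)) {
        $string = preg_replace_callback($regexp, function ($match) {
            return $match[1] . '[quote:' . $match[3] . ']' . trim(fix_quotes($match[2])) . $match[1] . '[/quote]';
        }, $string);
    }
    return $string;
}

// Hypothetical compact post: one quote nested inside another.
$input = '[quote]B [quote]A [name]User 1[/name][/quote] [name]User 2[/name][/quote]';

// Two such posts back to back, as in the comment.
$result = fix_quotes($input . ' ' . $input);
echo $result, "\n";
```

Because the body part of the pattern refuses to cross an unconverted `[quote]`, each `[name]` can only pair with the quote that directly contains it, so both copies should be converted independently. Whitespace around the closing tags may differ between the two copies.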

score 0 · Answer 3 · edited May 23 '17 at 12:12

Don't use a regex for this. What you're talking about is essentially a mutation of XML, and regex is not the right tool for parsing XML. What you need to do is write a parser.

However, what I would suggest is using actual XML instead. It already exists, it's standardized, the syntax is almost exactly the same, and there are already a ton of parsers for it. I'd start here:

PHP XML parser

Edit: Just to clarify how easily this could become valid XML:

<quote src="User 3">
      Text of quote #3
      <quote src="User 2">
           Text of quote #2
           <quote src="User 1">
                 Text of quote #1
           </quote>  
      </quote>   
</quote>
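
To give an idea of what reading such markup could look like in PHP, here is a rough sketch using the SimpleXML extension. It assumes the quotes are stored exactly in the shape above, as one well-formed root element; the sample string and the `collect_quotes` helper are hypothetical names made up for this illustration:

```php
<?php
// Hypothetical stored post, in the XML shape suggested above.
$xml = '<quote src="User 3">Text of quote #3
  <quote src="User 2">Text of quote #2
    <quote src="User 1">Text of quote #1</quote>
  </quote>
</quote>';

// Walk nested <quote> elements and collect "author: text" lines, outermost first.
function collect_quotes(SimpleXMLElement $quote, $depth = 0) {
    // Casting an element to string yields its own text, not its children's.
    $line = str_repeat('  ', $depth) . $quote['src'] . ': ' . trim((string) $quote);
    $lines = array($line);
    foreach ($quote->quote as $child) {
        $lines = array_merge($lines, collect_quotes($child, $depth + 1));
    }
    return $lines;
}

$root = simplexml_load_string($xml);
$lines = collect_quotes($root);
echo implode("\n", $lines), "\n";
```

Text outside the outermost quote (such as "Just citing a post") would need a wrapping root element before a post could be parsed this way.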

Justin Morgan - On strike · answered May 27 '11 at 16:33

“Should not” is true only in the general case; often in specific ones, it is perfectly reasonable to use regexes. **But “cannot” is simply overreaching,** given that [modern patterns](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579) are fully equivalent to recursive descent parsers — which are of course perfectly capable of parsing XML. – tchrist May 27 '11 at 16:40
@tchrist - Okay, that's fair, on both counts. Edited. Still, regex is definitely wrong for this IMO. – Justin Morgan - On strike May 27 '11 at 16:43
@tchrist - It's also a matter of debate whether recursion-enabled regex engines should be considered true regex. Academically speaking, they're not. That's beyond the scope of the discussion, but there's still the fact that writing a recursive descent parser will be way more of a headache than parsing this as XML. – Justin Morgan - On strike May 27 '11 at 16:48
@Justin, academically speaking, regular expressions don't even exist in modern programming languages since pretty much all of them implement back-references which make them non-regular. Note that the term "regex" is usually used to describe these modern-day regex engines, not the academic "regular expressions". – Bart Kiers May 27 '11 at 18:08
To quote Larry Wall: _"'Regular expressions' are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes""_ – Bart Kiers May 27 '11 at 18:08
@Bart - You're right, and the academic point was really just an aside. I doubt the OP cares about the academic definition anyway, just the best way to solve his problem (which I think is an XML parser). – Justin Morgan - On strike May 27 '11 at 18:42
I agree that using an XML parser is the better option here. I just felt compelled to make the point since you were classifying regex engines that had the ability to match recursive patterns as "non-regular", implying that other regex engines _are_ "regular". Or, better, I was led to believe you were implying it :) – Bart Kiers May 27 '11 at 18:51
Edit #2: *Regex is not the right tool for parsing **XML.*** Derp. – Justin Morgan - On strike May 27 '11 at 22:20

score 0 · Answer 4 · answered May 27 '11 at 16:37

Because of the complexity involved here (you're going to need conditionals, as well as "Match/Replace All" functionality), I would recommend not doing this in just Regex. Use a programming language with tight Regex functionality, and combine Regex with this language to do what you want. I recommend PHP.
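
One possible reading of this advice, sketched under assumptions: a regex splits the post into tags and text, and plain PHP keeps a stack of open quotes. The `convert_quotes` name is hypothetical, and the sketch assumes balanced tags with each `[name]` belonging to the quote that is open when it appears:

```php
<?php
// Hypothetical helper: converts old-style quotes with a tokenizer plus a stack.
function convert_quotes($post) {
    // Split into tags and plain text, keeping the tags as tokens.
    $tokens = preg_split('#(\[quote\]|\[/quote\]|\[name\].*?\[/name\])#s', $post, -1,
                         PREG_SPLIT_DELIM_CAPTURE);
    // Each frame collects one open quote; frame 0 is the post itself.
    $stack = array(array('text' => '', 'name' => ''));
    foreach ($tokens as $token) {
        $top = count($stack) - 1;
        if ($token === '[quote]') {
            $stack[] = array('text' => '', 'name' => '');
        } elseif ($token === '[/quote]' && $top > 0) {
            // Close the quote and append it, converted, to its parent.
            $frame = array_pop($stack);
            $stack[$top - 1]['text'] .= '[quote:' . $frame['name'] . ']' . $frame['text'] . '[/quote]';
        } elseif ($top > 0 && preg_match('#^\[name\](.*)\[/name\]$#s', $token, $m)) {
            // Remember the author of the quote that is currently open.
            $stack[$top]['name'] = $m[1];
        } else {
            $stack[$top]['text'] .= $token;
        }
    }
    if (count($stack) !== 1) {
        throw new RuntimeException('Unbalanced [quote] tags');
    }
    return $stack[0]['text'];
}

echo convert_quotes('Hi [quote]B [quote]A [name]User 1[/name][/quote] [name]User 2[/name][/quote]'), "\n";
```

Unlike the regex-only answers, this makes a single pass over the post and leaves whitespace inside the quotes untouched.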
