Regular expression can't handle rogue square brackets

Question

Thanks to the smarties on here in the past I have this amazing recursive regular expression that helps me to transform custom BBCode-style tags in a block of text.

/// <summary>
/// Static class containing common regular expression strings.
/// </summary>
public static class RegularExpressions
{
    /// <summary>
    /// Expression to find all root-level BBCode tags. Use this expression recursively to obtain nested tags.
    /// </summary>
    public static string BBCodeTags
    {
        get
        {
            return @"
                    (?>
                      \[ (?<tag>[^][/=\s]+) \s*
                      (?: = \s* (?<val>[^][]*) \s*)?
                      ]
                    )

                    (?<content>
                      (?>
                        \[(?<innertag>[^][/=\s]+)[^][]*]
                        |
                        \[/(?<-innertag>\k<innertag>)]
                        |
                        [^][]+
                      )*
                      (?(innertag)(?!))
                    )

                    \[/\k<tag>]
                    ";
        }
    }
}

This regex works beautifully, recursively matching on all tags. Like this:

[code]  
    some code  
    [b]some text [url=http://www.google.com]some link[/url][/b]  
[/code]

The regex does exactly what I want and matches the [code] tag. It breaks it up into three groups: tag, optional value, and content. Tag being the tag name ("code" in this case). Optional value being a value after the equals(=) sign if there is one. And content being everything between the opening and closing tag.

The regex can be used recursively to match nested tags. So after matching on [code] I would run it again against the content group and it would match the [b] tag. If I ran it again on the next content group it would then match the [url] tag.

All of that is wonderful and delicious but it hiccups on one issue. It can't handle rogue square brackets.

[code]This is a successful match.[/code]

[code]This is an [ unsuccessful match.[/code]

[code]This is also an [unsuccessful] match.[/code]

I really suck at regular expressions but if anyone knows how I might tweak this regex to correctly ignore rogue brackets (brackets that do not make up an opening tag and/or do not have a matching closing tag) so that it still matches the surrounding tags, I would be very appreciative :D

Thanks in advance!

Edit

If you are interested in seeing the method where I use this expression you are welcome to.

This is unmaintainable code... get rid of it and do it using imperative code logic instead. =) — Miguel Angelo, Oct 27 '11 at 21:08
Other than the rogue square brackets breaking the match it works perfectly. — CatDadCode, Oct 27 '11 at 21:10
@AlexFord Welcome to the part the regexes are not the tool for this job. Even if you fix this, how would you handle [[[[[]]]]]] etc? Sorry doesn't work. Use a lexer/parser combo instead. Perfect is something that accounts for all possible situations and this doesn't. — FailedDev, Oct 27 '11 at 21:14
If any of you have code samples demonstrating your preferred method I am open to it. This was the only way I could easily get this working and it has worked great so far. So unless I have a better solution shown to me I will likely stick with it. — CatDadCode, Oct 27 '11 at 21:16
I know it works... but once a regex is done, then only God knows what is means! Regexes are not maintainable, it is a **do-once-and-forget** thing. If you need to change it, it is a hell to understand it all over again. If you are planing on maintaining this code, I tell you, replace it with something that is **step-by-step-debugable**, so that you can easily fine tune it in the future, if you need. — Miguel Angelo, Oct 27 '11 at 21:16
@MiguelAngelo I attempted to manually parse this in the past but it seemed to completely spiral out of control and grow in complexity. Someone told me to do the opposite of what you're all suggesting and use this expression. Perhaps if you could show me a sample to point me in the right direction. — CatDadCode, Oct 27 '11 at 21:18
I really don't think I deserve a -1 for this question. Just because you don't prefer regular expressions does not mean there was something wrong with my question. I was very clear and took my time putting together this question. That is the same as thumbs downing a question because the author writes code in a language that's not you're favorite. If you believe something is wrong with my question then please comment and tell me what it is. — CatDadCode, Oct 27 '11 at 21:36
You seem to be trying to use the .Net ["Balancing Group Definitions" construct](http://msdn.microsoft.com/en-us/library/bs2twtah.aspx), and possibly because I am not very familiar with this particular construct, I don't understand what the `(?<-innertag>\k)` group is supposed to be, other than possibly "forget the innertag group, but don't store the match in any new group", which I don't see documented anywhere. This is a very obscure and hard to understand construct to begin with, much less with strange changes - most people will need more information to help you. — Code Jockey, Oct 27 '11 at 21:50
BTW, I think I might be able to help you, as this construct was added to the .Net Regex Implementation _specifically_ to handle parsing nested grammars such as HTML and BBCode, but I'll need more time, as I don't have a platform that is set up to handle testing and debugging of this construct (my emulator does not support it) - good luck until (or if I don't) come back with an answer! — Code Jockey, Oct 27 '11 at 21:53
@CodeJockey Unfortunately this was provided to me by someone else on SO. I will hunt down the question. Maybe there is more detail there. — CatDadCode, Oct 27 '11 at 21:53
@MiguelAngelo - Debugging is not the most important requirement in my opinion, unit testing is. The regex engine is a black box, much like any library you will be using. — Kobi, Oct 27 '11 at 21:55
http://stackoverflow.com/questions/7018321/help-fixing-a-bbcode-regular-expression there you go. That's the question where this regex was originally provided :) — CatDadCode, Oct 27 '11 at 21:55
@AlexFord: When I look at that answer, I see something horrible: A magic black box that seems to work perfectly, but that I don't have the capacity to understand or explain. Should I accept this dark gift, it's only a matter of time before it starts not behaving as I would want it to. And then it's poor me vs. the unholy mystical forces of the regex. **Pass**. — Jon, Oct 27 '11 at 23:03
@Jon, I don't understand why people think I am disagreeing. I'm not. I am open to alternatives. However I tried writing a manual parser and it went above my head pretty quickly. If you have a good start on an alternative then feel free to post it. Until then, you are welcome to "**Pass**". — CatDadCode, Oct 28 '11 at 14:19

score 3 · Accepted Answer · edited Oct 28 '11 at 14:58

I did a program that can parse your strings in a debugable, developer-friendly way. It is not a small code like those regexes, but it has a positive side: you can debug it, and fine tune it as you need.

The implementation is a descent recursive parser, but if you need some kind of contextual data, you can place it all inside the ParseContext class.

It is quite long, but I consider it as being better than a a regex based solution.

To test it, create a console application, and replace all the code inside Program.cs with the following code:

using System.Collections.Generic;
namespace q7922337
{
    static class Program
    {
        static void Main(string[] args)
        {
            var result1 = Match.ParseList<TagsGroup>("[code]This is a successful match.[/code]");
            var result2 = Match.ParseList<TagsGroup>("[code]This is an [ unsuccessful match.[/code]");
            var result3 = Match.ParseList<TagsGroup>("[code]This is also an [unsuccessful] match.[/code]");
            var result4 = Match.ParseList<TagsGroup>(@"
                        [code]  
                            some code  
                            [b]some text [url=http://www.google.com]some link[/url][/b]  
                        [/code]");
        }
        class ParseContext
        {
            public string Source { get; set; }
            public int Position { get; set; }
        }
        abstract class Match
        {
            public override string ToString()
            {
                return this.Text;
            }
            public string Source { get; set; }
            public int Start { get; set; }
            public int Length { get; set; }
            public string Text { get { return this.Source.Substring(this.Start, this.Length); } }
            protected abstract bool ParseInternal(ParseContext context);
            public bool Parse(ParseContext context)
            {
                var result = this.ParseInternal(context);
                this.Length = context.Position - this.Start;
                return result;
            }
            public bool MarkBeginAndParse(ParseContext context)
            {
                this.Start = context.Position;
                var result = this.ParseInternal(context);
                this.Length = context.Position - this.Start;
                return result;
            }
            public static List<T> ParseList<T>(string source)
                where T : Match, new()
            {
                var context = new ParseContext
                {
                    Position = 0,
                    Source = source
                };
                var result = new List<T>();
                while (true)
                {
                    var item = new T { Source = source, Start = context.Position };
                    if (!item.Parse(context))
                        break;
                    result.Add(item);
                }
                return result;
            }
            public static T ParseSingle<T>(string source)
                where T : Match, new()
            {
                var context = new ParseContext
                {
                    Position = 0,
                    Source = source
                };
                var result = new T { Source = source, Start = context.Position };
                if (result.Parse(context))
                    return result;
                return null;
            }
            protected List<T> ReadList<T>(ParseContext context)
                where T : Match, new()
            {
                var result = new List<T>();
                while (true)
                {
                    var item = new T { Source = this.Source, Start = context.Position };
                    if (!item.Parse(context))
                        break;
                    result.Add(item);
                }
                return result;
            }
            protected T ReadSingle<T>(ParseContext context)
                where T : Match, new()
            {
                var result = new T { Source = this.Source, Start = context.Position };
                if (result.Parse(context))
                    return result;
                return null;
            }
            protected int ReadSpaces(ParseContext context)
            {
                int startPos = context.Position;
                int cnt = 0;
                while (true)
                {
                    if (startPos + cnt >= context.Source.Length)
                        break;
                    if (!char.IsWhiteSpace(context.Source[context.Position + cnt]))
                        break;
                    cnt++;
                }
                context.Position = startPos + cnt;
                return cnt;
            }
            protected bool ReadChar(ParseContext context, char p)
            {
                int startPos = context.Position;
                if (startPos >= context.Source.Length)
                    return false;
                if (context.Source[startPos] == p)
                {
                    context.Position = startPos + 1;
                    return true;
                }
                return false;
            }
        }
        class Tag : Match
        {
            protected override bool ParseInternal(ParseContext context)
            {
                int startPos = context.Position;
                if (!this.ReadChar(context, '['))
                    return false;
                this.ReadSpaces(context);
                if (this.ReadChar(context, '/'))
                    this.IsEndTag = true;
                this.ReadSpaces(context);
                var validName = this.ReadValidName(context);
                if (validName != null)
                    this.Name = validName;
                this.ReadSpaces(context);
                if (this.ReadChar(context, ']'))
                    return true;
                context.Position = startPos;
                return false;
            }
            protected string ReadValidName(ParseContext context)
            {
                int startPos = context.Position;
                int endPos = startPos;
                while (char.IsLetter(context.Source[endPos]))
                    endPos++;
                if (endPos == startPos) return null;
                context.Position = endPos;
                return context.Source.Substring(startPos, endPos - startPos);
            }
            public bool IsEndTag { get; set; }
            public string Name { get; set; }
        }
        class TagsGroup : Match
        {
            public TagsGroup()
            {
            }
            protected TagsGroup(Tag openTag)
            {
                this.Start = openTag.Start;
                this.Source = openTag.Source;
                this.OpenTag = openTag;
            }
            protected override bool ParseInternal(ParseContext context)
            {
                var startPos = context.Position;
                if (this.OpenTag == null)
                {
                    this.ReadSpaces(context);
                    this.OpenTag = this.ReadSingle<Tag>(context);
                }
                if (this.OpenTag != null)
                {
                    int textStart = context.Position;
                    int textLength = 0;
                    while (true)
                    {
                        Tag tag = new Tag { Source = this.Source, Start = context.Position };
                        while (!tag.MarkBeginAndParse(context))
                        {
                            if (context.Position >= context.Source.Length)
                            {
                                context.Position = startPos;
                                return false;
                            }
                            context.Position++;
                            textLength++;
                        }
                        if (!tag.IsEndTag)
                        {
                            var tagGrpStart = context.Position;
                            var tagGrup = new TagsGroup(tag);
                            if (tagGrup.Parse(context))
                            {
                                if (textLength > 0)
                                {
                                    if (this.Contents == null) this.Contents = new List<Match>();
                                    this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
                                    textStart = context.Position;
                                    textLength = 0;
                                }
                                this.Contents.Add(tagGrup);
                            }
                            else
                            {
                                textLength += tag.Length;
                            }
                        }
                        else
                        {
                            if (tag.Name == this.OpenTag.Name)
                            {
                                if (textLength > 0)
                                {
                                    if (this.Contents == null) this.Contents = new List<Match>();
                                    this.Contents.Add(new Text { Source = this.Source, Start = textStart, Length = textLength });
                                    textStart = context.Position;
                                    textLength = 0;
                                }
                                this.CloseTag = tag;
                                return true;
                            }
                            else
                            {
                                textLength += tag.Length;
                            }
                        }
                    }
                }
                context.Position = startPos;
                return false;
            }
            public Tag OpenTag { get; set; }
            public Tag CloseTag { get; set; }
            public List<Match> Contents { get; set; }
        }
        class Text : Match
        {
            protected override bool ParseInternal(ParseContext context)
            {
                return true;
            }
        }
    }
}

If you use this code, and someday find that you need optimizations because the parser has become ambiguous, then try using a dictionary in the ParseContext, take a look here for more info: http://en.wikipedia.org/wiki/Top-down_parsing in the topic Time and space complexity of top-down parsing. I find it very interesting.

This is working partially. It doesn't seem to handle the "optional values" after the equals sign like in `[url=http://www.google.com]some link[/url]`, but I do not expect you to do everything for me (unless you're so inclined :P). When I get some time I'll try to debug through and figure out what the heck is going on here so I can modify it to work with optional values. Thanks again by the way for all the effort. If I end up going your route I'll gladly mark accepted :) — CatDadCode, Oct 28 '11 at 15:19
Hi, I didn't make it parse the optional values. =\ ... but it is very simple to include this rule in the code. Any questions about the code, I'll be glad to help, as soon as I can! — Miguel Angelo, Oct 28 '11 at 15:23
I'll take a look on my lunch break and see if I can understand it. If not I'll probably be bugging you again! You've been a tremendous help. — CatDadCode, Oct 28 '11 at 15:24

score 1 · Answer 2 · answered Oct 27 '11 at 21:48

The first change is pretty simple - you can get it by changing [^][]+, which is responsible for matching the free text, to .. This seems a little crazy, perhaps, but it's actually safe, because you are using a possessive group (?> ), so all the valid tags will be matched by the first alternation - \[(?<innertag>[^][/=\s]+)[^][]*] - and cannot backtrack and break the tags.
(Remember to enable the Singleline flag, so . matches newlines)

The second requirement, [unsuccessful], seems to go against your goal it. The whole idea from the very start is not to match these unclosed tags. If you allow unclosed tags, all matches of the form \[(.*?)\].*?[/\1] become valid. Not good. At best, you can try to whitelist a few tags which are not allowed to be matched.

An example of both changes:

(?>
\[ (?<tag>[^][/=\s]+) \s*
(?: = \s* (?<val>[^][]*) \s*)?
\]
)
  (?<content>
    (?>
       \[(?:unsuccessful)\]  # self closing
       |
       \[(?<innertag>[^][/=\s]+)[^][]*]
       |
       \[/(?<-innertag>\k<innertag>)]
       |
       .
    )*
    (?(innertag)(?!))
  )
\[/\k<tag>\]

Working example on Regex Hero

"The second requirement, [unsuccessful], seems to go against your goal it. The whole idea from the very start is not to match these unclosed tags" --- Let me clarify. I would like the expression to match the outer tag ([code]), recognizing the [unsuccessful] piece as plain content of the [code] tag, realizing that [unsuccessful] does not have a [/unsuccessful] closing tag. What's happening is that the [code] tag will not be matched when [unsuccessful] is part of its content. — CatDadCode, Oct 27 '11 at 21:52
@Alex - That sort of parsing usually requires parsing of the whole document, start to end. I'll think about that one a little... — Kobi, Oct 27 '11 at 22:01
@AlexFord - I've added another answer, for the sport. Didn't explain too much, I'm afraid. Good luck `:)`. (oh, and it took me half of the time to write the pattern, and half to write the answer...) — Kobi, Oct 27 '11 at 22:53

score 0 · Answer 3 · answered Oct 27 '11 at 22:52

Ok. Here's another attempt. This one is a little more complicated.
The idea is to match the whole text from start to ext, and parse it to a single Match. While rarely used as such, .Net Balancing Groups allow you to fine tune your captures, remembering all positions and captures exactly the way you want them.
The pattern I came up with is:

\A
(?<StartContentPosition>)
(?:
    # Open tag
    (?<Content-StartContentPosition>)          # capture the content between tags
    (?<StartTagPosition>)                      # Keep the starting postion of the tag
    (?>\[(?<TagName>[^][/=\s]+)[^\]\[]*\])     # opening tag
    (?<StartContentPosition>)                  # start another content capture
    |
    # Close tag
    (?<Content-StartContentPosition>)          # capture the content in the tag
    \[/\k<TagName>\](?<Tag-StartTagPosition>)  # closing tag, keep the content in the <tag> group
    (?<-TagName>)
    (?<StartContentPosition>)                  # start another content capture
    |
    .           # just match anything. The tags are first, so it should match
                # a few if it can. (?(TagName)(?!)) keeps this in line, so
                # unmatched tags will not mess with the resul
)*
(?<Content-StartContentPosition>)          # capture the content after the last tag
\Z
(?(TagName)(?!))

Remember - the balancing group (?<A-B>) captures into A all text since B was last captured (and pops that position from B's stack).

Now you can parse the string using:

Match match = Regex.Match(sample, pattern, RegexOptions.Singleline |
                                           RegexOptions.IgnorePatternWhitespace);

Your interesting data will be on match.Groups["Tag"].Captures, which contains all tags (some of them are contained in others), and match.Groups["Content"].Captures, which contains tag's contents, and contents between tags. For example, without all blanks, it contains:

some code
some text
This is also an successful match.
This is also an [ unsuccessful match.
This is also an [unsuccessful] match.

This is pretty close to a full parsed document, but you'll still have to play with indices and length to figure out the exact order and structure of the document (though it isn't more complex than sorting all captures)

At this point I'll state what others have said - it may be a good time to write a parser, this pattern isn't pretty...

-1: I assume it works, but it makes me want to run away screaming — John Saunders, Oct 28 '11 at 03:10
@John - That seems fair, dealing with a -1 is a minor inconvenience compared to the feeling you describe `:)`. I did intend to leave two answers, I think they are different enough. — Kobi, Oct 28 '11 at 10:54
@Kobi I do not believe your answer deserves a -1. It's probably not the ideal solution, but it directly answers the question I asked. I would say as far as an answer goes it was well thought out and helpful. — CatDadCode, Oct 28 '11 at 14:23

Regular expression can't handle rogue square brackets

Edit

3 Answers3