XML Regex Extraction

Question

I have an XML file and I need to extract data from it. This task would be trivial if I could use XDocument, but the whole point of the exercise is to write my own parser using regex. The XML looks similar to the one below:

<A>
    <B>
        <C>ASD</C>
    </B>
    <B>
        <C>ZXC</C>
    </B>
</A>

I came up with the idea of splitting the input into an opening tag, a closing tag, and the content between them.

        string acquiredFile = myStringBuilder.ToString();
        string regexPattern = "(?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>)";
        Regex rx = new Regex(regexPattern, RegexOptions.Singleline);

        foreach (Match match in rx.Matches(acquiredFile))
        {
            Console.WriteLine(match.Groups["open"].Value);
            Console.WriteLine(match.Groups["content"].Value);
            Console.WriteLine(match.Groups["close"].Value);
        }

I need to wrap this up in a loop. The extraction above only works when each element contains a single nested element, as in:

<A>
    <B>
        <C>ASD</C>
    </B>
</A>

Could you please help me extend this code so that it works with multiple nested elements?

Your code should work just fine with more than one nested element. (http://ideone.com/8iKn5i) — l'L'l, Jul 05 '14 at 07:59
Unfortunately it does not, I get <A> as opening tag, </A> as closing tag and <B><C>ASD</C></B><B><C>ZXC</C></B> as content — user2847238, Jul 05 '14 at 08:04
Did you observe the example I linked? Your input might differ perhaps, which could throw it off. — l'L'l, Jul 05 '14 at 08:05
The .NET framework has an extensive supply of means to deal with XML the proper way. You are not to use regular expressions on XML. There is no excuse for trying. Please use [an API](http://blogs.msdn.com/b/xmlteam/archive/2011/09/14/effective-xml-part-1-choose-the-right-api.aspx). — Tomalak, Jul 05 '14 at 08:12

score 2 · Answer 1 · edited May 23 '17 at 12:11

You can deal with nested elements by recursion. Wrap the code you use into a function:

static void Parse(string xml)
{
    // yourRegexp is the pattern string from the question
    var matches = Regex.Matches(xml, yourRegexp, RegexOptions.Singleline);
    if (matches.Count == 0)
    {
        // No tags left inside: this is plain text content
        Console.WriteLine("CONTENT:" + xml);
    }
    foreach (Match match in matches)
    {
        Console.WriteLine("OPEN:" + match.Groups["open"].Value);
        Parse(match.Groups["content"].Value); // recurse into the element body
        Console.WriteLine("CLOSE:" + match.Groups["close"].Value);
    }
}

However, let me discourage you a bit first:

The above approach will not work with your regex (?<open><[A-z0-9]{1,}>)(?<content>.*)(?<close></[A-z0-9]{1,}>).
The first problem, as you mentioned, is the multiple consecutive <B>...</B> tags. Your regexp will capture everything from the first <B> to the last </B> into one group, because .* is greedy.

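Now, a simple bugfix for this problem would be this regex <(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>, which will non-greedily match anything between the first <TAGNAME> and the next </TAGNAME2>, where TAGNAME and TAGNAME2 are the same string. Here open captures only the tag name, and the backreference \k<open> forces the closing tag to repeat it. (I also replaced [A-z] with [A-Za-z], since the range A-z includes punctuation such as [ and _.)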
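To see the effect of the lazy quantifier and the backreference, here is a minimal sketch. The class and method names (BackreferenceDemo, Contents) are made up for illustration, and the input is the content of <A> from the question with whitespace removed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BackreferenceDemo
{
    // Lazy content plus a named backreference: the closing tag must repeat the opening name.
    public const string Pattern = @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>";

    // Returns the content group of every match found at this nesting level.
    public static List<string> Contents(string xml)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(xml, Pattern, RegexOptions.Singleline))
        {
            result.Add(m.Groups["content"].Value.Trim());
        }
        return result;
    }

    public static void Main()
    {
        // Two consecutive <B> elements now produce two separate matches.
        foreach (string content in Contents("<B><C>ASD</C></B><B><C>ZXC</C></B>"))
        {
            Console.WriteLine(content);
        }
    }
}
```

Each <B> element is matched on its own, so the recursive function can then descend into <C>ASD</C> and <C>ZXC</C> separately.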

Looks good? Well, it is not, because this regexp will fail for nested elements with the same name, like <C><C></C></C>. The lazy match stops at the first </C>, which closes the inner element, not the outer one.
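A small sketch of that failure, using the same pattern as the bugfix above. The names SameNameNestingDemo and FirstContent and the text "inner" are only placeholders for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SameNameNestingDemo
{
    public const string Pattern = @"<(?<open>[A-Za-z0-9]+)>(?<content>.*?)</\k<open>>";

    // Returns the content group of the first match, or null if nothing matches.
    public static string FirstContent(string xml)
    {
        Match m = Regex.Match(xml, Pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["content"].Value : null;
    }

    public static void Main()
    {
        // The outer <C> really contains "<C>inner</C>",
        // but the lazy match ends at the first </C> it can find.
        Console.WriteLine(FirstContent("<C><C>inner</C></C>")); // prints "<C>inner"
    }
}
```

The captured content is no longer balanced, so recursing into it gives wrong results.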

You will continue to run into these problems. As you come up with more and more complicated regex there will always be some sort of counterexample that causes them to break.

This is because regex are the wrong tool for this sort of task. You are trying to recognize a Chomsky type 2 (context-free) grammar with a tool built for Chomsky type 3 (regular) grammars. (Also see this humorous take on the subject).
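If the exercise still requires regex, one possible way out is to use the regex only to find individual tags and to keep track of nesting yourself with a stack. The following is a rough sketch under strong assumptions, not a real XML parser: it handles only plain <NAME> and </NAME> tags (no attributes, self-closing tags, comments, CDATA or entities), and the names StackParserSketch and Parse, as well as the "path=text" output format, are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StackParserSketch
{
    // Matches one opening or closing tag; "slash" is non-empty for closing tags.
    private static readonly Regex Tag = new Regex(@"<(?<slash>/?)(?<name>[A-Za-z0-9]+)>");

    // Returns one "path=text" entry per non-empty text node, e.g. "A/B/C=ASD".
    public static List<string> Parse(string xml)
    {
        var result = new List<string>();
        var open = new Stack<string>(); // names of the currently open elements
        int pos = 0;                    // index just after the previous tag

        foreach (Match m in Tag.Matches(xml))
        {
            // Text between the previous tag and this one belongs to the innermost open element.
            string text = xml.Substring(pos, m.Index - pos).Trim();
            if (text.Length > 0 && open.Count > 0)
            {
                string[] path = open.ToArray(); // ToArray returns the top of the stack first
                Array.Reverse(path);
                result.Add(string.Join("/", path) + "=" + text);
            }

            string name = m.Groups["name"].Value;
            if (m.Groups["slash"].Value.Length == 0)
            {
                open.Push(name); // opening tag: go one level deeper
            }
            else if (open.Count == 0 || open.Pop() != name)
            {
                throw new FormatException("Unexpected closing tag </" + name + ">");
            }
            pos = m.Index + m.Length;
        }

        if (open.Count > 0)
        {
            throw new FormatException("Unclosed element <" + open.Peek() + ">");
        }
        return result;
    }

    public static void Main()
    {
        string xml = "<A><B><C>ASD</C></B><B><C>ZXC</C></B></A>";
        foreach (string entry in Parse(xml))
        {
            Console.WriteLine(entry);
        }
    }
}
```

The stack is exactly the memory that a plain regular expression lacks, which is why nested elements with the same name are no longer a problem here.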

In the end, writing a proper parser for XML is far from a simple task. That is why the usual recommendation is to always go with one of the standard ones.
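For comparison, this is roughly what the extraction looks like with XDocument from System.Xml.Linq, for the case where a standard parser is allowed. The class and method names (XDocumentDemo, ReadC) are placeholders.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class XDocumentDemo
{
    // Returns the text of every <C> element, at any depth.
    public static string[] ReadC(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Descendants("C").Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        string xml = "<A><B><C>ASD</C></B><B><C>ZXC</C></B></A>";
        foreach (string value in ReadC(xml))
        {
            Console.WriteLine(value);
        }
    }
}
```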
