Using Regex to split XML string before and after match

Question

I'm trying to format an XML document, so I pass a string into a method, such as:

"<foo><subfoo><subsubfoo>content</subsubfoo></subfoo><subfoo/></foo>"

And I'm trying to split it based on finding the tags. I want to split each element (a tag, or content) into a unique string, such as:

"<foo>", "<subfoo>", "<subsubfoo>", "content", "</subsubfoo>", "</subfoo>", "<subfoo/>", "</foo>"

And to this end I use the code:

string findTagString = "(?<=<.*?>)";
Regex findTag = new Regex(findTagString);
List<string> textList = findTag.Split(text).ToList();

The above code works fine, except that it doesn't split "content" into its own string. Instead I get:

"<foo>", "<subfoo>", "<subsubfoo>", "content</subsubfoo>", "</subfoo>", "<subfoo/>", "</foo>"

Is there a way to rewrite the Regex to accomplish this, so that the non-matching text is split into its own string?

Or, rephrased: Is it possible to split a string before AND after a Regex match?

WHY do you want to do this? What is the end goal? There are probably more efficient ways to do this. — Erik Philips, Jul 10 '12 at 18:50
I'm just trying to create a group containing each tag or element so I can format them and place them into a FlowDocument to load into a RichTextBox (WPF). This is just how I'm aiming to break it into parts so I can examine, format, and insert the pieces. — Canin, Jul 10 '12 at 19:01

score 4 · Accepted Answer · answered Jul 10 '12 at 18:48

Use this regex: (<.*?>)|(.+?(?=<|$)) and collect the matches into a List<string>.
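A minimal sketch of how the matches might be collected, assuming the pattern is used with Regex.Matches rather than Regex.Split. The class and method names (MatchTokenizer, SplitByMatches) are hypothetical and not from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class MatchTokenizer
{
    // Pattern from the answer: either a whole tag, or a run of text
    // that ends just before the next '<' or at the end of the input.
    private static readonly Regex Token = new Regex(@"(<.*?>)|(.+?(?=<|$))");

    public static List<string> SplitByMatches(string text)
    {
        return Token.Matches(text)
                    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
                    .Select(m => m.Value)   // the full matched text, whichever alternative matched
                    .ToList();
    }
}
```

Both alternatives consume at least one character, so the matches themselves are never empty strings.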

burning_LEGION


Thanks, that does the trick. Is there any way to remove the empty strings/not pick them up in the first place besides iterating through the list and removing empty ones? – Canin Jul 10 '12 at 18:59
You can remove empty tags recursively, or use this regex `(?<=>)([^<>]+?)(?=<)` to get the values from between the tags. – burning_LEGION Jul 10 '12 at 19:18
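As a rough illustration of the second suggestion, the sketch below applies that regex with Regex.Matches to pull out only the text that sits between a '>' and the next '<'. The names TextValueExtractor and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class TextValueExtractor
{
    // Regex from the comment: one or more non-bracket characters that are
    // preceded by '>' and followed by '<'.
    private static readonly Regex TextBetweenTags = new Regex(@"(?<=>)([^<>]+?)(?=<)");

    public static List<string> Extract(string text)
    {
        return TextBetweenTags.Matches(text)
                              .Cast<Match>()
                              .Select(m => m.Groups[1].Value)
                              .ToList();
    }
}
```

Note that this returns only the text values, not the tags, so it does not replace the tokenizing regex from the answer.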

score 2 · Answer 2 · edited Nov 03 '13 at 09:07

If you ignore the HTML specification, < and > have no special significance beyond marking where a tag starts and ends.

So this can simply be done by splitting on (?<=>)|(?=<).

This yields

<foo>
<subfoo>
<subsubfoo>
content
</subsubfoo>
</subfoo>
<subfoo/>
</foo>
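A small sketch of that split in C#, assuming Regex.Split is used and that any empty entries (for example from split points at the very start or end of the input) should be dropped. The names LookaroundSplitter and Split are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class LookaroundSplitter
{
    // Zero-width split points: just after a '>' or just before a '<'.
    private static readonly Regex Boundary = new Regex(@"(?<=>)|(?=<)");

    public static List<string> Split(string text)
    {
        return Boundary.Split(text)
                       .Where(piece => piece.Length > 0) // discard any empty entries
                       .ToList();
    }
}
```

Because both lookarounds are zero-width, no characters are consumed by the split, so every tag and every run of text is kept intact.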

zero323


answered Jul 10 '12 at 20:03

score 1 · Answer 3 · answered Jul 10 '12 at 18:51

XML is not a Regular Language (can be proven with the Pumping Lemma), therefore XML cannot be parsed with Regular Expressions.

I suggest you find a good XML library and use it.
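For illustration, here is one way the same token list could be produced with the framework's own XmlReader instead of a regex. This is only a sketch: the names XmlReaderTokenizer and Tokenize are hypothetical, attributes are left out of the rebuilt tags, and whitespace-only nodes, comments and other node types are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper; the names are illustrative only.
public static class XmlReaderTokenizer
{
    public static List<string> Tokenize(string xml)
    {
        var tokens = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Attributes are omitted here for brevity (an assumption of this sketch).
                        tokens.Add(reader.IsEmptyElement
                            ? "<" + reader.Name + "/>"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value);
                        break;
                }
            }
        }
        return tokens;
    }
}
```

With this approach, input that is not well formed (such as a missing closing tag) surfaces as an XmlException from reader.Read(), rather than as a silently wrong token list.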

Nicu Stiurca


I'm really just trying to do very basic formatting for the user so it can catch if they don't include a closing tag, or leave an attribute open. A very basic version of Notepad++'s XML view, if you will. Thus I don't care what the tag says, just that there is a tag. So the fact that the language isn't Regular isn't a real concern for my application. Otherwise you would be right. Thanks for your help, SchighSchagh. – Canin Jul 10 '12 at 18:58

score 1 · Answer 4 · answered Jul 10 '12 at 19:05

You can do this via regex or XPath, depending on the complexity of the XML.

If you want to use regular expressions, you'd probably want to do something like this:

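// Requires: using System.Text.RegularExpressions;
public static string xml = "<foo><subfoo><subsubfoo>content</subsubfoo></subfoo><subfoo/></foo>";

// Group 1 captures the tag name and group 2 the element's inner content;
// the backreference \1 forces the closing tag to match the opening one.
public static Regex re = new Regex(@"\<([A-Za-z0-9]*)\b[^>]*\>(.*?)\</\1\>");

static string GetContentViaRegex()
{
    string content = xml;
    // Keep replacing content with the inner content of the first matched
    // element until no matching tag pair is left.
    Match match = re.Match(content);
    while (match.Success)
    {
        content = match.Groups[2].Value;
        match = re.Match(content);
    }
    return content;
}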

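The regex searches for matched opening/closing tag pairs (you don't want to match something like <foo>stuff here, possibly including more tags</bar>), and you keep drilling into the matching tags until you find the innermost content. The [^>]* part tolerates attributes on the opening tag as long as no attribute value contains >, but the lazy (.*?) stops at the first closing tag with the same name, so the regex assumes an element never contains a nested element of the same name.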

If you wanted to do this via XPath, you could do something like this:

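// Requires: using System.IO; using System.Xml.XPath;
static string GetContentViaXPath()
{
    // Load the XML string and select the first text node in document order.
    var nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
    return nav.SelectSingleNode("//text()").Value;
}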

This grabs the first text node it hits in the document. (You'd want to add error checking unless you're sure the input will always be valid.)
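The error checking mentioned above might look roughly like the following sketch. It assumes that returning null is an acceptable way to signal "no text found or input not well formed"; the names SafeXPath and TryGetFirstText are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper; the names are illustrative only.
public static class SafeXPath
{
    // Returns the value of the first text node, or null if the input is
    // not well-formed XML or contains no text node.
    public static string TryGetFirstText(string xml)
    {
        try
        {
            XPathNavigator nav = new XPathDocument(new StringReader(xml)).CreateNavigator();
            XPathNavigator node = nav.SelectSingleNode("//text()");
            return node == null ? null : node.Value;
        }
        catch (XmlException)
        {
            // Thrown while loading input that is not well formed.
            return null;
        }
    }
}
```

Whether to swallow the exception or let it propagate depends on the caller; for a formatter that wants to report unclosed tags, surfacing the XmlException message may be more useful.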

Nice regex for getting a whole XML element with its subtree. Very useful when you are working with XML fragments that are not well formed, where XmlDocument or XmlReader would throw exceptions. – Daniel Bogdan, Feb 13 '15 at 15:02
