I am trying to match elements that have no child elements but do have content. Content that consists only of whitespace characters counts as no content. I need to do this in C#.
Take this XML for instance:
<1>
<2><3 /></2>
<4>
<5>This is match 1</5>
</4>
<6>
</6>
<7> </7>
<8>This is match 2</8>
</1>
So only elements 5 and 8 should match. The rest of the elements either have child elements or contain only "whitespace" (spaces, tabs, carriage returns, new lines, &nbsp;).
Note
SLaks posted:
"In general, you must not parse XML using regular expressions. Instead, use the System.Xml namespace."
Unfortunately, this is not viable in this situation. This is an application that was not made by my team, and we need to optimize it without rewriting anything (not my decision). The input is invalid XML, so I need to do this in order to make it valid. Then I can treat it as XML :)
In other words, it is a string that closely resembles XML.
This is what I have come up with so far. It accounts for everything but the "whitespace" exclusion:
Regex ElementExpression = new Regex(
    @"<(?'tag'\w+?).*>" +           // match first tag, and name it 'tag'
    @"(?'text'[^<>]*[\\S]+?)" +     // match text content, name it 'text'
    @"</\k'tag'>"                   // match last tag, denoted by 'tag'
    , RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);
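For reference, here is a minimal sketch of one way the whitespace exclusion might be expressed. It is not a definitive solution. One thing worth noting first: inside a verbatim string, `[\\S]` is a character class containing a backslash and the letter S, not the `\S` non-whitespace class. The sketch assumes that tag names consist of `\w` characters only, that text content never contains `<` or `>`, and that `&nbsp;` may appear either as the literal entity text or as the U+00A0 character (which .NET's `\s` covers). The class name `LeafElementMatcher` and the helper `MatchedTags` are hypothetical names used only for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class; not part of the original application.
public static class LeafElementMatcher
{
    public static readonly Regex ElementExpression = new Regex(
        @"<(?'tag'\w+)\b[^<>]*>" +      // opening tag (attributes allowed), name captured as 'tag'
        @"(?'text'(?:\s|&nbsp;)*" +     // skip leading whitespace and literal &nbsp; entities
        @"(?!&nbsp;)[^<>\s][^<>]*)" +   // require one real character, then the rest of the text
        @"</\k'tag'>",                  // closing tag must repeat the captured name
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the tag names of all matched elements, in document order.
    public static string[] MatchedTags(string input) =>
        ElementExpression.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["tag"].Value)
            .ToArray();

    public static void Main()
    {
        // The sample from the question, joined with new lines.
        const string sample =
            "<1>\n<2><3 /></2>\n<4>\n<5>This is match 1</5>\n</4>\n" +
            "<6>\n</6>\n<7> </7>\n<8>This is match 2</8>\n</1>";

        Console.WriteLine(string.Join(",", MatchedTags(sample))); // prints "5,8"
    }
}
```

Because the text group excludes `<` and `>`, any element with a child element fails to reach its closing tag, and the required `[^<>\s]` character rejects content made up only of whitespace or `&nbsp;`. `RegexOptions.Multiline` is left out here because the pattern uses no `^` or `$` anchors.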