2

I'm looking to remove empty elements from an XML file because the reader expects a value. These are not nil elements (xsi:nil="true") or elements without content (<Element />), as in "Deserialize Xml with empty elements in C#", but elements where the inner part is simply missing: <Element></Element>

I've tried writing my own code for removing these elements, but my code is too slow and the files are too large. The end of every item also contains this pattern, so the following regex would remove valid XML:
@"<.*></.*>"

I need some sort of regex that makes sure the two .* parts match the same text.

So:

<Item><One>1</One><Two></Two><Three>3</Three></Item>

Would change into:

<Item><One>1</One><Three>3</Three></Item>

The fact that it's all on one line makes this harder, because the end of the item comes right after the end of Three, producing the very pattern I'd like to look for.

I don't have access to the original data that would allow recreating valid xml.

MrFox

3 Answers

2

You want to capture one or more word characters inside <...> and match the closing tag with the \1 backreference, which refers to whatever the first group captured.

<(\w+)></\1>
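As a minimal sketch of how this pattern might be applied in C#, assuming the whole document is available as a string (the `EmptyTagRegex` class and `RemoveEmptyPairs` method are hypothetical names used only for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyTagRegex
{
    // Group 1 captures the tag name; \1 requires the closing tag to repeat it.
    // The regex is compiled once because it is applied to very large inputs.
    private static readonly Regex EmptyPair =
        new Regex(@"<(\w+)></\1>", RegexOptions.Compiled);

    // Removes every <Name></Name> pair found in a single pass over the input.
    public static string RemoveEmptyPairs(string xml)
    {
        return EmptyPair.Replace(xml, string.Empty);
    }
}
```

For the sample line from the question, `EmptyTagRegex.RemoveEmptyPairs` drops `<Two></Two>` and leaves `</Three></Item>` alone, because the first `<` must be followed directly by a word character and the closing name must equal the captured one. A single pass does not remove a parent that only becomes empty after its child was removed.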

See demo at regex101

bobble bubble
  • This seems to satisfy the OP's requirements, but what about the case of ``? If that's a possible scenario, Empty2 would get removed but not Empty – Dan Field Jan 20 '16 at 14:13
  • @DanField To do what you're asking for he has to loop until no more matches are found (see last paragraph of my answer). In this regex there's a missing \ but I guess it's just a typo but IMO point is that capturing is unnecessary (in this very specific case) then it just hurts performance. – Adriano Repetti Jan 20 '16 at 14:16
  • Looping through could result in a completely empty document. And I suspect that there are probably more involved rules here to consider regarding what nodes should be present even if they are empty – Dan Field Jan 20 '16 at 14:18
  • @DanField possibly, only God and OP himself know what input data are! – Adriano Repetti Jan 20 '16 at 14:19
  • Yes, my only point is that simple approaches to XML formatting rarely work out as desired - but in some cases it can work – Dan Field Jan 20 '16 at 14:23
  • I agree, even parsing HTML is _possible_ (there is an excellent example here on SO) but honestly completely outside capabilities of most of us (that's why _do not parse HTML with regex_ post is so popular). – Adriano Repetti Jan 20 '16 at 14:25
  • @DanField In *.NET Regex* there's a feature available which is called [Balancing Groups](http://www.regular-expressions.info/balancing.html). For nested empty tags [try something like this](http://goo.gl/uzmt5v): `(?:<(?<x>\w+)>|(?<-x></\1>))+(?(x)(?!))`. For allowing space inside [try like that](http://goo.gl/ZnCnbl): `(?:<(?<x>\w+)>\s*|(?<-x>\s*</\1>))+(?(x)(?!))`. I used `x` for the depth/counter. – bobble bubble Jan 20 '16 at 15:32
  • @MrFox I can add the balanced regex from my comment above to the answer, if interesting for your problem to also match nested empty tags. – bobble bubble Jan 20 '16 at 15:37
1

AFAIK there is no need to capture a group, because <a></b> (which a simple regex without capturing would also match) is invalid XML and can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing is required only if you're matching non-empty nodes, which is not your case.

Note that your regex has a problem (besides the unescaped /): . matches any character, but not every character is allowed in an XML tag name. If you absolutely want to use .*, it should be the lazy .*? and you should exclude /.

What I would do is keep the regex as simple as possible (still matching valid XML node names or, even better, only what you know is in your input data):

<\w+><\/\w+>

You should/may have a better check for the tag name; for example, \s*[\w\d]+\s* may be slightly better. A regex with fewer steps will perform better on very large files. You may also want to allow an optional newline between the opening and closing tags.

Note that you may need to loop until no more replacements are made if, for example, you have <outer><inner></inner></outer> and you want it reduced to an empty string (especially in this case, don't forget to compile your regex).
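The looping idea could be sketched roughly as follows. This assumes the input is well-formed XML held in a string, so the closing name does not need to be checked, and it allows optional whitespace (including a newline) between the tags. The `EmptyTagLoop` class and `RemoveAllEmpty` method are hypothetical names, not part of any library:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmptyTagLoop
{
    // No capture group: in well-formed XML an opening tag that is followed
    // directly by a closing tag can only be closed by its own name.
    private static readonly Regex EmptyPair =
        new Regex(@"<\w+>\s*</\w+>", RegexOptions.Compiled);

    // Repeats the replacement until a pass changes nothing, so parents that
    // become empty after their children were removed are removed as well.
    public static string RemoveAllEmpty(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            xml = EmptyPair.Replace(xml, string.Empty);
        } while (xml != previous);
        return xml;
    }
}
```

Each pass scans the whole string again, so deeply nested empty elements cost one extra scan per nesting level. As the comments on the other answer point out, looping like this can also reduce a document to nothing if every element turns out to be empty.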

Adriano Repetti
0

Use LINQ to XML:

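using System.Linq;
using System.Xml.Linq;
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
// Rebuild the root, keeping only the child elements whose value is not empty.
item = new XElement("Item", item.Elements().Where(x => x.Value.Length != 0));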
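A variation on the same idea, sketched here as an assumption rather than as part of the original answer, removes the empty elements in place instead of rebuilding the root, which keeps nested structure intact. The `EmptyElementRemover` class and `RemoveEmpty` method are hypothetical names; `Descendants`, the `Remove` extension method, and `SaveOptions.DisableFormatting` come from `System.Xml.Linq`:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EmptyElementRemover
{
    // Parses the XML, removes every element that has no child elements and
    // no text, and returns the result without added indentation.
    public static string RemoveEmpty(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Remove() snapshots the matching elements before detaching them,
        // so the tree is not modified while it is being enumerated.
        root.Descendants()
            .Where(e => !e.HasElements && e.Value.Length == 0)
            .Remove();

        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

This loads the whole document into memory, which may matter for the very large files mentioned in the question, and it also removes self-closing elements such as `<Two />`, since they have an empty value as well.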
jdweng