Removing empty elements from xml with regex that matches a sequence twice

Question

I'm looking to remove empty elements from an XML file because the reader expects a value. These are not nil elements (xsi:nil="true") or elements without content (<Element />), as covered in "Deserialize Xml with empty elements in C#", but elements whose inner part is simply missing: <Element></Element>

I've tried writing my own code to remove these elements, but it is too slow and the files are too large. A naive pattern does not work, because the end of every item also matches it. So the following regex would remove valid XML:
@"<.*></*>"

I need some sort of regex that makes sure the text matched by the two * is the same.

So:

<Item><One>1</One><Two></Two><Three>3</Three></Item>

Would change into:

<Item><One>1</One><Three>3</Three></Item>

The fact that it's all on one line makes this harder, because the end of the item comes right after the end of Three, producing the very pattern I'd like to look for.

I don't have access to the original data that would allow recreating valid xml.

You need to capture: `<(\w+)><\/\1>` [see demo](https://regex101.com/r/kV0vP8/1). — bobble bubble, Jan 20 '16 at 13:21
Just run a `@"<(\w+)></\1>"` -> "" replacement in a loop until no match is found. To account for weird tags, you can add some more characters: `@"<([\w:.-]+)></\1>"`. — Wiktor Stribiżew, Jan 20 '16 at 13:22
AFAIK **there is no need to capture**: `<a></b>` is just **invalid XML**. Change your regex so it only matches valid tags. Something like: `<\w+><\/\w+>` (with a better check for the tag name, of course...) — Adriano Repetti, Jan 20 '16 at 13:28
You shouldn't use regex on a valid xml string. It is not efficient and using xml linq is the recommended solution. You must have a schema that is rejecting the xml. Try reading xml without schema. — jdweng, Jan 20 '16 at 13:47
@jdweng that's valid XML (and empty node) and I'd expect regex to be many many times faster than LINQ to XML (especially for very simple regex and very big XML files). More often than not LINQ is nice but not fast. — Adriano Repetti, Jan 20 '16 at 13:53
Regex is not very efficient . There is lots of nested recursion in the algorithm. XML Linq is a straight search algorithm. — jdweng, Jan 20 '16 at 15:55

bobble bubble · Accepted Answer · 2016-01-20T13:30:55.597

2

You want to capture one or more word characters inside <...>
and match the closing tag by using the \1 backreference to what was captured by the first group.

<(\w+)></\1>

See demo at regex101
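
As a minimal sketch of how the pattern above might be applied in C#, the following wraps it in a single `Regex.Replace` call. The class and method names are hypothetical, and it assumes tag names consist only of word characters and carry no attributes.

```csharp
using System.Text.RegularExpressions;

public static class EmptyElementRemover
{
    // Hypothetical helper: removes <Tag></Tag> pairs whose names match exactly.
    public static string RemoveEmptyElements(string xml)
    {
        // <(\w+)> captures the tag name; </\1> requires the same name
        // in the closing tag, so "</One><Two>" style boundaries never match.
        return Regex.Replace(xml, @"<(\w+)></\1>", string.Empty);
    }
}
```

Calling `EmptyElementRemover.RemoveEmptyElements` on the sample from the question removes only the `Two` element. A single pass does not remove a parent that becomes empty after its child is removed.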

edited Jan 20 '16 at 13:30

answered Jan 20 '16 at 13:25

bobble bubble


This seems to satisfy the OP's requirements, but what about the case of `<Empty><Empty2></Empty2></Empty>`? If that's a possible scenario, Empty2 would get removed but not Empty – Dan Field Jan 20 '16 at 14:13
@DanField To do what you're asking for, he has to loop until no more matches are found (see the last paragraph of my answer). This regex is missing a \, but I guess that's just a typo. IMO the point is that capturing is unnecessary (in this very specific case), so it just hurts performance. – Adriano Repetti Jan 20 '16 at 14:16
Looping through could result in a completely empty document. And I suspect that there are probably more involved rules here to consider regarding what nodes should be present even if they are empty – Dan Field Jan 20 '16 at 14:18
@DanField possibly, only God and OP himself know what input data are! – Adriano Repetti Jan 20 '16 at 14:19
Yes, my only point is that simple approaches to XML formatting rarely work out as desired - but in some cases it can work – Dan Field Jan 20 '16 at 14:23
I agree, even parsing HTML is _possible_ (there is an excellent example here on SO) but honestly completely outside capabilities of most of us (that's why _do not parse HTML with regex_ post is so popular). – Adriano Repetti Jan 20 '16 at 14:25
@DanField In *.NET Regex* there's a feature called [Balancing Groups](http://www.regular-expressions.info/balancing.html). For nested empty tags [try something like this](http://goo.gl/uzmt5v): `(?:<(?<x>\w+)>|(?<-x></\1>))+(?(x)(?!))`. To allow whitespace inside, [try this](http://goo.gl/ZnCnbl): `(?:<(?<x>\w+)>\s*|(?<-x>\s*</\1>))+(?(x)(?!))`. I used `x` for the depth/counter. – bobble bubble Jan 20 '16 at 15:32
@MrFox I can add the balanced regex from my comment above to the answer, if interesting for your problem to also match nested empty tags. – bobble bubble Jan 20 '16 at 15:37
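
The balancing-group idea from the comments above can be sketched in C# roughly as follows. This is a variant, not the exact regex from the comment: it uses the named backreference `\k<x>` followed by an empty `(?<-x>)` group to pop the stack. The class and method names are hypothetical, and it assumes plain word-character tag names without attributes or whitespace.

```csharp
using System.Text.RegularExpressions;

public static class NestedEmptyElementRemover
{
    // (?<x>\w+)      pushes each opening tag name onto the capture stack x.
    // </\k<x>>       matches a closing tag named like the top of the stack,
    // (?<-x>)        then pops that entry.
    // (?(x)(?!))     fails the match unless the stack is empty at the end,
    //                so every opened tag must have been closed.
    private static readonly Regex NestedEmpty = new Regex(
        @"(?:<(?<x>\w+)>|</\k<x>>(?<-x>))+(?(x)(?!))");

    // Hypothetical helper: removes runs of empty elements, including nested ones.
    public static string RemoveNestedEmpty(string xml)
    {
        return NestedEmpty.Replace(xml, string.Empty);
    }
}
```

Unlike the simple backreference pattern, this removes `<Outer><Inner></Inner></Outer>` in one pass, which also means it can empty a whole document if nothing in it has content.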

Adriano Repetti · Answer 2 · 2016-01-20T14:28:07.150

1

AFAIK there is no need to capture any group because <a></b> (which would match a simple regex without capturing) is just invalid XML and it can't be in your file (unless you're parsing HTML, in which case, even if it can be done, I'd suggest not using regex). Capturing a group is required only if you're matching non-empty nodes, which is not your case.

Note that your regex has a problem (besides the unescaped /): you're matching any character with ., but XML tag names cannot contain arbitrary characters. If you absolutely want to use .*, then it should be the lazy .*? and you should exclude /.

What I would do is keep the regex as simple as possible (still matching valid XML node names or, even better, only what you know is in your input data):

<\w+><\/\w+>

You may want a stricter check for the tag name; for example, \s*[\w\d]+\s* may be slightly better. A regex with fewer steps will perform better on very large files. You may also want to allow an optional newline between the opening and closing tag.

Note that you may need to loop until no more replacements are done if, for example, you have <outer><inner></inner></outer> and you want it to be reduced to an empty string (especially in this case don't forget to compile your regex).
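
A minimal sketch of that loop, using the non-capturing pattern from this answer with `RegexOptions.Compiled`. The class and method names are hypothetical, and the code assumes the input is well-formed XML, so a mismatched pair such as `<a></b>` cannot occur.

```csharp
using System.Text.RegularExpressions;

public static class EmptyNodeStripper
{
    // Compiled once and reused: compilation costs more up front but
    // speeds up repeated matching over large inputs.
    private static readonly Regex EmptyNode =
        new Regex(@"<\w+></\w+>", RegexOptions.Compiled);

    // Hypothetical helper: repeats the replacement until a pass changes nothing.
    public static string StripRepeatedly(string xml)
    {
        string previous;
        do
        {
            previous = xml;
            xml = EmptyNode.Replace(xml, string.Empty);
        } while (xml.Length != previous.Length); // a removal always shortens the string
        return xml;
    }
}
```

With `<outer><inner></inner></outer>` as input, the first pass removes `inner`, the second removes `outer`, and the third finds nothing and ends the loop.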

edited Jan 20 '16 at 14:28

answered Jan 20 '16 at 13:34

Adriano Repetti


Are you saying putting everything on the same line is not valid XML? Ah, I see: it would not match when using \w, because then you can make sure it begins with just < and not </ – MrFox Jan 20 '16 at 13:46
No, that's valid XML. What is not valid is `<a></b>`: the only case which seems to _force_ you to capture a group (for empty tags!). However, the problem is that you're matching tag names with `.`. With any stricter (and valid!) match you won't have this problem. If you're looking for speed then you should prefer a simpler regex; generally speaking, the more complex it is, the slower it is. – Adriano Repetti Jan 20 '16 at 14:04
Where do you see `<a></b>` in the original posting? The issue is with empty elements, which I suspect are being rejected by a schema in the XML reader being used. Re-read the original posting. – jdweng Jan 20 '16 at 14:08
@jdweng there is not but such pattern (which yes is invalid XML) is the only reason to capture a group to match an empty node. I refer to other answers, not to post. Please re-read my answer: what I say is that you don't have to match then you don't need to capture and regex can (should) be simpler and faster. OP is asking for a specific solution to **this problem** otherwise no one would ever suggest/encourage to use regex to deal with XML... – Adriano Repetti Jan 20 '16 at 14:12

score 0 · Answer 3 · answered Jan 20 '16 at 13:35

0

Use XML Linq

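using System.Linq;
using System.Xml.Linq;
string xml = "<Item><One>1</One><Two></Two><Three>3</Three></Item>";
XElement item = XElement.Parse(xml);
// Rebuild Item, keeping only the descendants whose text value is non-empty.
item = new XElement("Item", item.Descendants().Where(x => x.Value.Length != 0));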
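
As a narrower sketch of the same LINQ to XML approach, the following removes only elements written as an explicit start/end pair with nothing in between, which is the case the question describes. The class and method names are hypothetical, and it assumes elements carrying attributes should be kept. It relies on `XElement.IsEmpty` being true only for self-closing elements such as `<Two />`, not for `<Two></Two>`.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class LinqEmptyElementRemover
{
    // Hypothetical helper: removes <Tag></Tag> elements in place and
    // leaves self-closing elements and elements with attributes alone.
    public static string RemoveEmptyPairs(string xml)
    {
        XElement root = XElement.Parse(xml);

        // Materialize the matches first; removing nodes while
        // enumerating Descendants() is not safe.
        var empties = root.Descendants()
            .Where(e => !e.IsEmpty && !e.HasAttributes && !e.Nodes().Any())
            .ToList();

        foreach (XElement e in empties)
        {
            e.Remove();
        }

        // DisableFormatting keeps the output on one line, like the input.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

This is a single pass, so a parent that only becomes empty after its children are removed stays in the document.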

answered Jan 20 '16 at 13:35

jdweng


This would match any elements that had an empty value, such as `<Element></Element>` and `<Element />` and `<Element xsi:nil="true" />`. – Dan Field Jan 20 '16 at 14:08
Read the original posting again. That is exactly what the person is looking for. – jdweng Jan 20 '16 at 14:09
I'm not so sure - on my reading they're looking only for the case of `<Element></Element>`. – Dan Field Jan 20 '16 at 14:10
This would also remove elements that had child elements but no text value in them, which may or may not be desired. – Dan Field Jan 20 '16 at 14:12
