How to clean up XML attributes using regex?

Question

I'd like to remove all the attributes from my XML structure. My choice is regex but if there's a simpler way, I'm wide open for suggestions.

To pick out a single, fixed tag I used the following.

String clean = Regex.Replace(filled, ".*?<holder[^>]*?>(.*?)</holder>.*?", "$1");

That gives me the contents of the tag holder. Now I'd like to keep the text content but drop all the attributes from the inner tags. I've tried the following approach.

String plain1 = Regex.Replace(clean, "(<[^>]*?>)(.*?)(</[^>]*?>)", "$1$2$3");
String plain2 = Regex.Replace(clean, "(<[a-zA-Z]*?)([^>]*?)(>)", "$1$3");

But plain1 gives me exactly the same text back, and plain2 gives me empty tags with none of the original names. Either nothing gets cleaned up or everything does. What am I doing wrong?

I've noticed that changing the star to a plus gives me tags that contain only the first letter of the names, so I'm pretty sure that the following is the right way to go as long as I can make the section captured as $1 as large as possible. How do I do that?

String plain3 = Regex.Replace(clean, "(<[a-zA-Z]+?)([^>]*?)(>)", "$1$3");

Please don't use Regex for anything XML related. There are many better solutions out there. — Jason Watkins, Mar 23 '13 at 22:46
Care to mention three of them in descending order of popularity and appropriateness? — , Mar 23 '13 at 23:00
This subject has already been exhaustively covered on this site and others. A quick search will lead you to all you could possibly need to know. — Jason Watkins, Mar 23 '13 at 23:08
A quick search I've just made leads to regular expressions. Please support your original comment with a few examples. I'm assuming that you're talking about *XDocument* etc., but it's a very bold statement to say that it's better than regex for **any** case related to XML. — Konrad Viltersten, Mar 23 '13 at 23:16

Konrad Viltersten · Answer 1 · 2013-03-23T23:10:42.913

2

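You need to drop the question mark in the first parentheses. With `+?` the quantifier is lazy, so the first group captures as few letters as it can (just one) and the rest of the tag name is swallowed by the second group.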

String plain3 = Regex.Replace(clean, "(<[a-zA-Z]+)([^>]*?)(>)", "$1$3");
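To make the difference concrete, here is a minimal sketch that runs the lazy and the greedy variant side by side. The class name QuantifierDemo, the two helper methods, and the sample string are made up for illustration; only the two patterns come from the question and this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Lazy "+?": the first group stops after one letter, so only the
    // first letter of the tag name survives the replacement.
    public static string StripLazy(string xml) =>
        Regex.Replace(xml, "(<[a-zA-Z]+?)([^>]*?)(>)", "$1$3");

    // Greedy "+": the first group takes the whole run of letters,
    // so the full tag name is kept and only the attributes are dropped.
    public static string StripGreedy(string xml) =>
        Regex.Replace(xml, "(<[a-zA-Z]+)([^>]*?)(>)", "$1$3");

    public static void Main()
    {
        const string sample = "<item id=\"1\">text</item>";

        Console.WriteLine(StripLazy(sample));   // prints: <i>text</item>
        Console.WriteLine(StripGreedy(sample)); // prints: <item>text</item>
    }
}
```

Note that the closing tag is left untouched in both cases, because `[a-zA-Z]+` cannot match the slash that follows the `<`.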

Some observations.

You'll need to handle the closing tag. You're skipping the slash character right now.

Regex.Replace(clean, "(<[/a-zA-Z]+)([^>]*?)(>)", "$1$3");

You have no need for $2, and not really for $3 either.

Regex.Replace(clean, "(<[a-zA-Z]+)[^>]*?>", "$1>");

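There are shorter ways to express the characters of a tag name in regex. Note that \w is slightly broader than "only letters": it also matches digits and the underscore.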

Regex.Replace(clean, @"(<[\w]+)([^>]*?)(>)", "$1$3");

So in the end, you might end up with the following.

Regex.Replace(clean, @"(<[/\w]*)[^>]*?>", "$1>");
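As a quick check of that last pattern, the sketch below applies it to two small inputs. The class name TagCleaner, the method StripAttributes, and the sample strings are assumptions made for illustration. The second call shows a limitation worth knowing about: when a self-closing tag carries attributes, its trailing slash is removed together with them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagCleaner
{
    // The final pattern from this answer: keep "<" plus the tag name
    // (or "</name"), drop everything else up to the next ">".
    public static string StripAttributes(string xml) =>
        Regex.Replace(xml, @"(<[/\w]*)[^>]*?>", "$1>");

    public static void Main()
    {
        // prints: <holder><item>one</item></holder>
        Console.WriteLine(StripAttributes(
            "<holder id=\"h\"><item class=\"a\" data-x=\"1\">one</item></holder>"));

        // Caveat: the "/" of a self-closing tag with attributes is swallowed too.
        // prints: <empty>
        Console.WriteLine(StripAttributes("<empty flag=\"y\"/>"));
    }
}
```

An attribute value that itself contains a `>` character would also cut the match short, so this is only reasonable for input whose shape you control.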

edited Mar 23 '13 at 23:10

answered Mar 23 '13 at 22:46

Konrad Viltersten


No need for `?`, and why does he have to handle the closing tag? – MikeM Mar 23 '13 at 23:25
I might have been mistaken. When I tested first, I got it to fail on lazy evaluation. But it seems that you're right. Thanks for pointing that out. Closing tag needs to be handled or you'll get an empty tag, won't you? Anyway, the last suggested regex does the job, it seems. – Konrad Viltersten Mar 23 '13 at 23:49

score 2 · Answer 2 · edited May 23 '17 at 12:01

My choice is regex but if there's a simpler way, I'm wide open for suggestions.

I guess you already know this: don't try to parse XML/HTML with regex. Use a real XML parser to process XML.

I'd use LINQ to XML. It can be done easily with the help of a recursive function.

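// Load the document, strip every attribute, and write the result out.
var xDoc = XDocument.Load(fileName1);
RemoveAttributes(xDoc.Root);
xDoc.Save(fileName2);

// Removes the attributes of xRoot, then recurses into its direct children.
void RemoveAttributes(XElement xRoot)
{
    // ToList() takes a snapshot so attributes can be removed while iterating.
    foreach (var xAttr in xRoot.Attributes().ToList())
        xAttr.Remove();

    // Elements() yields only direct children; the recursion covers the deeper
    // levels, so each element is visited once (Descendants() would revisit them).
    foreach (var xElem in xRoot.Elements())
        RemoveAttributes(xElem);
}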
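For trying this out without files on disk, here is a small self-contained sketch of the same idea working on a string. It is a non-recursive variant: it walks DescendantsAndSelf() once and calls XElement.RemoveAttributes() on each element. The class name AttributeStripper, the method Strip, and the sample input are hypothetical and not part of the answer above.

```csharp
using System;
using System.Xml.Linq;

public static class AttributeStripper
{
    // Parses the XML, removes the attributes of the root and of every
    // descendant element, and returns the result without extra whitespace.
    public static string Strip(string xml)
    {
        var root = XElement.Parse(xml);

        foreach (var element in root.DescendantsAndSelf())
            element.RemoveAttributes();

        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // prints: <holder><item>one</item></holder>
        Console.WriteLine(Strip(
            "<holder id=\"h\"><item class=\"a\">one</item></holder>"));
    }
}
```

Because the parser understands the document structure, attribute values that contain `>` and self-closing elements do not need special handling here.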

score 1 · Answer 3 · answered Mar 23 '13 at 23:18

Please don't use regex for this.

Here is a sample of how you can achieve it with pure XML (the first half is just console plumbing; the method you need is ProcessNode):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

internal static class Program
{
    public static void Main(string[] args)
    {
        var xmlFile = XElement.Load(@"c:\file.xml"); // Use your file here
        var blockquote = xmlFile.XPathSelectElement("/");

        var doc = new XDocument();
        doc.Add(new XElement("root"));
        var processedNodes = ProcessNode(blockquote);
        foreach (var node in processedNodes)
        {
            doc.Root.Add(node);
        }

        var sb = new StringBuilder();
        var settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        settings.Encoding = Encoding.UTF8;
        settings.Indent = true;
        using (var sw = XmlWriter.Create(sb, settings))
        {
            doc.WriteTo(sw);
        }

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(sb);
    }

    private static IEnumerable<XNode> ProcessNode(XElement parent)
    {
        foreach (var node in parent.Nodes())
        {
            if (node is XText)
            {
                yield return node;
            }
            else if (node is XElement)
            {
                var container = (XElement)node;
                var copy = new XElement(container.Name.LocalName);
                var children = ProcessNode(container);
                foreach (var child in children)
                {
                    copy.Add(child);
                }
                yield return copy;
            }
        }
    }
}

While I do agree that regex is evil and should rarely be used, in this particular example it offers a solution in one line of code. The *XDocument* example needs scrolling... :) — Konrad Viltersten, Mar 23 '13 at 23:20
