Replace part of large XML file

Question

I have a large XML file, and I need to replace every element with a given name (together with all of its inner elements) with another element. For example, if the element is e:

<a>
<b></b>
<e>
   <b></b>
   <c></c>
</e>
</a>

After replacing e with elem:

<a>
<b></b>
<elem></elem>
</a>

update: I tried to use XDocument, but the XML is larger than 2 GB and I get a System.OutOfMemoryException.

update2: my code, but the XML is not transformed:

XmlReader reader = XmlReader.Create("xml_file.xml");
XmlWriter wr = XmlWriter.Create(Console.Out);
while (reader.Read())
   {
       if (reader.NodeType == XmlNodeType.Element && reader.Name == "e")
       {
           wr.WriteElementString("elem", "val1");
           reader.ReadSubtree();
       }
            wr.WriteNode(reader, false);
   }
wr.Close();

update 3:

<a>
<b></b>
<e>
   <b></b>
   <c></c>
</e>
<i>
  <e>
    <b></b>
    <c></c>
  </e>
</i> 
</a>

One easy way I can think of is to just parse the XML and, when you find an element that you want to replace, remove it from its parent's ChildNodes and add the new empty element in its place. — Adrian Buzea, Jul 14 '15 at 12:07
You might want to expand on the issue. At face value, what you are asking is trivial, so any additional information about what your problem is could help. Did you get any error messages? What does your code look like? (If you don't have any code yet, you should start by writing it, and *then*, when it does not work, show it and ask for help here.) — Andrew Savinykh, Jul 14 '15 at 12:09
You'll need to do something like [this](http://blogs.msdn.com/b/mfussell/archive/2005/02/12/371546.aspx). — Charles Mager, Jul 14 '15 at 13:34

Charles Mager · Accepted Answer · 2015-07-15T09:33:43.637

Taking inspiration from this blog post, you can basically just stream the contents of the XmlReader straight to the XmlWriter similarly to your example code, but handling all node types. Using WriteNode, as in your example code, will add the node and all child nodes, so you wouldn't be able to handle each descendant in your source XML.

In addition, you need to make sure you read to the end of the element you want to skip - ReadSubtree creates an XmlReader for this, but it doesn't actually do any reading. You need to ensure this is read to the end.
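The behaviour described above can be checked with a small sketch. The class and method names here (`ReadSubtreeDemo`, `Describe`) and the sample XML are made up for illustration; only `ReadToFollowing`, `ReadSubtree`, `ReadState`, `Read`, `NodeType` and `Name` are real `XmlReader` members.

```csharp
using System.IO;
using System.Xml;

static class ReadSubtreeDemo
{
    // Reports the subtree reader's state right after creation, and where the
    // parent reader ends up once the subtree has been read to the end and closed.
    public static string Describe()
    {
        const string xml = "<a><e><b></b><c></c></e><b></b></a>";
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing("e");   // position the parent reader on <e>

            string stateBefore;
            using (XmlReader subtree = reader.ReadSubtree())
            {
                // Nothing has been read yet: the new reader is still in its initial state.
                stateBefore = subtree.ReadState.ToString();

                // Drain the subtree so the parent reader moves through it.
                while (subtree.Read()) { }
            }

            // After the subtree reader is closed, the parent sits on </e>.
            return stateBefore + " -> " + reader.NodeType + " " + reader.Name;
        }
    }
}
```

Calling `ReadSubtreeDemo.Describe()` should return `Initial -> EndElement e`, which is why the next `Read()` on the parent reader continues with whatever follows the skipped element.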

The resulting code might look like this:

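// xml holds the source document; rs and ws are the XmlReaderSettings and XmlWriterSettings to use.
using (var reader = XmlReader.Create(new StringReader(xml), rs))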
using (var writer = XmlWriter.Create(Console.Out, ws))
{
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
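                if (HandleElement(reader, writer))
                {
                    // Create the subtree reader only for handled elements and read it
                    // to the end, so the main reader lands on the element's end tag.
                    using (var subTreeReader = reader.ReadSubtree())
                    {
                        ReadToEnd(subTreeReader);
                    }
                }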
                else
                {
                    writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                    writer.WriteAttributes(reader, true);
                    if (reader.IsEmptyElement)
                    {
                        writer.WriteEndElement();
                    }
                }
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.CDATA:
                writer.WriteCData(reader.Value);
                break;
            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
            case XmlNodeType.DocumentType:
                writer.WriteDocType(reader.Name, reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"), reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                break;
        }
    }    
}

private static void ReadToEnd(XmlReader reader)
{
    while (!reader.EOF)
    {
        reader.Read();
    }
}

Obviously put whatever your logic is inside HandleElement, returning true if the element is handled (and therefore to be ignored). The implementation for the logic in your example code would be:

private static bool HandleElement(XmlReader reader, XmlWriter writer)
{
    if (reader.Name == "e")
    {
        writer.WriteElementString("elem", "val1");
        return true;
    }

    return false;
}

Here is a working demo: https://dotnetfiddle.net/FFIBU4
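For a file that is too large to hold in a string, the same idea can be applied to streams. The following is only a sketch under some assumptions: the names `LargeXmlRewriter` and `ReplaceElements` are hypothetical, it uses `XmlReader.Skip()` instead of draining a `ReadSubtree()` reader, it writes an empty replacement element, and it does not copy DOCTYPE declarations or entity references. The input's XML declaration is not copied either, because the writer emits its own.

```csharp
using System.IO;
using System.Xml;

static class LargeXmlRewriter
{
    // Streams inputPath to outputPath, replacing each targetName element
    // (and everything inside it) with an empty replacementName element.
    public static void ReplaceElements(string inputPath, string outputPath,
                                       string targetName, string replacementName)
    {
        using (FileStream input = File.OpenRead(inputPath))
        using (FileStream output = File.Create(outputPath))
        using (XmlReader reader = XmlReader.Create(input))
        using (XmlWriter writer = XmlWriter.Create(output))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == targetName)
                {
                    writer.WriteStartElement(replacementName);
                    writer.WriteFullEndElement();   // <elem></elem> rather than <elem />
                    reader.Skip();                  // already moves past the whole subtree
                    continue;                       // so Read() must not be called again
                }

                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.ProcessingInstruction:
                        writer.WriteProcessingInstruction(reader.Name, reader.Value);
                        break;
                }
                reader.Read();
            }
        }
    }
}
```

A call such as `LargeXmlRewriter.ReplaceElements("xml_file.xml", "out.xml", "e", "elem");` (where `out.xml` is a made-up output name) reads and writes one node at a time, so memory use should not grow with the size of the file.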

@e1s Oops. Should have done some more testing. It seems `ReadOuterXml` isn't really a suitable replacement for reading the subtree reader through. I've updated the answer & demo. — Charles Mager, Jul 15 '15 at 09:34

score 1 · Answer 2 · answered Jul 14 '15 at 12:20

Try this (saw the C# tag :D):

XElement elem = new XElement("elem");
// Materialize the query first so the tree is not modified while it is being enumerated.
IEnumerable<XElement> listElementsToBeReplaced = xDocument.Descendants("e").ToList();
foreach (XElement replaceElement in listElementsToBeReplaced)
{
    replaceElement.AddAfterSelf(elem);
}
listElementsToBeReplaced.Remove();

raduchept


I can't use `xDocument` because the XML is larger than 2 GB and I get a `System.OutOfMemoryException`. Thanks for the answer – e1s Jul 14 '15 at 12:28

score 0 · Answer 3 · answered Jul 14 '15 at 12:16

I would replace it with a regular expression that matches each e element with all of its content, up to and including the closing tag, and replaces it with the new elem element. This way you can do it in any editor whose search/replace supports regular expressions, and programmatically in any language.

Hugo


I think this may need balancing groups with `Regex.Replace` – Sky Fang Jul 14 '15 at 12:18

score 0 · Answer 4 · answered Jul 14 '15 at 12:26

string xml = @"<a>
<b></b>
<e>
<b></b>
<c></c>
</e>
</a>";
string pattern = @"<e[^>]*>[\s\S]*?(((?'Open'<e[^>]*>)[\s\S]*?)+((?'-Open'</e>)[\s\S]*?)+)*(?(Open)(?!))</e>";
Console.WriteLine(Regex.Replace(xml, pattern, "<elem></elem>"));

This uses a regex with balancing groups; LINQ to XML would also work.

score 0 · Answer 5 · answered Jul 14 '15 at 12:41

// example data:
XDocument xmldoc = XDocument.Parse(
@"
<a>
<b></b>
<e>
   <b></b>
   <c></c>
</e>
<c />
<e>
   <b></b>
   <c></c>
   <c></c>
</e>
</a>
");
// you can use XPath, in which case you need to add:
// using System.Xml.XPath;
List<XElement> elementsToReplace = xmldoc.XPathSelectElements("a/e").ToList();

// or pure LINQ to XML:
// elementsToReplace = xmldoc.Elements("a").Elements("e").ToList();

foreach (XElement elem in elementsToReplace)
{
    // setting the Value of the XElement to an empty string makes the resulting XML look like this:
    // <elem></elem>
    // and not like this:
    // <elem />
    elem.ReplaceWith(new XElement("elem", ""));
    // if you don't mind self-closing tags, then:
    // elem.ReplaceWith(new XElement("elem"));
}

I didn't measure the performance but rumour has it the difference is not very significant.

XPath syntax, if you need it: http://www.w3schools.com/xpath/xpath_syntax.asp

@e1s you can manually set the stack size of your worker thread to more than 2 GB: `Thread thread = new Thread(myDoWorkMethod, stackSize);`. If memory is going to be a problem anyway, I think you should consider the reader/writer approach that Charles Mager proposed in his comment on your question. — Arie, Jul 14 '15 at 13:45
