
I have a 3GB XML file. I need to move nodes so that they become children of another node. Loading a file this large as an XmlDocument is not efficient. I see XmlReader is another approach, but I'm not sure exactly how it would work in my scenario, or what other classes I should be using to do this.

I need to move each alias node into its related customer > name node:

<customer>
<name><first>Robert</first></name>
<alias>Rob</alias>
</customer>
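So (as both answers below interpret it) the desired result for each customer would look like this:

```xml
<customer>
<name><first>Robert</first><alias>Rob</alias></name>
</customer>
```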
Sam Seth
    A 3 *giga*-byte XML file? Maybe splitting this file into smaller files would be a good first step. –  Feb 02 '18 at 02:07
  • Does a 64-bit app let you read the XML in using `XDocument`? – Enigmativity Feb 02 '18 at 02:08
  • Something like this maybe? [Automating replacing tables from external files](https://stackoverflow.com/q/28891440/3744182) – dbc Feb 02 '18 at 02:15

2 Answers


What you can do is take the basic logic for streaming from an XmlReader to an XmlWriter described in Mark Fussell's article Combining the XmlReader and XmlWriter classes for simple streaming transformations, and use it to transform your 3GB file into a modified file in which the <alias> nodes have been relocated under the <name> nodes. An example of such a streaming transformation is given in this answer to Automating replacing tables from external files.

Using that answer as a basis, grab the classes XmlReaderExtensions, XmlWriterExtensions, XmlStreamingEditorBase and XmlStreamingEditor from it and subclass XmlStreamingEditor to create CustomerAliasXmlEditor as follows:

using System.Xml;
using System.Xml.Linq;

class CustomerAliasXmlEditor : XmlStreamingEditor
{
    // Confirm that the <customer> element is not in any namespace.
    static readonly XNamespace customerNamespace = ""; 
    
    public static void TransformFromTo(string fromFilePath, XmlReaderSettings readerSettings, string toFilePath, XmlWriterSettings writerSettings)
    {
        using (var xmlReader = XmlReader.Create(fromFilePath, readerSettings))
        using (var xmlWriter = XmlWriter.Create(toFilePath, writerSettings))
        {
            new CustomerAliasXmlEditor(xmlReader, xmlWriter).Process();
        }
    }

    public CustomerAliasXmlEditor(XmlReader reader, XmlWriter writer)
        : base(reader, writer, ShouldTransform, Transform)
    {
    }

    static bool ShouldTransform(XmlReader reader)
    {
        return reader.GetElementName() == customerNamespace + "customer";
    }

    static void Transform(XmlReader from, XmlWriter to)
    {
        var customer = XElement.Load(from);
        var alias = customer.Element(customerNamespace + "alias");
        if (alias != null)
        {
            var name = customer.Element(customerNamespace + "name");
            if (name == null)
            {
                name = new XElement(customerNamespace + "name");
                customer.Add(name);
            }
            alias.Remove();
            name.Add(alias);
        }
        customer.WriteTo(to);
    }
}
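The heart of XmlStreamingEditor's Process() is a loop that copies nodes from the reader to the writer one at a time, handing off to Transform only when ShouldTransform matches. Here is a minimal self-contained sketch of that loop (the names and the WriteShallowNode helper are illustrative, not the exact code from the linked answer):

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class StreamingSketch
{
    // Illustrative version of the streaming loop; the real implementation
    // lives in XmlStreamingEditorBase from the linked answer.
    public static void Process(XmlReader reader, XmlWriter writer,
        Func<XmlReader, bool> shouldTransform, Action<XmlReader, XmlWriter> transform)
    {
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && shouldTransform(reader))
            {
                // Loads just this one element into memory, writes the rewritten
                // version, and leaves the reader positioned past the element.
                transform(reader, writer);
            }
            else
            {
                WriteShallowNode(writer, reader); // copy the current node through unchanged
                reader.Read();
            }
        }
    }

    // Copies a single node (not its children) from reader to writer.
    static void WriteShallowNode(XmlWriter writer, XmlReader reader)
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, true);
                if (reader.IsEmptyElement)
                    writer.WriteEndElement();
                break;
            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                break;
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            // Comments, CDATA, processing instructions etc. omitted for brevity.
        }
    }
}
```

Because only one node (or, inside Transform, one complete <customer> element) is ever held in memory at a time, peak memory use is proportional to the largest <customer> element rather than to the 3GB file.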

Then if fromFileName is the name of your current 3GB XML file and toFileName is the name of the file to which to output the transformed XML, you can do:

var readerSettings = new XmlReaderSettings { IgnoreWhitespace = true };
var writerSettings = new XmlWriterSettings { Indent = false }; // Or true if you prefer.

CustomerAliasXmlEditor.TransformFromTo(fromFileName, readerSettings, toFileName, writerSettings);

Sample working .NET fiddle showing that the XML

<Root>
<Item>
<SubItem>
<customer>
<name><first>Robert</first></name>
<alias>Rob</alias>
</customer>
</SubItem>
</Item>
<Item></Item>
</Root>

Is transformed to

<Root>
  <Item>
    <SubItem>
      <customer>
        <name>
          <first>Robert</first>
          <alias>Rob</alias>
        </name>
      </customer>
    </SubItem>
  </Item>
  <Item />
</Root>
dbc
  • I like your solution a lot, but I think it makes a big assumption about the structure of the xml, and I do not think it will work with a huge file. We should get a sample of the structure from the op. Suspecting the xml file contains a node , the solution would be to write the output xml file up to and including , then do a two-pass parse of the file. First pass: create a group table of elements by user name. Second pass: write the groups. The file is so large that maybe use the Position property of the XmlReader base stream to locate each record. This solution may be slow but will avoid the mem err. – jdweng Feb 02 '18 at 06:14

I don't really understand exactly what transformation you want to perform, but assuming that @dbc's guess is correct, you could do it with a streaming XSLT 3.0 processor like this:

<xsl:transform version="3.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:mode streamable="yes" on-no-match="shallow-copy"/>

<xsl:template match="customer">
  <xsl:apply-templates select="copy-of(.)" mode="local"/>
</xsl:template>

<xsl:mode name="local" streamable="no" on-no-match="shallow-copy"/>

<xsl:template match="name" mode="local">
  <name>
    <xsl:apply-templates mode="local"/>
    <xsl:copy-of select="../alias"/>
  </name>
</xsl:template>

<xsl:template match="alias" mode="local"/>

</xsl:transform>

What's happening here is that everything gets copied in pure streaming mode (tag for tag) until we hit a customer element. When we encounter a customer element we make an in-memory copy of the element and transform it locally using a conventional non-streaming transformation. So the amount of memory needed is just enough to hold the largest customer element.
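To actually run this you need an XSLT 3.0 processor that supports streaming; at the time of writing that effectively means Saxon-EE (the open-source Saxon-HE edition does not implement streaming). A typical command-line invocation, with placeholder file names, would look like:

```shell
# Requires Saxon-EE on the classpath; file names are placeholders.
java -cp saxon-ee.jar net.sf.saxon.Transform -s:customers.xml -xsl:move-alias.xsl -o:customers-out.xml
```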

Michael Kay