
I'm trying to replace multiple tables in a large (~300 MB) XML file with the contents of external XML files.

The large file holds roughly 30,000 tables, but there are only about 23,000 external XML files, because some tables are left unchanged.

For example, if I had:

<?xml version="1.0" encoding="UTF-8"?>
<INI>
   <TABLE name="People">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
   </TABLE>
   <TABLE name="Animals">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Golden]]></Name>
      </ROW>
   </TABLE>
</INI>

I would have files called People.xml and Animals.xml, and the contents of each file should replace the table with the matching name.

If People.xml were:

   <TABLE name="People">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Mary]]></Name>
      </ROW>
      <ROW>
         <ID>2</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
      <ROW>
         <ID>3</ID>
         <Name><![CDATA[Dan]]></Name>
      </ROW>
   </TABLE>

then the main large XML file would become:

<?xml version="1.0" encoding="UTF-8"?>
<INI>
   <TABLE name="People">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Mary]]></Name>
      </ROW>
      <ROW>
         <ID>2</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
      <ROW>
         <ID>3</ID>
         <Name><![CDATA[Dan]]></Name>
      </ROW>
   </TABLE>
   <TABLE name="Animals">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Golden]]></Name>
      </ROW>
   </TABLE>
</INI>

and then the same for Animals.xml.

I've tried looking into String.Split(), but I couldn't find a way to do it like that.

Any help is appreciated. Thanks in advance!

Ken White
Raymonf
  • Not sure I understand the question. Are you trying to replace one large file with several small files, or recreate the large file from the several smaller files? – dbc Mar 06 '15 at 02:44
  • @dbc - Kind of the first one. See, I have a large XML file with a lot of tables, and a lot of smaller XML files that only contain the table information. I want to replace the tables in the larger file with the smaller file's information. All of the smaller files are named `(table name).xml`. – Raymonf Mar 06 '15 at 03:02
  • Side note: creating such a large XML file is ... a questionable idea. Most tools will not be able to handle it comfortably... – Alexei Levenkov Mar 06 '15 at 03:29
  • I assume you've tried the regular `XDocument` API and it did not work for you due to memory restrictions... You'll need to use the `XmlReader` API to read your "large XML" sequentially and, in parallel, write it out with `XmlWriter`, replacing the corresponding nodes with content from the separate files. Be careful if you must keep CDATA sections, as they are normally not visible to XML readers. – Alexei Levenkov Mar 06 '15 at 03:31
  • @AlexeiLevenkov - CDATA is used for names. Will XmlReader/Writer be able to handle that? I'm not sure how to execute such a thing, so that's why I put up the question. I have up to 20 GB of RAM available, so memory is not much of an issue I'd assume. :( – Raymonf Mar 06 '15 at 03:40
  • By "CDATA normally not visible to XML readers" I mean "XML readers will see Text and CDATA nodes the same": the fact that there is a CDATA section is usually ignored, but the value is obviously there. – Alexei Levenkov Mar 06 '15 at 03:56
  • The amount of available RAM usually does not matter. The limiting factor is the size of individual objects/strings (and, for a 32-bit process, the overall address space of 4 GB, but I assume you compile for x64, so that does not matter). – Alexei Levenkov Mar 06 '15 at 03:58
  • "I'm not sure how to execute such a thing" - not sure what you mean... You need to be a bit more concrete on SO... – Alexei Levenkov Mar 06 '15 at 03:59
  • Well, I'm not changing the CDATA values, but the entire table which contains CDATA. I meant that I'm not sure how to make it read from the individual smaller files and write to the larger XML file. And yes, I compile for x64. – Raymonf Mar 06 '15 at 04:51

1 Answer


You can take the basic logic of streaming an XmlReader to an XmlWriter from Mark Fussell's article "Combining the XmlReader and XmlWriter classes for simple streaming transformations" and use it to patch the contents of one XML file into another:

using System;
using System.Diagnostics;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public abstract class XmlStreamingEditorBase
{
    readonly XmlReader reader;
    readonly XmlWriter writer;
    readonly Predicate<XmlReader> shouldTransform;

    public XmlStreamingEditorBase(XmlReader reader, XmlWriter writer, Predicate<XmlReader> shouldTransform)
    {
        this.reader = reader;
        this.writer = writer;
        this.shouldTransform = shouldTransform;
    }

    protected XmlReader Reader { get { return reader; } }

    protected XmlWriter Writer { get { return writer; } }

    public void Process()
    {
        while (Reader.Read())
        {
            if (Reader.NodeType == XmlNodeType.Element)
            {
                if (shouldTransform(Reader))
                {
                    EditCurrentElement();
                    continue;
                }
            }
            Writer.WriteShallowNode(Reader);
        }
    }

    protected abstract void EditCurrentElement();
}

public class XmlStreamingEditor : XmlStreamingEditorBase
{
    readonly Action<XmlReader, XmlWriter> transform;

    public XmlStreamingEditor(XmlReader reader, XmlWriter writer, Predicate<XmlReader> shouldTransform, Action<XmlReader, XmlWriter> transform)
        : base(reader, writer, shouldTransform)
    {
        this.transform = transform;
    }

    protected override void EditCurrentElement()
    {
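        // Disposing the subtree reader moves the main reader to the end of the current
        // element, so anything the transform does not read from the original is skipped.
        using (var subReader = Reader.ReadSubtree())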
        {
            transform(subReader, Writer);
        }
    }
}

public class XmlStreamingPatcher
{
    readonly XmlReader patchReader;
    readonly XmlReader reader;
    readonly XmlWriter writer;
    readonly Predicate<XmlReader> shouldPatchFrom;
    readonly Func<XmlReader, XmlReader, bool> shouldPatchFromTo;
    bool patched = false;

    public XmlStreamingPatcher(XmlReader reader, XmlWriter writer, XmlReader patchReader, Predicate<XmlReader> shouldPatchFrom, Func<XmlReader, XmlReader, bool> shouldPatchFromTo)
    {
        if (reader == null || writer == null || patchReader == null || shouldPatchFrom == null || shouldPatchFromTo == null)
            throw new ArgumentNullException();
        this.reader = reader;
        this.writer = writer;
        this.patchReader = patchReader;
        this.shouldPatchFrom = shouldPatchFrom;
        this.shouldPatchFromTo = shouldPatchFromTo;
    }

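    // Applies the first patch element accepted by shouldPatchFrom to the main document.
    // Returns true if any patch content was written.
    public bool Process()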
    {
        patched = false;
        while (patchReader.Read())
        {
            if (patchReader.NodeType == XmlNodeType.Element)
            {
                if (shouldPatchFrom(patchReader))
                {
                    var editor = new XmlStreamingEditor(reader, writer, ShouldPatchTo, PatchNode);
                    editor.Process();
                    return patched;
                }
            }
        }
        return false;
    }

    bool ShouldPatchTo(XmlReader reader)
    {
        return shouldPatchFromTo(patchReader, reader);
    }

    void PatchNode(XmlReader reader, XmlWriter writer)
    {
        using (var subReader = patchReader.ReadSubtree())
        {
            while (subReader.Read())
            {
                writer.WriteShallowNode(subReader);
                patched = true;
            }
        }
    }
}

public static class XmlReaderExtensions
{
    public static XName GetElementName(this XmlReader reader)
    {
        if (reader == null)
            return null;
        if (reader.NodeType != XmlNodeType.Element)
            return null;
        // Use LocalName: Name includes any namespace prefix ("p:TABLE"), which XName.Get rejects.
        string localName = reader.LocalName;
        string uri = reader.NamespaceURI;
        return XName.Get(localName, uri);
    }
}

public static class XmlWriterExtensions
{
    public static void WriteShallowNode(this XmlWriter writer, XmlReader reader)
    {
        // adapted from http://blogs.msdn.com/b/mfussell/archive/2005/02/12/371546.aspx
        if (reader == null)
            throw new ArgumentNullException("reader");

        if (writer == null)
            throw new ArgumentNullException("writer");

        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                writer.WriteAttributes(reader, true);
                if (reader.IsEmptyElement)
                {
                    writer.WriteEndElement();
                }
                break;

            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;

            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;

            case XmlNodeType.CDATA:
                writer.WriteCData(reader.Value);
                break;

            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;

            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;

            case XmlNodeType.DocumentType:
                writer.WriteDocType(reader.Name, reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"), reader.Value);
                break;

            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;

            case XmlNodeType.EndElement:
                writer.WriteFullEndElement();
                break;

            default:
                Debug.WriteLine("unknown NodeType " + reader.NodeType);
                break;

        }
    }
}

To create XmlReader and XmlWriter instances that read and write XML files, use XmlReader.Create(string) and XmlWriter.Create(string). Also, be sure to stream the large file into a temporary file and replace the original only after editing is finished.

And then, to test:

public static class TestXmlStreamingPatcher
{
    public static void Test()
    {
        string mainXml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
<INI>
   <TABLE name=""People"">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
   </TABLE>
   <TABLE name=""Animals"">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Golden]]></Name>
      </ROW>
   </TABLE>
</INI>
";
        string patchXml = @"<TABLE name=""People"">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Mary]]></Name>
      </ROW>
      <ROW>
         <ID>2</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
      <ROW>
         <ID>3</ID>
         <Name><![CDATA[Dan]]></Name>
      </ROW>
   </TABLE>
";
        var patchedXml1 = TestPatch(mainXml, patchXml);
        Debug.WriteLine(patchedXml1);
    }

    private static string TestPatch(string mainXml, string patchXml)
    {
        using (var mainReader = new StringReader(mainXml))
        using (var mainXmlReader = XmlReader.Create(mainReader))
        using (var patchReader = new StringReader(patchXml))
        using (var patchXmlReader = XmlReader.Create(patchReader))
        using (var mainWriter = new StringWriter())
        {
            using (var mainXmlWriter = XmlWriter.Create(mainWriter))
            {
                var patcher = new XmlStreamingPatcher(mainXmlReader, mainXmlWriter, patchXmlReader, ShouldPatchFrom, ShouldPatchFromTo);
                patcher.Process();
            }
            return mainWriter.ToString();
        }
    }

    static bool ShouldPatchFrom(XmlReader reader)
    {
        return reader.GetElementName() == "TABLE";
    }

    static bool ShouldPatchFromTo(XmlReader patchReader, XmlReader toReader)
    {
        if (patchReader.GetElementName() != toReader.GetElementName())
            return false;
        string name = patchReader.GetAttribute("name");
        if (string.IsNullOrEmpty(name))
            return false;
        return name == toReader.GetAttribute("name");
    }
}

The output of TestXmlStreamingPatcher.Test() is

<?xml version="1.0" encoding="UTF-8"?>
<INI>
   <TABLE name="People">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Mary]]></Name>
      </ROW>
      <ROW>
         <ID>2</ID>
         <Name><![CDATA[Bob]]></Name>
      </ROW>
      <ROW>
         <ID>3</ID>
         <Name><![CDATA[Dan]]></Name>
      </ROW>
   </TABLE>
   <TABLE name="Animals">
      <ROW>
         <ID>1</ID>
         <Name><![CDATA[Golden]]></Name>
      </ROW>
   </TABLE>
</INI>

which is what you want.
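
Since each call to XmlStreamingPatcher.Process() streams the whole main document for a single patch, running it once per file would mean re-reading the 300 MB file thousands of times. As a rough alternative, here is a minimal single-pass sketch that works on files and writes to a temporary file first. It is not part of the classes above: TablePatchSketch and PatchAllTables are hypothetical names, and it assumes that every TABLE element is a direct child of the root, that each patch file is named (table name).xml as described in the question's comments, and that table names are valid file names. It relies on XmlWriter.WriteNode to copy whole subtrees, which keeps CDATA sections as CDATA.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper for illustration; not part of the classes above.
public static class TablePatchSketch
{
    // Streams mainPath once into a temp file. Each <TABLE name="X"> directly under
    // the root is replaced by the root element of patchDir/X.xml when that file exists.
    // Returns the number of tables replaced.
    public static int PatchAllTables(string mainPath, string patchDir)
    {
        string tempPath = mainPath + ".tmp"; // assumption: this file name is free to use
        int replaced = 0;

        using (var reader = XmlReader.Create(mainPath))
        using (var writer = XmlWriter.Create(tempPath))
        {
            // Copy the root start tag by hand so its children can be inspected one by one.
            // Anything before the root (comments, the original declaration) is not copied;
            // the writer emits its own XML declaration.
            reader.MoveToContent();
            writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
            writer.WriteAttributes(reader, true);
            reader.Read();

            while (!reader.EOF && reader.NodeType != XmlNodeType.EndElement)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "TABLE")
                {
                    string name = reader.GetAttribute("name");
                    string patchPath = string.IsNullOrEmpty(name)
                        ? null
                        : Path.Combine(patchDir, name + ".xml");
                    if (patchPath != null && File.Exists(patchPath))
                    {
                        using (var patch = XmlReader.Create(patchPath))
                        {
                            patch.MoveToContent();         // skip leading whitespace
                            writer.WriteNode(patch, true); // copy the replacement table
                        }
                        reader.Skip();                     // drop the original table
                        replaced++;
                        continue;
                    }
                }
                // Copies the current node (with its whole subtree) and advances the reader.
                writer.WriteNode(reader, true);
            }
            writer.WriteFullEndElement();
        }

        // Both files are closed at this point, so the original can be swapped out.
        File.Replace(tempPath, mainPath, null);
        return replaced;
    }
}
```

A call such as TablePatchSketch.PatchAllTables(mainPath, patchDir), with both paths supplied by you, would then rewrite the main file in one pass. Tables without a matching file are copied through unchanged.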

dbc
  • I'd assume I'd just do a `foreach` loop for every file in the folder with the smaller XML files, and then read from the file afterwards? – Raymonf Mar 06 '15 at 17:25
  • Yes. You can use [`XmlReader.Create(string)`](https://msdn.microsoft.com/en-us/library/w8k674bf%28v=vs.110%29.aspx) and [`XmlWriter.Create(string)`](https://msdn.microsoft.com/en-us/library/kcsse48t%28v=vs.110%29.aspx) to read and write from files. I used the stringreader/stringwriter versions for simplicity of testing. – dbc Mar 06 '15 at 17:30
  • Also, you'll need to write the large file to a temp file and then replace the original afterwards. – dbc Mar 06 '15 at 17:31
  • Got it. I didn't know what to start with. This should be a good starting point. Thanks a lot! – Raymonf Mar 06 '15 at 17:34