
I would like C# code that optimally appends two XML strings, both of the same schema. I tried StreamReader/StreamWriter, File.WriteAllText, and FileStream. The problem I see is that the process uses more than 98% of physical memory, which results in an out-of-memory exception.

Is there a way to merge them optimally without getting memory exceptions? Time is not a concern for me.

If holding everything in memory is going to be a problem, what else would be better? Saving it to the file system?

Further details: here is my simple program, to give a better picture:

static void Main(string[] args)
{
    Program p = new Program();
    XmlDocument x1 = new XmlDocument();
    XmlDocument x2 = new XmlDocument();
    x1.Load("C:\\XMLFiles\\1.xml");
    x2.Load("C:\\XMLFiles\\2.xml");
    List<string> files = new List<string>();
    files.Add("C:\\XMLFiles\\1.xml");
    files.Add("C:\\XMLFiles\\2.xml");
    p.ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
    p.MergeFiles("C:\\XMLFiles\\Result.xml", x1.OuterXml, x2.OuterXml, "<Data>", "</Data>");
    Console.ReadLine();
}

public void ConsolidateFiles(List<String> files, string outputFile)
{
    var output = new StreamWriter(File.Open(outputFile, FileMode.Create));
    output.WriteLine("<Data>");
    foreach (var file in files)
    {
        var input = new StreamReader(File.Open(file, FileMode.Open));
        string line;
        while (!input.EndOfStream)
        {
            line = input.ReadLine();
            if (!line.Contains("<Data>") &&
                !line.Contains("</Data>"))
            {
                output.Write(line);
            }
        }
    }
    output.WriteLine("</Data>");
}

public void MergeFiles(string outputPath, string xmlState, string xmlFederal, string prefix, string suffix)
{
    File.WriteAllText(outputPath, prefix);
    File.AppendAllText(outputPath, xmlState);
    File.AppendAllText(outputPath, xmlFederal);
    File.AppendAllText(outputPath, suffix);
}

XML sample (<Data> </Data> is appended at the beginning and end):

XML 1: <Sections> <Section></Section> </Sections>

XML 2: <Sections> <Section></Section> </Sections>

Merged: <Data> <Sections> <Section></Section> </Sections> <Sections> <Section></Section> </Sections> </Data>

CodeMad
  • What sort of merge are you talking about? We'll need more details if we're going to be able to help you. – Jon Skeet Sep 11 '14 at 15:05
  • You can't just append 2 valid XML documents; at the very least that would result in an illegal document because it would have two root-level elements. – phoog Sep 11 '14 at 15:09
  • Indeed - sample (small) input and expected output documents would help. – Jon Skeet Sep 11 '14 at 15:12
  • @Jon; @phoog: Question is edited – CodeMad Sep 11 '14 at 15:14
  • Can `XMLWriter` handle this..? – MethodMan Sep 11 '14 at 15:16
  • Do not use XmlDocument (that parses and loads the whole 2GB), and try commenting out MergeFiles; it seems to be redundant - ConsolidateFiles already does the merge using streams. – Polyfun Sep 11 '14 at 15:19
  • @ShellShock: Yes! I provided both the code blocks just to post what I have tried so far! Memory exception in either of them! – CodeMad Sep 11 '14 at 15:20
  • What do you mean by saying you need to merge them "in memory"? I think that's not going to work by definition - if you don't have enough memory to store the xml documents you can't merge them all in-memory. You will need to have enough memory for at least the result in memory, or write to disk as you go. – Tim Copenhaver Sep 11 '14 at 15:20
  • @TimCopenhaver: I understand that memory is needed at least to the merged size. In which case, memory exception is pretty obvious. But is there a way to do it memory efficiently? (i.e. breaking the XML to discrete chunks and processing one by one? Something like that) – CodeMad Sep 11 '14 at 15:23
  • Even if you manage to merge the files, will any application be able to consume this huge resulting file? – Olivier Jacot-Descombes Sep 11 '14 at 15:24
  • @OlivierJacot-Descombes: The outcome of this is for another Queuing process. A PDF will be generated out of this merged XML – CodeMad Sep 11 '14 at 15:26
  • Yes, but this other process will run into the same memory problem. – Olivier Jacot-Descombes Sep 11 '14 at 15:28
  • @OlivierJacot-Descombes: Thanks for the comment, but I don't think that is going to be the case, because the queuing runs on a highly sophisticated hardware environment which is meant to process GBs of data. But I can't deploy the merge code onto such servers. – CodeMad Sep 11 '14 at 15:30
  • @CodeMad, you get a memory exception with both methods because in Main you are always loading all the XML into the x1/x2 XmlDocuments. That is what is killing your program, not the ConsolidateFiles method. – Polyfun Sep 11 '14 at 15:33

5 Answers


Try this: a stream-based approach which avoids loading all the XML into memory at once.

    static void Main(string[] args)
    {
        List<string> files = new List<string>();
        files.Add("C:\\XMLFiles\\1.xml");
        files.Add("C:\\XMLFiles\\2.xml");
        ConsolidateFiles(files, "C:\\XMLFiles\\Result.xml");
        Console.ReadLine();
    }

    private static void ConsolidateFiles(List<String> files, string outputFile)
    {
        using (var output = new StreamWriter(outputFile))
        {
            output.WriteLine("<Data>");
            foreach (var file in files)
            {
                using (var input = new StreamReader(File.Open(file, FileMode.Open)))
                {
                    while (!input.EndOfStream)
                    {
                        string line = input.ReadLine();
                        if (!line.Contains("<Data>") &&
                            !line.Contains("</Data>"))
                        {
                            output.Write(line);
                        }
                    }
                }
            }
            output.WriteLine("</Data>");
        }
    }

An even better approach is to use XmlReader (http://msdn.microsoft.com/en-us/library/system.xml.xmlreader(v=vs.90).aspx). This gives you a streaming reader designed specifically for XML, rather than StreamReader, which is for reading general text.
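Something along these lines, for instance — a minimal, untested sketch (paths are taken from the question, and it assumes each input file has a single root element to be copied under <Data>):

using System.Xml;

class XmlMerge
{
    static void Main()
    {
        // Both reader and writer stream, so neither input document is ever fully in memory.
        using (var writer = XmlWriter.Create("C:\\XMLFiles\\Result.xml"))
        {
            writer.WriteStartElement("Data");
            foreach (var file in new[] { "C:\\XMLFiles\\1.xml", "C:\\XMLFiles\\2.xml" })
            {
                using (var reader = XmlReader.Create(file))
                {
                    reader.MoveToContent();          // skip the declaration and whitespace, land on the root element
                    writer.WriteNode(reader, false); // stream-copy the root element and all its children
                }
            }
            writer.WriteEndElement(); // </Data>
        }
    }
}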

Polyfun
  • Same as manhattan, I don't see any reason to use StreamReader or StreamWriter for this. C# has great handling for XML files, and you lose a lot of useful features by cutting them out and treating an XML document like a string. – Tim Copenhaver Sep 11 '14 at 16:02
  • @shellshock: Same memory issue! – CodeMad Sep 11 '14 at 16:30

Take a look here. The answer given by Teoman Soygul seems to be what you're looking for.

Gentian Kasa

This is untested, but I would do something along these lines using XmlTextReader and XmlTextWriter. You do not want to read all of the XML text into memory or store it in a string, and you do not want to use XElement/XDocument/etc. anywhere in the middle.

// Requires using System.Xml and using System.Text.
using (var writer = new XmlTextWriter("ResultFile.xml", Encoding.UTF8))
{
    writer.WriteStartDocument();
    writer.WriteStartElement("Data");
    using (var reader = new XmlTextReader("XmlFile1.xml"))
    {
        reader.MoveToContent();         // position on the root element
        writer.WriteNode(reader, true); // stream-copy it under <Data>
    }
    using (var reader = new XmlTextReader("XmlFile2.xml"))
    {
        reader.MoveToContent();
        writer.WriteNode(reader, true);
    }
    writer.WriteEndElement(); // </Data>
    writer.WriteEndDocument();
}

Again, no guarantees that this exact code will work as-is, but I think that is the idea you're looking for. Stream data from File1 first and write it directly out to the result file, then stream data from File2 and write it out. WriteNode copies the element the reader is positioned on, together with all of its children, straight to the writer, so at no point is a full XML file in memory.

Tim Copenhaver
  • No luck! same memory issue – CodeMad Sep 12 '14 at 15:08
  • Can you post the exact code you're using? This code uses streams, so it should not take a significant amount of memory. The same is true for ShellShock's solution - there is something else going on. – Tim Copenhaver Sep 12 '14 at 15:44

If you are running on 64-bit Windows, try this: go to your project properties -> Build tab -> Platform target: change "Any CPU" to "x64".

This solved my problem when loading huge XML files into memory.
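For reference, a sketch of the equivalent settings: the csproj property is the MSBuild counterpart of the Build-tab change, and the app.config entry is my own addition (it requires .NET 4.5+ and is not something this answer mentions):

<!-- .csproj: the MSBuild equivalent of the Build-tab change -->
<PropertyGroup>
  <PlatformTarget>x64</PlatformTarget>
</PropertyGroup>

<!-- app.config: additionally lifts the 2 GB single-object limit,
     which can matter when a whole document is loaded at once -->
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>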

BTC

You have to go to the file system, unless you have lots of RAM. One simple approach:

File.WriteAllText("output.xml", "<Data>");
File.AppendAllText("output.xml", File.ReadAllText("xml1.xml"));
File.AppendAllText("output.xml", File.ReadAllText("xml2.xml"));
File.AppendAllText("output.xml", "</Data>");

Another:

var fNames = new[] { "xml1.xml", "xml2.xml" };
string line;
using (var writer = new StreamWriter("output.xml"))
{
    writer.WriteLine("<Data>");
    foreach (var fName in fNames)
    {
        using (var file = new System.IO.StreamReader(fName))
        {
            while ((line = file.ReadLine()) != null)
            {
                writer.WriteLine(line);
            }
        }
    }
    writer.WriteLine("</Data>");
}

All of this is on the premise that there is no declaration, schema reference, or extra wrapper tag inside xml1.xml and xml2.xml. If there is, just add code to omit it, as in the sketch below.
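For instance, here is one hedged tweak of the inner loop above, assuming each input file's XML declaration sits on its own line (the "<?xml" check is illustrative only):

while ((line = file.ReadLine()) != null)
{
    // Skip each input file's own declaration so the merged file stays well-formed.
    if (line.TrimStart().StartsWith("<?xml"))
        continue;
    writer.WriteLine(line);
}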

dariogriffo
  • I don't think StreamReader or StreamWriter should be used directly for this. You lose all of the XML handling features by treating everything like a simple string. – Tim Copenhaver Sep 11 '14 at 16:02
  • @TimCopenhaver the bottleneck here is memory, and creating the minimum possible number of objects is best. That is why I proposed 2 different solutions without parsing anything at all. – dariogriffo Sep 11 '14 at 16:06
  • That's halfway true. Really, creating a few new classes doesn't have any noticeable impact on memory usage. The real goal is to avoid reading all of the XML data in at once. Your first solution will still take too much memory, but your second one does meet the streaming goal. It will technically work; I just don't think it's the best solution, since it treats everything like simple text. – Tim Copenhaver Sep 11 '14 at 16:10
  • I would accept your thoughts on manipulating XML if he needed to merge the nodes with some criteria (ordering them or something else), but he asks just to append both files inside one file, surrounded by a single tag. – dariogriffo Sep 11 '14 at 16:22