
I'm working against an interface that requires an XML document. So far I've been able to serialize most of the objects using XmlSerializer. However, there is one property that is proving problematic. It is supposed to be a collection of objects that wrap a document. The document itself is encoded as a base64 string.

The basic structure is like this:

//snipped out of a parent object
public List<Document> DocumentCollection { get; set; }
//end snip

public class Document
{
    public string DocumentTitle { get; set; }
    public Code DocumentCategory { get; set; }
    /// <summary>
    /// Base64-encoded file
    /// </summary>
    public string BinaryDocument { get; set; }
    public string DocumentTypeText { get; set; }
}

The problem is that smaller values work fine, but if the document is too big the serializer just skips over that document item in the collection.

Is there some limitation that I'm bumping up against?

Update: I changed

public string BinaryDocument { get; set; }

to

public byte[] BinaryDocument { get; set; }

and I'm still getting the same result. The smaller document (~150 KB) serializes just fine, but the larger ones don't. To be clear, it's not just the value of the property that's missing: the entire containing Document object gets dropped.

UPDATE 2:

Here's the serialization code with a simple repro, taken from a console project I put together. The problem is that this code works fine in the test project. It's nearly impossible to use the actual objects in a test case because of the complexity of populating their fields, so I cut the code down in the main application instead. The populated object goes into the serialization code with DocumentCollection holding four Documents and comes out with one Document.

using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            var container = new DocumentContainer();
            var docs = new List<Document>();
            foreach (var f in Directory.GetFiles(@"E:\Software Projects\DA\Test Documents"))
            {
                var doc = new Document
                {
                    BinaryDocument = File.ReadAllBytes(f),
                    DocumentTitle = Path.GetFileName(f)
                };

                docs.Add(doc);
            }

            container.DocumentCollection = docs;

            var serializer = new XmlSerializer(typeof(DocumentContainer));
            var ms = new MemoryStream();
            var writer = XmlWriter.Create(ms);

            serializer.Serialize(writer, container);
            writer.Flush();
            ms.Seek(0, SeekOrigin.Begin);

            var reader = new StreamReader(ms, Encoding.UTF8);
            File.WriteAllText(@"C:\temp\testexport.xml", reader.ReadToEnd());
        }
    }

    public class Document
    {
        public string DocumentTitle { get; set; }
        public byte[] BinaryDocument { get; set; }
    }

    // test class
    public class DocumentContainer
    {
        public List<Document> DocumentCollection { get; set; }
    }
}
Devin Goble
  • How are you determining that the `` elements are missing? I ask because I think I successfully created a `testexport.xml` for 5 documents with an average size of 1.8 GB -- but "findstr" and "find" seem to be unable to search such a large text file. – dbc Mar 29 '16 at 22:26
  • I just open up the resulting XML file in Notepad++ and format the XML. I just need to scroll down and see that the elements that I'm looking for are missing. – Devin Goble Mar 29 '16 at 22:29

2 Answers


XmlSerializer has no limit on the length of a string it can serialize.

.NET, however, has a maximum string length of int.MaxValue. Furthermore, since a string is implemented internally as a contiguous memory buffer, in a 32-bit process you are unlikely to be able to allocate a string anywhere near that large because of address-space fragmentation. And since a base64 string requires roughly 2.67 times the memory of the byte[] array from which it was created (a 4/3 encoding expansion, times 2 because the .NET char type is two bytes), you might be getting an OutOfMemoryException while encoding a large binary document as a single base64 string, then swallowing and ignoring it, leaving the BinaryDocument property null.
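The arithmetic behind that 2.67× figure is easy to verify. A quick sketch (in Python, purely to illustrate the math; the 150 MB document size is a hypothetical, not taken from the question):

```python
import base64

# Base64 maps every 3 input bytes to 4 output characters.
assert len(base64.b64encode(b"\x00" * 300)) == 400

# Hypothetical document size for illustration: 150 MB of raw bytes.
raw_len = 150 * 1024 * 1024
b64_chars = 4 * ((raw_len + 2) // 3)   # characters in the base64 string
string_bytes = b64_chars * 2           # .NET chars are UTF-16: 2 bytes each

print(string_bytes / raw_len)          # ≈ 2.67 (exactly 8/3 here)
```

So a 150 MB file needs roughly 400 MB as an in-memory .NET string, on top of the original byte[], which is exactly the kind of allocation that fails on a fragmented 32-bit heap.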

That being said, there is no reason for you to manually encode your binary documents into base64, because XmlSerializer does this for you automatically. I.e. if I serialize the following class:

public class Document
{
    public string DocumentTitle { get; set; }
    public Code DocumentCategory { get; set; }
    public byte [] BinaryDocument { get; set; }
    public string DocumentTypeText { get; set; }
}

I get the following XML:

<Document>
  <DocumentTitle>my title</DocumentTitle>
  <DocumentCategory>Default</DocumentCategory>
  <BinaryDocument>AAECAwQFBgcICQoLDA0ODxAREhM=</BinaryDocument>
  <DocumentTypeText>document text type</DocumentTypeText>
</Document>

As you can see, BinaryDocument is base64 encoded. Thus you should be able to keep your binary documents in a more compact byte [] representation and still get the XML output you want.

Even better, under the covers, XmlWriter uses System.Xml.Base64Encoder to do this. This class encodes its inputs in chunks, thereby avoiding the excessive memory use and potential out-of-memory exceptions described above.
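The reason chunked encoding works: as long as each chunk's length is a multiple of 3 bytes, the per-chunk base64 outputs concatenate to exactly the one-shot encoding, so no large intermediate buffer is ever needed. A language-agnostic sketch of that property (Python, with made-up data):

```python
import base64, os

data = os.urandom(100_000)  # stand-in for a large file's bytes

# 6 * 1024 is a multiple of 3, so every chunk encodes to whole base64
# quadruplets and padding can only appear at the very end of the stream.
chunk = 6 * 1024
pieces = [base64.b64encode(data[i:i + chunk]) for i in range(0, len(data), chunk)]

assert b"".join(pieces) == base64.b64encode(data)
```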

dbc
  • Sorry to pull the answer. It seemed like it was working, but it turns out that the set of documents I was running against had changed while I wasn't paying attention. Updates in the original question. – Devin Goble Mar 29 '16 at 19:53
  • @kettch - in that case, can you create a [complete example](http://stackoverflow.com/help/mcve) of how to reproduce the problem? For instance, using fake data like `BinaryDocument = Enumerable.Range(0, 10000000).Select(b => unchecked((byte)b)).ToArray()` ? Or if you can't reproduce it with fake data, can you show how you load your file(s) into your `BinaryDocument`? – dbc Mar 29 '16 at 20:18
  • I reset this as the answer because it *is* the answer to the question I asked. I'll pursue the other issue in another question. – Devin Goble Mar 31 '16 at 14:23

I can't reproduce the problem you are having. Even with individual files as large as 267 MB to 1.92 GB, I'm not seeing any elements being skipped. The only problem I do see is that the temporary var ms = new MemoryStream(); eventually exceeds its 2 GB buffer limit, whereupon an exception is thrown. I replaced it with a direct file stream, and that problem went away:

using (var stream = File.Open(outputPath, FileMode.Create, FileAccess.ReadWrite))

That being said, your design will eventually run up against memory limits for a sufficiently large number of sufficiently large files, since you load all of them into memory before serializing. If this is happening, somewhere in your production code you may be catching and swallowing the OutOfMemoryException without realizing it, leading to the problem you are seeing.
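To see how a swallowed exception produces exactly this symptom (items silently vanishing from the collection), here is a toy sketch. It is hypothetical Python, not the poster's code; MemoryError stands in for OutOfMemoryException and the sizes are invented:

```python
def load_documents(sizes, limit=1_000_000):
    """Toy model: 'loading' any document bigger than limit raises MemoryError."""
    docs = []
    for size in sizes:
        try:
            if size > limit:
                raise MemoryError("allocation failed")
            docs.append(size)
        except MemoryError:
            pass  # swallowed: this document silently disappears from the output
    return docs

# Four documents go in, only the small ones come out -- with no visible error.
print(load_documents([150_000, 5_000_000, 2_000_000, 90_000]))  # [150000, 90000]
```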

As an alternative, I would suggest a streaming solution where you incrementally copy each file's contents to the XML output from within XmlSerializer by making your Document class implement IXmlSerializable:

public class Document : IXmlSerializable
{
    public string DocumentPath { get; set; }

    public string DocumentTitle
    {
        get
        {
            if (DocumentPath == null)
                return null;
            return Path.GetFileName(DocumentPath);
        }
    }

    const string DocumentTitleName = "DocumentTitle";
    const string BinaryDocumentName = "BinaryDocument";

    #region IXmlSerializable Members

    System.Xml.Schema.XmlSchema IXmlSerializable.GetSchema()
    {
        return null;
    }

    void ReadXmlElement(XmlReader reader)
    {
        if (reader.Name == DocumentTitleName)
            DocumentPath = reader.ReadElementContentAsString();
    }

    void IXmlSerializable.ReadXml(XmlReader reader)
    {
        reader.ReadXml(null, ReadXmlElement);
    }

    void IXmlSerializable.WriteXml(XmlWriter writer)
    {
        writer.WriteElementString(DocumentTitleName, DocumentTitle ?? "");
        if (DocumentPath != null)
        {
            try
            {
                using (var stream = File.OpenRead(DocumentPath))
                {
                    // Write the start element if the file was successfully opened
                    writer.WriteStartElement(BinaryDocumentName);
                    try
                    {
                        var buffer = new byte[6 * 1024];
                        int read;
                        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                            writer.WriteBase64(buffer, 0, read);
                    }
                    finally
                    {
                        // Write the end element even if an error occurred while streaming the file.
                        writer.WriteEndElement();
                    }
                }
            }
            catch (Exception ex)
            {
                // You could log the exception as an element or as a comment, as you prefer.
                // Log as a comment
                writer.WriteComment("Caught exception with message: " + ex.Message);
                writer.WriteComment("Exception details:");
                writer.WriteComment(ex.ToString());
                // Log as an element.
                writer.WriteElementString("ExceptionMessage", ex.Message);
                writer.WriteElementString("ExceptionDetails", ex.ToString());
            }
        }
    }

    #endregion
}

// test class
public class DocumentContainer
{
    public List<Document> DocumentCollection { get; set; }
}

public static class XmlSerializationExtensions
{
    public static void ReadXml(this XmlReader reader, Action<IList<XAttribute>> readXmlAttributes, Action<XmlReader> readXmlElement)
    {
        if (reader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("reader.NodeType != XmlNodeType.Element");

        if (readXmlAttributes != null)
        {
            var attributes = new List<XAttribute>(reader.AttributeCount);
            while (reader.MoveToNextAttribute())
            {
                attributes.Add(new XAttribute(XName.Get(reader.Name, reader.NamespaceURI), reader.Value));
            }
            // Move the reader back to the element node.
            reader.MoveToElement();
            readXmlAttributes(attributes);
        }

        if (reader.IsEmptyElement)
        {
            reader.Read();
            return;
        }

        reader.ReadStartElement(); // Advance to the first sub element of the wrapper element.

        while (reader.NodeType != XmlNodeType.EndElement)
        {
            if (reader.NodeType != XmlNodeType.Element)
                // Comment, whitespace
                reader.Read();
            else
            {
                using (var subReader = reader.ReadSubtree())
                {
                    while (subReader.NodeType != XmlNodeType.Element) // Read past XmlNodeType.None
                        if (!subReader.Read())
                            break;
                    if (readXmlElement != null)
                        readXmlElement(subReader);
                }
                reader.Read();
            }
        }

        // Move past the end of the wrapper element
        reader.ReadEndElement();
    }
}

Then use it as follows:

public static void SerializeFilesToXml(string directoryPath, string xmlPath)
{
    var docs = from file in Directory.GetFiles(directoryPath)
               select new Document { DocumentPath = file };
    var container = new DocumentContainer { DocumentCollection = docs.ToList() };

    using (var stream = File.Open(xmlPath, FileMode.Create, FileAccess.ReadWrite))
    using (var writer = XmlWriter.Create(stream, new XmlWriterSettings { Indent = true, IndentChars = " " }))
    {
        new XmlSerializer(container.GetType()).Serialize(writer, container);
    }

    Debug.WriteLine("Wrote " + xmlPath);
}

Using the streaming solution, when serializing 4 files of around 250 MB each, my memory use went up by 0.8 MB. Using the original classes, my memory went up by 1022 MB.

Update

If you need to write your XML to a memory stream, be aware that the C# MemoryStream has a hard maximum stream length of int.MaxValue (i.e. 2 GB), because the underlying storage is simply a byte array. In a 32-bit process the effective maximum length will be much smaller; see OutOfMemoryException while populating MemoryStream: 256MB allocation on 16GB system.

To check programmatically whether your process is actually running as 32-bit, see How to determine programmatically whether a particular process is 32-bit or 64-bit. To change to 64-bit, see What is the purpose of the “Prefer 32-bit” setting in Visual Studio 2012 and how does it actually work?.

If you are sure you are running in 64 bit mode and are still exceeding the hard size limits of a MemoryStream, perhaps see alternative to MemoryStream for large data volumes or MemoryStream replacement?.

dbc
  • It seems to work fine if I write to disk. However, I really need to write it to a memory stream. That appears to be when it fails. If I swap in a memorystream without any other changes, I start losing the large documents. – Devin Goble Mar 30 '16 at 21:22
  • @kettch - then this may be a new, unrelated question, as there is not in fact a property size limit with `XmlSerializer`. Even on a 64-bit process `MemoryStream` can hold no more than 2 GB of memory, because the underlying storage is a byte array. On a 32-bit process it's much less, see https://stackoverflow.com/questions/15595061/outofmemoryexception-while-populating-memorystream-256mb-allocation-on-16gb-sys. Make sure you're not running in 32 bit mode, see https://stackoverflow.com/questions/1953377 and https://stackoverflow.com/questions/12066638 – dbc Mar 30 '16 at 22:17