Deserializing UTF-8 encoded XML fails when BOM is present

Question

I have a really weird issue here: I'm building an interface to a third-party system which provides XML files (with a UTF-8 encoding) over an SFTP server.

I download those files in my C# code, and then I try to deserialize them into a C# object. For most of the files, this works quite nicely, but for some, it just keeps bombing out....

Imagine a DTO class like this:

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

and an XML file like this:

<?xml version="1.0" encoding="utf-8"?>
<Person>
    <FirstName>John</FirstName>
    <LastName>Doe</LastName>
    <Age>42</Age>
</Person>

What I'm doing on my side in my C# code is:

download the file content as a byte array from the SFTP server
extract a UTF-8 encoded string from that binary data
use that string representation for the deserialization process

Something like this:

// get bytes from SFTP server
byte[] content = _sftpClient.Download(fileName);

// convert content to a UTF-8 string
string contentAsString = Encoding.UTF8.GetString(content);

try
{
    // deserialize that string into a "Person" instance
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreComments = true;
    settings.IgnoreProcessingInstructions = true;
    settings.IgnoreWhitespace = true;
    settings.CheckCharacters = false;

    using (StringReader str = new StringReader(contentAsString))
    using (XmlReader xr = XmlReader.Create(str, settings))
    {
        XmlSerializer ser = new XmlSerializer(typeof(Person));

        if (ser.CanDeserialize(xr))
        {
            Person person = ser.Deserialize(xr) as Person;
        }
    }
}
catch (Exception exc)
{
    Console.WriteLine("ERROR: {0} - {1}", exc.GetType().Name, exc.Message);
}

Now I analyzed the files that worked, and those that didn't - and the difference is a three-byte prefix in the binary data (0xEF 0xBB 0xBF) - the UTF-8 encoded "Unicode BOM" (Byte-Order Mark).

I am aware of that BOM, and that's the reason I'm not using the binary data fetched from the SFTP Server directly. When I convert those types of files into the XML string contentAsString, this string appears to be identical - at least I can't see any difference at all.

But the files with the 3-byte BOM at the beginning (in the binary data) cause the deserialization to fail on this line

if (ser.CanDeserialize(xr))

with an error:

SystemException: Data at the root level is invalid. Line 1, position 1.

But how on earth does the string know / "preserve" that information about the 3-byte BOM? I was expecting that by turning the byte array into a UTF-8 encoded string, any differences would go away and the BOM should no longer be relevant...

Any ideas on how I can reliably deal with both files with or without the 3-byte BOM?

Possible duplicate - check out this answer: http://stackoverflow.com/questions/3104158/xmlreader-breaks-on-utf-8-bom — hoodaticus, Feb 01 '17 at 20:26
@hoodaticus it's not quite a duplicate - yes, the situation seems to be almost the same - but the accepted response is actually what I'm doing already, and it's **not working** for me (causes an exception if the BOM is present). That proposed solution is not solving my issue, unfortunately..... — marc_s, Feb 01 '17 at 20:33

score 4 · Accepted Answer · answered Feb 01 '17 at 21:10

Instead of creating a string from the byte[] and using that as the input for the XmlReader, use a MemoryStream:

        // get bytes from SFTP server
        byte[] content = _sftpClient.Download(fileName);

        try
        {
            XmlReaderSettings settings = new XmlReaderSettings();
            settings.IgnoreComments = true;
            settings.IgnoreProcessingInstructions = true;
            settings.IgnoreWhitespace = true;
            settings.CheckCharacters = false;

            using(var memoryStream = new MemoryStream(content))
            using (XmlReader xr = XmlReader.Create(memoryStream, settings))
            {
                XmlSerializer ser = new XmlSerializer(typeof(Person));

                if (ser.CanDeserialize(xr))
                {
                    Person person = ser.Deserialize(xr) as Person;
                }
            }
        }
        catch (Exception exc)
        {
            Console.WriteLine("ERROR: {0} - {1}", exc.GetType().Name, exc.Message);
        }
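
As for why the string version fails: as far as I know, Encoding.UTF8.GetString does not strip the BOM. It decodes those three bytes into the single character U+FEFF at the start of the string, which is invisible when printed, so the two strings look identical but are not. An XmlReader created over a stream detects the encoding and skips the BOM on its own. If a string is still needed for some reason, here is a minimal sketch of the difference; the class and method names (BomDemo, DecodeKeepingBom, DecodeSkippingBom) are made up for illustration and are not part of the original code:

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BomDemo
{
    // Decodes the bytes the way the question does.
    // A leading UTF-8 BOM (EF BB BF) survives as the character U+FEFF.
    public static string DecodeKeepingBom(byte[] content)
    {
        return Encoding.UTF8.GetString(content);
    }

    // Decodes through a StreamReader, which checks for the
    // encoding's preamble and skips it when present.
    public static string DecodeSkippingBom(byte[] content)
    {
        using (var memoryStream = new MemoryStream(content))
        using (var reader = new StreamReader(memoryStream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With DecodeSkippingBom, files with and without the BOM should produce the same string, so the original StringReader code would work unchanged. Passing the MemoryStream straight to XmlReader.Create, as above, is still the simpler option because it avoids the intermediate string entirely.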

Thanks - I'm amazed this works - I was **convinced** I needed to convert the byte array to a string to get rid of the differences with or without the BOM - but in the end, it's actually the other way around - using the byte stream directly works, using the "intermediary" string doesn't ..... but it works - so thanks a heap ! — marc_s, Feb 01 '17 at 21:29
