I have a really weird issue here: I'm building an interface to a third-party system which provides XML files (with a UTF-8 encoding) over an SFTP server.
I download those files in my C# code, and then I try to deserialize them into a C# object. For most of the files, this works quite nicely, but for some, it just keeps bombing out.
Imagine a DTO class like this:
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}
and an XML document like this:
<?xml version="1.0" encoding="utf-8"?>
<Person>
    <FirstName>John</FirstName>
    <LastName>Doe</LastName>
    <Age>42</Age>
</Person>
What I'm doing on my side in my C# code is:
- download the file content as a byte array from the SFTP server
- extract a UTF-8 encoded string from that binary data
- use that string representation for the deserialization process
Something like this:
// get bytes from SFTP server
byte[] content = _sftpClient.Download(fileName);
// convert content to a UTF-8 string
string contentAsString = Encoding.UTF8.GetString(content);
try
{
    // deserialize that string into a "Person" instance
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.IgnoreComments = true;
    settings.IgnoreProcessingInstructions = true;
    settings.IgnoreWhitespace = true;
    settings.CheckCharacters = false;
    using (StringReader str = new StringReader(contentAsString))
    using (XmlReader xr = XmlReader.Create(str, settings))
    {
        XmlSerializer ser = new XmlSerializer(typeof(Person));
        if (ser.CanDeserialize(xr))
        {
            Person person = ser.Deserialize(xr) as Person;
        }
    }
}
catch (Exception exc)
{
    Console.WriteLine("ERROR: {0} - {1}", exc.GetType().Name, exc.Message);
}
Now I analyzed the files that worked, and those that didn't - and the difference is a three-byte prefix in the binary data (0xEF 0xBB 0xBF) - the "Unicode BOM" (Byte-Order Mark).
I am aware of that BOM, and that's the reason I'm not using the binary data fetched from the SFTP Server directly. When I convert those types of files into the XML string contentAsString, this string appears to be identical - at least I can't see any difference at all.
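A minimal sketch of how the two strings could be compared beyond what is visible on screen. It relies on the fact that Encoding.UTF8.GetString decodes the three BOM bytes into the single character U+FEFF rather than dropping them. The sample byte arrays and the BomCheck helper are hypothetical and exist only for illustration:

```csharp
using System;
using System.Text;

public static class BomCheck
{
    // True when the string begins with U+FEFF, the character
    // that the bytes 0xEF 0xBB 0xBF decode to in UTF-8.
    public static bool StartsWithBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF';
    }

    public static void Main()
    {
        // Hypothetical sample data: "<a/>" with and without the UTF-8 BOM.
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E };
        byte[] withoutBom = { 0x3C, 0x61, 0x2F, 0x3E };

        string s1 = Encoding.UTF8.GetString(withBom);
        string s2 = Encoding.UTF8.GetString(withoutBom);

        // The strings render the same, but they differ in length and first character.
        Console.WriteLine("{0} chars, BOM: {1}", s1.Length, StartsWithBom(s1)); // 5 chars, BOM: True
        Console.WriteLine("{0} chars, BOM: {1}", s2.Length, StartsWithBom(s2)); // 4 chars, BOM: False
    }
}
```

Since U+FEFF is a zero-width character, most consoles and debugger views will not show it, which would be consistent with the strings looking the same.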
But the files with the 3-byte BOM at the beginning (in the binary data) cause the deserialization to fail on this line
if (ser.CanDeserialize(xr))
with an error:
SystemException: Data at the root level is invalid. Line 1, position 1.
But how on earth does the string know / "preserve" that information about the 3-byte BOM? I was expecting that by turning the byte array into a UTF-8 encoded string, any differences would go away and the BOM should no longer be relevant...
Any ideas on how I can reliably deal with files both with and without the 3-byte BOM?
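One direction that might be worth trying, shown here only as a sketch under assumptions: hand the raw bytes to the XmlReader through a MemoryStream instead of decoding them to a string first, so the reader does its own encoding detection and consumes a leading BOM if one is present. The PersonXml helper and the sample data are hypothetical, and the SFTP download is replaced by in-memory bytes:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

public static class PersonXml
{
    // Deserialize straight from the bytes; no intermediate string is created.
    public static Person FromBytes(byte[] content)
    {
        using (MemoryStream ms = new MemoryStream(content))
        using (XmlReader xr = XmlReader.Create(ms))
        {
            XmlSerializer ser = new XmlSerializer(typeof(Person));
            return (Person)ser.Deserialize(xr);
        }
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>" +
                     "<Person><FirstName>John</FirstName>" +
                     "<LastName>Doe</LastName><Age>42</Age></Person>";

        // Stand-ins for the downloaded files: one without and one with the BOM.
        byte[] plain = Encoding.UTF8.GetBytes(xml);
        byte[] withBom = Encoding.UTF8.GetPreamble().Concat(plain).ToArray();

        Console.WriteLine(FromBytes(plain).FirstName);
        Console.WriteLine(FromBytes(withBom).FirstName);
    }
}
```

If the string-based path has to stay, an alternative under the same assumptions would be to remove a leading U+FEFF with contentAsString.TrimStart('\uFEFF') before creating the StringReader.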