Parsing and removing BOM/Preamble from XML via filesystem

Question

I am processing XBRL files and ran into a number of them that have a Byte-Order-Mark (BOM) at the start. If I manually remove it, I can process the file without any issue.

I have made several attempts to remove the BOM in code from the start of the XML files I am reading, all of which failed.

This is the error message I am receiving:

Data at the root level is invalid. Line 1, position 1.

Originally I was using XDocument.Load(filename), which failed with this error. I then modified the code following the advice in "Parsing xml string to an xml document fails if the string begins with <?xml... ?> section", but it still fails with the same error.

void Main()
{
    XDocument doc;
    var @filename = @"C:\accounts\toprocess\2008\Prod224_8998_00741575_20080630.xml";
    byte[] file = File.ReadAllBytes(filename);
    using (MemoryStream memory = new MemoryStream(file))
    {
        using (XmlTextReader oReader = new XmlTextReader(memory))
        {
            doc = XDocument.Load(oReader);
        }
    }
}

The XML file can be found here: http://s000.tinyupload.com/download.php?file_id=92333278767554773703&t=9233327876755477370347742

All you have to do is read six bytes from the memory stream: `byte[] buffer = new byte[6]; memory.Read(buffer, 0, 6);` — jdweng, Feb 26 '19 at 10:05
Your XML file is corrupt, probably due to a bug that misused character encodings. It may have had a BOM that got corrupted. It doesn't have one anymore. It might be that was the only corruption but who knows? Send it back. Get the upstream process fixed. — Tom Blodget, Feb 26 '19 at 17:56

score 3 · Accepted Answer · answered Feb 26 '19 at 09:50

C3 AF C2 BB C2 BF looks to be a double UTF-8 encoded BOM. The UTF-8 encoding of the BOM is EF BB BF. If you were to treat each of those three bytes as a separate character and UTF-8 encode them again, you'd end up with the six-byte sequence that you're seeing.
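
To illustrate the double encoding described above, here is a minimal sketch. It assumes the "extended ASCII" step was ISO-8859-1 (Latin-1), which is a guess about the upstream process; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class DoubleEncodedBomDemo
{
    // Reproduces the suspected corruption: bytes are misread as
    // single-byte characters and then encoded as UTF-8 a second time.
    public static byte[] DoubleEncode(byte[] utf8Bytes)
    {
        // Step 1 (assumption): each byte is misread as one Latin-1 character,
        // so EF BB BF becomes U+00EF, U+00BB, U+00BF.
        string misread = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);

        // Step 2: those characters are encoded as UTF-8.
        // GetBytes does not prepend a BOM of its own.
        return Encoding.UTF8.GetBytes(misread);
    }
}
```

Calling `BitConverter.ToString(DoubleEncodedBomDemo.DoubleEncode(new byte[] { 0xEF, 0xBB, 0xBF }))` gives `C3-AF-C2-BB-C2-BF`, the six bytes found at the start of the affected files.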

So the document you have is broken. Something is taking a document containing a UTF-8 BOM and treating it as extended ASCII. If you can't get the documents fixed at source, I'd be inclined to look for that specific sequence at the start of the file and strip it if present.

If the documents in question use other extended ASCII characters, there's a good chance they'll be broken too.

Nice find. Seen it before (-: ? You can solve the "other extended ASCII characters" by using a `StreamReader(stream, Encoding.UTF8)` — H H, Feb 26 '19 at 09:58

score 2 · Answer 2 · answered Feb 26 '19 at 08:58

The sequence C3 AF C2 BB C2 BF does not look like any BOM.

You probably should investigate what it is, if it is consistent (in length) etc.

As it is, you can simply skip the first 6 bytes:

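// Skips 6 bytes unconditionally: only use this on files known to start with that sequence.
using (var stream = File.OpenRead(fileName))   // read-only access is enough
{
    stream.Seek(6, SeekOrigin.Begin);          // skip the 6 unwanted bytes
    var doc = XDocument.Load(stream);          // parsing starts at the current position
    // ...use it
}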

H H

Adding stream.Seek(6, SeekOrigin.Begin); certainly allows me to skip that sequence, and parse the affected XML files. Any idea how I would check for that particular sequence in code? – David Wilson Feb 26 '19 at 10:03

See this answer for comparing byte arrays: https://stackoverflow.com/questions/43289/comparing-two-byte-arrays-in-net In short: read 6 bytes and, when there is no match, do a `Seek(0, SeekOrigin.Begin)`. – H H Feb 26 '19 at 10:08
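
A minimal sketch of the approach from these comments: read the first six bytes, compare them with the known sequence, and rewind when they do not match. `BrokenBomLoader` and `Load` are hypothetical names, and the stream is assumed to be seekable (for example a FileStream or MemoryStream).

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class BrokenBomLoader
{
    // The double-encoded UTF-8 BOM identified in the accepted answer.
    private static readonly byte[] BrokenBom = { 0xC3, 0xAF, 0xC2, 0xBB, 0xC2, 0xBF };

    // Loads XML from a seekable stream, skipping the broken sequence if present.
    public static XDocument Load(Stream stream)
    {
        long start = stream.Position;
        var prefix = new byte[BrokenBom.Length];

        // Read may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        int read = 0;
        while (read < prefix.Length)
        {
            int n = stream.Read(prefix, read, prefix.Length - read);
            if (n == 0) break;
            read += n;
        }

        bool hasBrokenBom = read == BrokenBom.Length && prefix.SequenceEqual(BrokenBom);
        if (!hasBrokenBom)
        {
            // No match: rewind so the parser sees the file from its start.
            stream.Seek(start, SeekOrigin.Begin);
        }

        // On a match the stream is already positioned just past the sequence.
        return XDocument.Load(stream);
    }
}
```

It could be used as `using (var stream = File.OpenRead(fileName)) { var doc = BrokenBomLoader.Load(stream); }`. Files with a normal UTF-8 BOM, or none at all, take the rewind path and are parsed as usual.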
