UTF-8 remove BOM

Question

I have an XML file with a UTF-8 BOM at the beginning of the file, which prevents me from using existing code that reads UTF-8 files.

How can I easily remove the BOM from the XML file?

Here I have a variable, xmlfile, of type byte[] that I convert to a string. xmlfile contains the entire XML file.

 byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);

 string xmlstring = Encoding.UTF8.GetString(xmlfile);

The code you've shown doesn't use `XMLReader` at all - does that code fail, or is it some of the code you haven't shown us? What does the exception look like? I'd expect XMLReader to handle BOMs anyway... — Jon Skeet, Nov 08 '21 at 15:41
Sorry, good question. No, XmlReader is just part of a function that reads over the content of the XML file to find namespaces. That works fine; my problem is that I can't read UTF-8 BOM files because of the characters at the front of the file. So I need to remove those so I can use XmlReader, either from xmlfile as bytes or from xmlstring as a string. — Beefybanana, Nov 08 '21 at 15:46
Please edit your question to make it *much* clearer. Ideally, provide a [mcve]. "i cant read utf-8bom files" really doesn't give us *nearly* enough information about the error you're facing. See https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/ for suggestions on how to write a good question. — Jon Skeet, Nov 08 '21 at 15:46
Don't use `Encoding.UTF8.GetString`; instead use a `StreamReader`, which consumes the BOM automatically, as shown in [Encoding.UTF8.GetString doesn't take into account the Preamble/BOM](https://stackoverflow.com/q/11701341/3744182) and [How do I ignore the UTF-8 Byte Order Marker in String comparisons?](https://stackoverflow.com/a/2915239/3744182). Even better, you could pass the `StreamReader` directly to the `XmlReader` and avoid the waste of the intermediate `xmlstring` representation. Or pass a `MemoryStream` containing the bytes to the `XmlReader`, which should also consume the BOM. — dbc, Nov 08 '21 at 15:47
The XML file is stored in xmlfile and later converted to the string xmlstring. Can you remove the BOM characters from either? — Beefybanana, Nov 08 '21 at 15:48
Use `myReader.GetXmlReader`. And don't forget to dispose everything with `using`. This is obviously assuming you are actually storing it in the database as `xml` type, because you are, right.....? — Charlieface, Nov 08 '21 at 17:17
Great stuff, dbc :) That worked well! To fix my problem I just put the content into a MemoryStream and then read it with a StreamReader, and it removed the BOM. — Beefybanana, Nov 08 '21 at 19:19
@Beefybanana - glad to help. Close as a duplicate of [Encoding.UTF8.GetString doesn't take into account the Preamble/BOM](https://stackoverflow.com/q/11701341/3744182) then? — dbc, Nov 08 '21 at 19:23
Should I close it? I did change the text. It seems very readable now. — Beefybanana, Nov 08 '21 at 19:37

score 2 · Answer 1 · answered Nov 08 '21 at 19:25

Great stuff, dbc :) That worked well with your link. My problem was a UTF-8 BOM at the beginning of my XML file. I simply added a MemoryStream and a StreamReader, which automatically stripped the BOM from the XML bytes (htmlbytes). Really easy to implement for existing code.

 byte[] htmlbytes = (byte[])myReader["xmlMelding"];
 using var memorystream = new MemoryStream(htmlbytes);
 // StreamReader detects and skips a leading UTF-8 BOM by default.
 using var reader = new StreamReader(memorystream);
 var s = reader.ReadToEnd();
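
As dbc's comment suggests, the intermediate string can be skipped entirely by handing the bytes to an `XmlReader` through a `MemoryStream`. The following is only a minimal sketch of that idea, not the asker's actual code: the class `BomXmlSketch` and the helper `ReadRootName` are hypothetical names, and the sketch assumes the goal is simply to read the root element of the document.

```csharp
using System;
using System.IO;
using System.Xml;

public static class BomXmlSketch
{
    // Hypothetical helper: returns the name of the root element.
    // XmlReader detects the encoding from the stream itself,
    // so a leading UTF-8 BOM does not need to be removed first.
    public static string ReadRootName(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = XmlReader.Create(stream))
        {
            // Skip the XML declaration, comments and whitespace.
            reader.MoveToContent();
            return reader.Name;
        }
    }
}
```

In the question's terms, this would be called as `BomXmlSketch.ReadRootName(xmlfile)`, with the namespace-scanning logic taking the place of `MoveToContent()`.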

Remy Lebeau · Answer 2 · 2022-12-27T21:34:56.647

1

Encoding.GetString() has an overload that accepts an offset into the byte[] array. Simply check whether the array starts with a BOM, and if so, skip it when calling GetString(), e.g.:

byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);
int offset = 0;

if (xmlfile.Length >= 3 &&
    xmlfile[0] == 0xEF &&
    xmlfile[1] == 0xBB &&
    xmlfile[2] == 0xBF)
{
    offset += 3;
}

string xmlstring = Encoding.UTF8.GetString(xmlfile, offset, xmlfile.Length - offset);
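
The same check can be written without hard-coding the three bytes by comparing against `Encoding.UTF8.GetPreamble()`. Since the question also asks about removing the BOM from the already decoded `xmlstring`, the sketch below covers that case too: after `GetString`, the BOM shows up as a single leading U+FEFF character. The class `BomStripSketch` and its method names are hypothetical and only meant as an illustration.

```csharp
using System;
using System.Text;

public static class BomStripSketch
{
    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if present.
    public static string DecodeWithoutBom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = StartsWith(bytes, bom) ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    // Hypothetical helper: removes a BOM that was already decoded as U+FEFF.
    // A char comparison is used on purpose; culture-sensitive string
    // comparisons tend to ignore this character.
    public static string StripBom(string text)
    {
        return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
    }

    // True if data begins with every byte of prefix.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

With the question's variables, either `BomStripSketch.DecodeWithoutBom(xmlfile)` or `BomStripSketch.StripBom(xmlstring)` would yield the text without the marker.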

edited Dec 27 '22 at 21:34

answered Nov 09 '21 at 00:36



Changing the last byte check from `xmlfile[1] == 0xBF` to `xmlfile[2] == 0xBF` works fine. Thanks – stefmex Dec 27 '22 at 18:18
