
I have an XML file with a UTF-8 BOM at the beginning, which prevents me from using existing code that reads UTF-8 files.

How can I remove the BOM from the XML file in an easy way?

Here I have a variable xmlfile of type byte[] that I convert to a string. xmlfile contains the entire XML file.

 byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);

 string xmlstring = Encoding.UTF8.GetString(xmlfile);
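To make the symptom concrete, here is a minimal sketch. The sample bytes and the `BomSymptom` class name are made up for illustration. `Encoding.UTF8.GetString` decodes the three BOM bytes into the character U+FEFF and keeps it as the first character of the resulting string:

```csharp
using System;
using System.Text;

static class BomSymptom
{
    // Hypothetical sample data: UTF-8 BOM (EF BB BF) followed by "<a/>".
    public static readonly byte[] Sample = { 0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E };

    public static string Decode(byte[] bytes)
    {
        // GetString does not strip the preamble; the BOM is decoded as U+FEFF.
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With this sample, the first character of `BomSymptom.Decode(BomSymptom.Sample)` is `'\uFEFF'` rather than `'<'`.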
Remy Lebeau
Beefybanana
  • The code you've shown doesn't use `XMLReader` at all - does that code fail, or is it some of the code you haven't shown us? What does the exception look like? I'd expect XMLReader to handle BOMs anyway... – Jon Skeet Nov 08 '21 at 15:41
  • Sorry, good question. No, XmlReader is just part of a function that reads over the content of the XML file to find namespaces. That works fine; my problem is that I can't read UTF-8 files with a BOM because of the characters at the front of the file, so I need to remove them before I can use XmlReader. So it comes down to removing the BOM from either xmlfile (bytes) or xmlstring (string). – Beefybanana Nov 08 '21 at 15:46
  • Please edit your question to make it *much* clearer. Ideally, provide a [mcve]. "i cant read utf-8bom files" really doesn't give us *nearly* enough information about the error you're facing. See https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/ for suggestions on how to write a good question. – Jon Skeet Nov 08 '21 at 15:46
  • Don't use `Encoding.UTF8.GetString`; instead use a `StreamReader`, which consumes the BOM automatically, as shown in [Encoding.UTF8.GetString doesn't take into account the Preamble/BOM](https://stackoverflow.com/q/11701341/3744182) and [How do I ignore the UTF-8 Byte Order Marker in String comparisons?](https://stackoverflow.com/a/2915239/3744182). Even better, you could pass the `StreamReader` directly to the `XmlReader` and avoid the waste of the intermediate `xmlstring` representation. Or pass a `MemoryStream` containing the bytes to the `XmlReader`, which should also consume the BOM. – dbc Nov 08 '21 at 15:47
  • The XML file is stored in xmlfile and later converted to the string xmlstring. Can you remove the BOM characters from either? – Beefybanana Nov 08 '21 at 15:48
  • Use `myReader.GetXmlReader`. And don't forget to dispose everything with `using`. This is obviously assuming you are actually storing it in the database as `xml` type, because you are, right.....? – Charlieface Nov 08 '21 at 17:17
  • Great stuff, dbc :) That worked well! To fix my problem I just put the content into a MemoryStream and then read it with a StreamReader, and that removed the BOM. – Beefybanana Nov 08 '21 at 19:19
  • @Beefybanana - glad to help. Close as a duplicate of [Encoding.UTF8.GetString doesn't take into account the Preamble/BOM](https://stackoverflow.com/q/11701341/3744182) then? – dbc Nov 08 '21 at 19:23
  • I should close it? I did change the text; it seems very readable now. – Beefybanana Nov 08 '21 at 19:37

2 Answers


Great stuff, dbc :) That worked well with your link. My problem was a UTF-8 BOM at the beginning of my XML file. To fix it, I simply wrapped the bytes in a MemoryStream and read them with a StreamReader, which automatically strips the BOM from the XML data (htmlbytes). Really easy to implement in existing code.

 byte[] htmlbytes = (byte[])myReader["xmlMelding"];
 string s;
 using (var memorystream = new MemoryStream(htmlbytes))
 using (var reader = new StreamReader(memorystream)) // detects the UTF-8 BOM by default and skips it
     s = reader.ReadToEnd();
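As suggested in the comments, the intermediate string can probably be skipped altogether by handing the stream straight to an `XmlReader`, which detects the encoding (including a UTF-8 BOM) itself. The following is only a sketch under that assumption; the `XmlBomDemo` class and the `ReadRootName` helper are hypothetical names, not part of the original code:

```csharp
using System;
using System.IO;
using System.Xml;

static class XmlBomDemo
{
    // Hypothetical helper: returns the name of the root element,
    // parsing directly from the raw bytes (with or without a BOM).
    public static string ReadRootName(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = XmlReader.Create(stream))
        {
            // Skips the XML declaration and whitespace, stopping at the root element.
            reader.MoveToContent();
            return reader.Name;
        }
    }
}
```

In the code from the question this would be called with the bytes read from the database, e.g. `XmlBomDemo.ReadRootName(htmlbytes)`.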
Beefybanana

Encoding.GetString() has an overload that accepts an offset and a count into the byte[] array. Simply check whether the array starts with a BOM, and if so, skip it when calling GetString(), e.g.:

byte[] xmlfile = ((Byte[])myReader["xmlSQL"]);
int offset = 0;

if (xmlfile.Length >= 3 &&
    xmlfile[0] == 0xEF &&
    xmlfile[1] == 0xBB &&
    xmlfile[2] == 0xBF)
{
    offset += 3;
}

string xmlstring = Encoding.UTF8.GetString(xmlfile, offset, xmlfile.Length - offset);
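A variation on the same idea, offered as a sketch rather than a required change: instead of hard-coding the three bytes, the check can compare against `Encoding.UTF8.GetPreamble()`, which returns the UTF-8 BOM bytes. The `BomHelper` class and `DecodeUtf8SkippingBom` method names are hypothetical:

```csharp
using System;
using System.Text;

static class BomHelper
{
    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if present.
    public static string DecodeUtf8SkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        int offset = 0;

        if (bytes.Length >= preamble.Length)
        {
            bool hasBom = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i])
                {
                    hasBom = false;
                    break;
                }
            }
            if (hasBom)
            {
                offset = preamble.Length;
            }
        }

        // Same overload as above: decode from offset to the end of the array.
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

With the variable from the question, this would be used as `string xmlstring = BomHelper.DecodeUtf8SkippingBom(xmlfile);`.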
Remy Lebeau