
I tried this, but it didn't work: I want to encode without a BOM, but even with the option set to `false`, the output is still encoded as UTF-8 with a BOM.

Here is my code:

    System.Text.Encoding outputEnc = new System.Text.UTF8Encoding(false);
    return File(outputEnc.GetBytes(" <?xml version=\"1.0\" encoding=\"utf-8\"?>" + xmlString), "application/xml", id);
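
A quick way to verify whether the resulting bytes actually begin with a UTF-8 BOM is to look for the byte sequence `0xEF 0xBB 0xBF` at the start. A minimal, self-contained sketch; the class name and sample string below are illustrative, not from the question:

    using System;
    using System.Text;

    class BomCheck
    {
        static void Main()
        {
            // encoderShouldEmitUTF8Identifier: false, so GetBytes emits no BOM.
            var enc = new UTF8Encoding(false);
            byte[] bytes = enc.GetBytes("<?xml version=\"1.0\" encoding=\"utf-8\"?><root/>");

            // A UTF-8 BOM is the three-byte sequence 0xEF 0xBB 0xBF at the start.
            bool hasBom = bytes.Length >= 3
                && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
            Console.WriteLine(hasBom ? "BOM present" : "No BOM"); // prints "No BOM"
        }
    }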
lumaluis
  • @DStanley: This question doesn't seem to be a duplicate. The accepted answer in the other question points out that `false` must be passed to the `UTF8Encoding` constructor, which is exactly what is done in this question. Hence, the other question doesn't help. Nominated for reopening. – O. R. Mapper Sep 09 '14 at 19:36
  • @O.R.Mapper Agree - I didn't catch that in the code sample. – D Stanley Sep 09 '14 at 19:39
  • How do you check whether it is encoded with a BOM? –  Sep 09 '14 at 19:39
  • @DStanley: To be clear: As that `false` *should* work, I suspect there's something else at work here; maybe the OP is somehow running an old version of their application. But as long as that's not confirmed, this question is different. – O. R. Mapper Sep 09 '14 at 19:54
  • I check with Notepad++ – lumaluis Sep 09 '14 at 20:49

1 Answer


This question is more than two years old, but I've found the answer. The reason you were seeing a BOM in the output is that there's a BOM in your input. What appears to be a space at the start of your XML declaration is actually a BOM followed by a space. To prove it, select the text `" <` from the code in your question (the opening double-quote, the space following it, and the opening `<` character) and paste it into any tool that tells you Unicode codepoints. For example, pasting that text into http://www.babelstone.co.uk/Unicode/whatisit.html gave me the following result:

U+0022 : QUOTATION MARK
U+FEFF : ZERO WIDTH NO-BREAK SPACE [ZWNBSP] (alias BYTE ORDER MARK [BOM])
U+0020 : SPACE [SP]
U+003C : LESS-THAN SIGN

You can also copy and paste from the `" <` that I put in this answer: I copied those characters from your question, so they contain the invisible BOM immediately before the space character.
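
You don't need a website for this check, either; a short sketch along these lines dumps the codepoints of any string (the literal below uses the escape `\uFEFF` to stand in for the invisible BOM):

    using System;

    class CodepointDump
    {
        static void Main()
        {
            // The same four characters as above: quote, BOM, space, less-than.
            string s = "\"\uFEFF <";
            for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            {
                Console.WriteLine($"U+{char.ConvertToUtf32(s, i):X4}");
            }
            // Output: U+0022, U+FEFF, U+0020, U+003C
        }
    }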

This is why I often refer to the BOM as a BOM(b) -- because it sits there silently, hidden, waiting to blow up on you when you least expect it. You were using `System.Text.UTF8Encoding(false)` correctly. It didn't add a BOM, but the source that you copied and pasted your XML from contained a BOM, so you got one in your output anyway because you had one in your input.
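
So the fix is to clean the input rather than fight the encoder. A minimal sketch, assuming the XML arrives as a string (the method name is mine, not from the question):

    using System.Text;

    static byte[] EncodeXmlWithoutBom(string xml)
    {
        // Drop a leading U+FEFF if the input string itself carries a BOM.
        if (xml.Length > 0 && xml[0] == '\uFEFF')
            xml = xml.Substring(1);

        // UTF8Encoding(false) then adds no BOM of its own.
        return new UTF8Encoding(false).GetBytes(xml);
    }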

Personal rant: It's a very good idea to leave BOMs out of your UTF-8 encoded text. However, some broken tools (Microsoft, I'm looking at you since you're the ones who made most of them) will misinterpret text if it doesn't contain a BOM, so adding a BOM to UTF-8 encoded text is sometimes necessary. But it should really be avoided as much as possible. UTF-8 is now the de facto default encoding for the Internet, so any text file whose encoding is unknown should be parsed as UTF-8 first, falling back to "legacy" encodings like Windows-1252, Latin-1, etc. only if parsing the document as UTF-8 fails.
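
If you want that "parse as UTF-8 first, fall back to legacy" behavior in code, one way to express it is with a strict decoder (`throwOnInvalidBytes: true`); the ISO-8859-1 fallback below is illustrative, not prescriptive:

    using System.Text;

    static string DecodeUtf8First(byte[] bytes)
    {
        // Strict decoder: throwOnInvalidBytes = true makes invalid sequences
        // raise DecoderFallbackException instead of becoming U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; fall back to a legacy single-byte encoding.
            // ISO-8859-1 here is illustrative; use whatever fits your data.
            return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        }
    }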

rmunn
  • I'm sorry, but this rant is terrible advice in practice, especially on (but not limited to) Windows. It optimizes to avoid this rare case, at the cost of much more common cases, e.g. https://twitter.com/curlyquotefails . Reparsing an entire text is often impractical, and I haven't found many filesystems that reliably & affirmatively store the encoding external to the file data. – brianary Nov 07 '17 at 06:34
  • When I see the characteristic `’` (or other sequences) that show a UTF-8 character being misparsed as Latin-1, I don't think "Oh, they should have used a BOM to avoid the misparse". I think "Oh, there's a programmer who has failed Unicode 101". A quick Google search led me to [this list of encoding frequency on the Web](https://w3techs.com/technologies/overview/character_encoding/all), where UTF-8 is found as being used in 90% of sites. I don't know their survey methods, but on the Web, parsing as UTF-8 should *always* be the default. – rmunn Nov 07 '17 at 08:44
  • Also, on a personal-experience level, I tend to find BOM errors harder to track down because they're invisible. Incorrectly-encoded UTF-8 stands out immediately, so the bug tends to be noticed & fixed quickly. So although I think I agree with you that BOM errors are rarer than wrong-encoding errors, I still maintain that avoiding the BOM is the best default behavior, because you'll be avoiding invisible, hard-to-track-down errors in favor of immediately-visible errors that are much easier to spot and fix. (Though granted, sometimes those errors are in *other people's code* that you can't fix.) – rmunn Nov 07 '17 at 08:46
  • Parsing based on statistical analysis of the data or on quick assumptions isn't working, on a massive scale. It is also much slower and more expensive. Why blame the programmer for data pasted from Word, or for platforms, frameworks, libraries, and tools of varying ages and quality, when you could just as easily blame them for not handling a BOM (signature) correctly? Why optimize for the rare case with less user impact? Especially when anything involving Windows does not reliably handle text without a BOM (see the old "bush hid the facts" bug). – brianary Nov 07 '17 at 14:08