I'm using the following code:

File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetBytes("THIS IS A TEST"));

In theory this should write a UTF-8 file, but I just get an ANSI file. I also tried this, just to be especially verbose:

File.WriteAllBytes("c:\\test.xml", ASCIIEncoding.Convert(ASCIIEncoding.ASCII, UTF8Encoding.UTF8, Encoding.UTF8.GetBytes("THIS IS A TEST")));

Still the same issue though.

I am testing the output files by loading them in TextPad, which reads the format correctly (I tested with a sample file, as I know these things can be a bit weird sometimes).

marc_s
Tony Cheetham
  • If your string only contains ASCII, then ANSI and UTF8 are interchangeable. Try adding some accented characters into the string and then see what happens. – Patrick Roberts Apr 10 '18 at 16:27
  • Very unclear what do you mean... Can you please explain what "UTF8 file" means to you? – Alexei Levenkov Apr 10 '18 at 16:27
  • File.WriteAllBytes does *not* write a BOM that identifies the file as containing utf8 encoded text. Consider StreamWriter instead. – Hans Passant Apr 10 '18 at 16:34
  • Alexei - I open the outputted file in Textpad, and it tells me what encoding the file is in, and it shows ANSI for the output from writeallbytes. If I save that file in textpad/notepad as utf-8, then re-load it, it shows as utf-8. The desired output would be the utf-8 formatted file. – Tony Cheetham Apr 10 '18 at 16:34
  • In case anyone wanders across this, and wants to write a UTF-8 string directly to a file with the BOM intact, you should use the preamble to generate the file header and merge it with your string. File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetPreamble().Concat(Encoding.UTF8.GetBytes("THIS IS A TEST")).ToArray()); – Tony Cheetham Apr 10 '18 at 17:26
  • @tonyenkiducx questionable suggestion: if you need to write text - `File.WriteAllText(@"c:\temp.txt", "test", Encoding.UTF8);` is much easier, if you need to write XML - use XML classes. (it is very unlikely for average person to correctly construct XML with string concatenation/manual writing... and it would be much easier for others to understand code if regular .Net XML classes are used) – Alexei Levenkov Apr 10 '18 at 18:18
  • "Textpad…tells me what encoding the file is in". No, a program cannot tell you what character encoding a text file was written with. Only the writer knows and whoever is told what the writer tells. A program can rule out some encodings, make probabilistic measurements and add the author's own preference, including the inexplicably strong preference for saying ANSI when UTF-8 is, by the contents, equally likely. – Tom Blodget Apr 10 '18 at 20:51
  • @AlexeiLevenkov I specifically mentioned write by bytes, because that is what I need. – Tony Cheetham Apr 12 '18 at 12:34

1 Answer

WriteAllBytes isn't ignoring the encoding; rather, you already did the encoding when you called GetBytes. The entire point of WriteAllBytes is that it writes bytes. Bytes don't have an encoding: encoding is the process of converting from text (a string here) to bytes (a byte[] here).

UTF-8 is identical to ASCII for all ASCII characters - i.e. 0-127. All of "THIS IS A TEST" is pure ASCII, so the UTF-8 and ASCII for that are identical.
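To make that concrete, here is a minimal sketch that compares the two encodings byte for byte. The class name and the accented sample string ("caf\u00E9") are just illustrative assumptions, not anything from the question; only the standard `System.Text.Encoding` calls are used.

```csharp
using System;
using System.Linq;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        const string ascii = "THIS IS A TEST";

        byte[] asAscii = Encoding.ASCII.GetBytes(ascii);
        byte[] asUtf8 = Encoding.UTF8.GetBytes(ascii);

        // Pure ASCII text encodes to the same bytes either way.
        Console.WriteLine(asUtf8.Length);                  // 14
        Console.WriteLine(asAscii.SequenceEqual(asUtf8));  // True

        // Hypothetical sample string: U+00E9 (e with acute) is outside ASCII.
        const string accented = "caf\u00E9";
        byte[] accentedUtf8 = Encoding.UTF8.GetBytes(accented);

        // U+00E9 takes two bytes in UTF-8 (0xC3 0xA9), so 4 chars become 5 bytes.
        Console.WriteLine(accentedUtf8.Length);                  // 5
        Console.WriteLine(BitConverter.ToString(accentedUtf8));  // 63-61-66-C3-A9
    }
}
```

Only once the string contains something outside 0-127 do the UTF-8 bytes stop looking like ASCII, which is probably why an editor that guesses the encoding reports the original file as ANSI.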

Marc Gravell
  • @AlexeiLevenkov in my entire programming career, I think I've seen someone *actually use* a UTF-8 BOM about *twice*, and at least one of them (possibly both) was an error that was causing a bug because the consuming code didn't expect it :) But yes, you could be right, and there *is* such a thing - `new UTF8Encoding(true).GetPreamble()` – Marc Gravell Apr 10 '18 at 16:30
  • Perhaps I need to re-word my question. I need to create a MemoryStream that contains a UTF-8 encoded string; I was using WriteAllBytes to test the output. I assumed that when encoding to bytes and then saving to a memory stream the string would retain its UTF-8 formatting, but it doesn't seem to. I'll write another question, as this seems like a good one to leave here. – Tony Cheetham Apr 10 '18 at 16:36
  • Also `Encoding.UTF8.GetPreamble()` gives the BOM sequence. However, `(new UTF8Encoding()).GetPreamble()` gives an empty (length zero) array. – Jeppe Stig Nielsen Apr 10 '18 at 17:04
  • That makes sense @JeppeStigNielsen. File.WriteAllBytes("c:\\test.xml", Encoding.UTF8.GetPreamble().Concat(Encoding.UTF8.GetBytes("THIS IS A TEST")).ToArray()); – Tony Cheetham Apr 10 '18 at 17:27
  • @tonyenkiducx I think a key question here is: what bytes did you expect? Because I would expect 14 bytes that happen to be identical to the ASCII. – Marc Gravell Apr 11 '18 at 03:42
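
The preamble behaviour discussed in the comments above can be checked with a short sketch. The class name and the temp-file path are assumptions made up for this example; the calls themselves (`GetPreamble`, `File.WriteAllBytes`, `File.WriteAllText`) are the ones already mentioned in the thread.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

class BomDemo
{
    static void Main()
    {
        byte[] payload = Encoding.UTF8.GetBytes("THIS IS A TEST");

        // Encoding.UTF8 is configured to provide a BOM: EF BB BF.
        byte[] bom = Encoding.UTF8.GetPreamble();
        Console.WriteLine(BitConverter.ToString(bom));               // EF-BB-BF

        // The parameterless UTF8Encoding constructor yields an empty preamble.
        Console.WriteLine(new UTF8Encoding().GetPreamble().Length);  // 0

        // Hypothetical output path, chosen only for this sketch.
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.txt");

        // Byte route: prepend the preamble by hand, then write raw bytes.
        File.WriteAllBytes(path, bom.Concat(payload).ToArray());
        Console.WriteLine(new FileInfo(path).Length);                // 17

        // Text route: passing Encoding.UTF8 makes WriteAllText emit the preamble.
        File.WriteAllText(path, "THIS IS A TEST", Encoding.UTF8);
        Console.WriteLine(new FileInfo(path).Length);                // 17

        File.Delete(path);
    }
}
```

Both routes should give 3 BOM bytes plus the 14 payload bytes. Whether the BOM is wanted at all depends on what consumes the file, as the comments point out.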