When I use IO.File.ReadAllText(path) or ReadAllText(path, System.Text.Encoding.UTF8) to read a text file saved in ANSI encoding, non-Latin characters aren't displayed correctly.

So, I decided to use Encoding.Default. It worked just fine, but I see recommendations against using it everywhere (like here and here) because it "will only guarantee that all UTF-7 character sets will be read correctly". Also Microsoft says:

Gets an encoding for the operating system's current ANSI code page.

However, it seems to me that it can recognize a file in any encoding. I tested it on a file containing Chinese, Japanese, and Arabic characters (the file is saved in UTF-8 encoding), and the file was displayed correctly.

Code used:

Dim loadedText As String = IO.File.ReadAllText(path, System.Text.Encoding.Default)
MessageBox.Show(loadedText, "utf8")

Output: [screenshot of the message box; the text is displayed correctly]

So, my questions are:

  • Is there something I'm missing here?
  • Why is it not recommended to use Encoding.Default when reading a file? I know that a file in ANSI encoding would be displayed incorrectly if the system's default encoding or locale changes, but I don't care about that in my current case.
  • Is there another way to prevent this from happening?

Side note: Please don't mind the C# tag. Although my code is in VB, answers with C# code are welcome.

  • You cannot save non-latin characters in an ANSI file. You can see a table listing all of the possible characters saved here: https://en.wikipedia.org/wiki/Windows-1252, beyond that, everything else will get lost or translated into something else. – David Mar 06 '17 at 18:44
  • I think the 2nd answer from your first link explains pretty well why it is not recommended: http://stackoverflow.com/a/6006451/80274 – Scott Chamberlain Mar 06 '17 at 18:48
  • You must read the file in the same encoding the file is saved in. Otherwise you risk having bad data. The notepad "ANSI" encoding is also called ASCII or Windows-1252. – David Mar 06 '17 at 18:50
  • @David, my question is about reading *not saving* the data from a file. – 41686d6564 stands w. Palestine Mar 06 '17 at 18:52
  • @ScottChamberlain If the file is in ANSI encoding, yes it will change and I don't mind that. But how about if the file uses other encoding (e.g. utf8), would using `Encoding.Default` be a problem? Because it looks like it was able to recognize the characters just fine. Am I missing something? – 41686d6564 stands w. Palestine Mar 06 '17 at 18:55
  • Show how you wrote the file. ReadAllText will attempt to determine the encoding from the BOM, so if you wrote the file with a BOM, it won't matter what encoding you pass into ReadAllText. If you used `Encoding.UTF8`, that includes a BOM by default. – Mike Zboray Mar 06 '17 at 19:03
  • @mikez, you are absolutely right. When I removed the BOM from the file, I wasn't able to display the characters correctly using `Encoding.Default`. Please feel free to add this to an answer. Thank you, that really helped and cleared the confusion. – 41686d6564 stands w. Palestine Mar 06 '17 at 19:12

2 Answers

File.ReadAllText actually tries to auto-detect the encoding from a byte order mark (BOM). The encoding argument is used to decode the file only if no BOM is found. From the documentation:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks. Encoding formats UTF-8 and UTF-32 (both big-endian and little-endian) can be detected.

If you used Encoding.UTF8 to write the file, it would include a BOM. In that case the BOM takes precedence, and the Encoding.Default you pass in is likely being ignored.
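To see the BOM taking precedence, here is a minimal sketch rather than a definitive test. The module name `BomDemo` and the temp files are made up for illustration. ISO-8859-1 is used as a stand-in for the system ANSI code page, which is an assumption made so the result does not depend on the machine's locale (and because on .NET Core and later, Encoding.Default is UTF-8 rather than an ANSI code page).

```vb
Imports System
Imports System.IO
Imports System.Text

Module BomDemo
    Sub Main()
        ' "abc" plus two CJK characters, built with ChrW so the source file's
        ' own encoding does not matter.
        Dim sample As String = "abc " & ChrW(&H4E2D) & ChrW(&H6587)

        Dim withBom As String = Path.GetTempFileName()
        Dim withoutBom As String = Path.GetTempFileName()

        ' Encoding.UTF8 writes the EF BB BF preamble; New UTF8Encoding(False) does not.
        File.WriteAllText(withBom, sample, Encoding.UTF8)
        File.WriteAllText(withoutBom, sample, New UTF8Encoding(False))

        ' Assumption: ISO-8859-1 plays the role of the "wrong" ANSI encoding.
        Dim fallback As Encoding = Encoding.GetEncoding("iso-8859-1")

        ' BOM present: the BOM wins and the fallback argument is ignored.
        Console.WriteLine(File.ReadAllText(withBom, fallback) = sample)    ' True

        ' No BOM: the fallback is used and the UTF-8 bytes are misread.
        Console.WriteLine(File.ReadAllText(withoutBom, fallback) = sample) ' False

        File.Delete(withBom)
        File.Delete(withoutBom)
    End Sub
End Module
```

This matches what was observed in the comments: once the BOM is removed from the file, the encoding argument starts to matter.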

Mike Zboray

Using Encoding.Default is not recommended because it is the operating system's ANSI code page, which can only represent the characters in that code page. In other words, a text file created in Notepad (ANSI encoding) on a Czech Windows installation will be displayed incorrectly on an English one. For this reason, everything should be saved and opened in UTF-8 encoding.

  • Saved in ANSI and opened as Unicode: may not work
  • Saved in Unicode and opened as ANSI: will not work for any non-ASCII character
  • Saved in ANSI and opened in a different ANSI code page: may not work
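
The first two cases can be illustrated with a short in-memory sketch. It is not part of the original answer. The module name `MismatchDemo` is hypothetical, and ISO-8859-1 again stands in for an ANSI code page (an assumption). The third case needs two different code pages, which may require registering extra encodings on newer runtimes, so it is left out here.

```vb
Imports System
Imports System.Text

Module MismatchDemo
    Sub Main()
        ' Assumption: ISO-8859-1 plays the role of an "ANSI" code page.
        Dim ansiLike As Encoding = Encoding.GetEncoding("iso-8859-1")
        Dim utf8NoBom As Encoding = New UTF8Encoding(False)

        Dim plain As String = "cafe"                 ' ASCII only
        Dim accented As String = "caf" & ChrW(&HE9)  ' ends with e-acute

        ' Saved as ANSI, opened as UTF-8: ASCII survives, the accent does not.
        Console.WriteLine(utf8NoBom.GetString(ansiLike.GetBytes(plain)) = plain)       ' True
        Console.WriteLine(utf8NoBom.GetString(ansiLike.GetBytes(accented)) = accented) ' False

        ' Saved as UTF-8, opened as ANSI: the two-byte sequence becomes two wrong characters.
        Console.WriteLine(ansiLike.GetString(utf8NoBom.GetBytes(accented)) = accented) ' False
    End Sub
End Module
```

The "may not work" wording reflects the first result: text limited to ASCII reads the same either way, so the mismatch only shows up once other characters appear.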
Ondřej