
I am writing some text into a file using the FileWriter object. I am specifying that I want the output to be in UTF-8, but when I open the text file in Notepad and go to Save As, I see that it is in ANSI encoding.

I also want to add that when there are characters outside the standard ASCII charset (e.g. Japanese characters) the file encoding shows as UTF-8, but without them the text file encoding shows as ANSI.

File json_file = new File(path);
// FileWriter(File, Charset) requires Java 11 or later
try (FileWriter json_file_output = new FileWriter(json_file, StandardCharsets.UTF_8)) {
    json_file_output.write("SOME JSON TEXT HERE");
    json_file_output.flush();
}

I am not sure whether it is due to my Java code or to Notepad.
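One way to tell the two apart (a minimal sketch, assuming Java 11+ for the `FileWriter(File, Charset)` constructor; the temp-file path is only for illustration) is to write a single non-ASCII character and count the bytes on disk. The character あ (U+3042) occupies three bytes in UTF-8, so a three-byte file proves the writer really emitted UTF-8, regardless of what Notepad reports:

```java
import java.io.FileWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("utf8check", ".txt");
        try (FileWriter out = new FileWriter(p.toFile(), StandardCharsets.UTF_8)) {
            out.write("あ");  // U+3042, encoded in UTF-8 as E3 81 82
        }
        byte[] bytes = Files.readAllBytes(p);
        // 3 bytes on disk -> the output really is UTF-8 (UTF-16 would be 2 or 4)
        System.out.println(bytes.length);
    }
}
```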

Thank you for the help.

SuperAadi
    The text `SOME JSON TEXT HERE` is encoded identically in both UTF-8 and ASCII. Show us your *real* input and the result. –  Aug 11 '19 at 06:58
  • Does notepad auto detect the encoding? – SuperAadi Aug 11 '19 at 07:02
  • I think it is your notepad issue (Maybe it is default to ASCII code). I ran the code with non-ascii (`x√ab c`) characters and the file was written correctly. – Yoshikage Kira Aug 11 '19 at 07:04
  • I know it will write all characters correctly, but with just ASCII characters in the output, when I look at 'Save As' the encoding is ANSI, even though I specified the output to be in Unicode. – SuperAadi Aug 11 '19 at 07:07
    That's just the behaviour of Notepad. You *are* writing out UTF-8. Try writing out text with non-ASCII characters and you'll see that. Every file which only contains ASCII characters is the same whether encoded in ASCII or UTF-8. – Jon Skeet Aug 11 '19 at 07:09

1 Answer


Unicode is a superset of the US-ASCII character set;
UTF-8 is a superset of the US-ASCII character encoding as written in 8-bit octets

There is no such thing as ANSI encoding. See What is ANSI format?.

Likely what is meant is US-ASCII. Every US-ASCII file, written out using octets, is also a valid UTF-8 file, because Unicode is a superset of US-ASCII. UTF-8 was designed this way on purpose, to be backward-compatible with ASCII.

US-ASCII is a 7-bit character set, having only 128 characters, numbered 0-127. So if written using octets (8-bits), the first bit of every octet is a zero. See the Wikipedia page on UTF-8 encoding, and notice the role played by the first bit.
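This compatibility can be demonstrated directly: encoding an ASCII-only string with either charset yields byte-for-byte identical output (a minimal sketch, using the same sample text as the question):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiUtf8 {
    public static void main(String[] args) {
        String s = "SOME JSON TEXT HERE";
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = s.getBytes(StandardCharsets.UTF_8);
        // Identical byte sequences -> no tool can distinguish the two encodings
        System.out.println(Arrays.equals(ascii, utf8)); // true
    }
}
```

Since the two files are bit-identical, Notepad has no way to know which encoding was intended; it can only guess.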

No file meta-data

Understand that both US-ASCII files and UTF-8 files (without a BOM) are just a bunch of bits, with no meta-data. The computer industry never managed to establish a standard for file system meta-data, unfortunately. So an app has to guess the content’s encoding, or the user must indicate the expected format.

Your text editor is likely looking at the range of characters found in your file, and then trying to be helpfully conservative by labeling the file with the smallest-scope encoding possible. If it finds only US-ASCII characters, it labels the file as US-ASCII (and apparently misreports that as “ANSI”). As soon as you add characters with code points beyond the ASCII range, it labels the file as UTF-8.
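If you need Notepad to report UTF-8 even for an ASCII-only file, one common workaround is to write a byte-order mark (U+FEFF) as the first character, which UTF-8 encodes as the bytes `EF BB BF`. A minimal sketch (the temp-file path is only for illustration; note that a BOM is optional in UTF-8 and some consumers, including strict JSON parsers per RFC 8259, may reject it):

```java
import java.io.FileWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomWriter {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("bom", ".txt");
        try (FileWriter out = new FileWriter(p.toFile(), StandardCharsets.UTF_8)) {
            out.write('\uFEFF');               // BOM; UTF-8 encodes it as EF BB BF
            out.write("SOME JSON TEXT HERE");  // ASCII-only payload
        }
        byte[] header = Files.readAllBytes(p);
        // First three bytes are the BOM, which editors use as a UTF-8 hint
        System.out.printf("%02X %02X %02X%n",
                header[0] & 0xFF, header[1] & 0xFF, header[2] & 0xFF);
    }
}
```

With the BOM present, an editor no longer has to guess; the three-byte signature marks the file as UTF-8.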


For background info, such as the distinction between character set and character encoding, see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Basil Bourque
  • In Windows, "ANSI encoding" is [a real thing](https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp): the selected character encoding from several standard character encodings (ANSI and/or IANA or other). And almost certainly not US-ASCII. – Tom Blodget Aug 12 '19 at 15:50