0

ANSI seems limited compared to say UTF-8, yet it's the default file encoding in Notepad++, so I was wondering.

Emanuil Rusev
  • people using anything other than ASCII in source code should be shot and fired. You may think ANSI or UTF-8 would make sense, but it doesn't, unless the language spec specifies the encoding. Strings and whatnot **must** be externalized or your codebase is a joke. Many will disagree, but in a mixed OS / IDE / "text editor" etc. environment you are **begging** for big trouble if your source code isn't ASCII-only. I've myself written scripts that make the build fail if **ANY** source file isn't ASCII, for languages that do not mandate a particular file encoding. – SyntaxT3rr0r Aug 20 '11 at 23:11
  • and honestly when I see people having *"source file editing / parsing / build scripts"* issues related to file encoding, I don't know if I should laugh or cry. The root of the issue is simple: if the language doesn't specify the encoding you're toast if you use **anything** but ASCII because there's no metadata. Simple as that. – SyntaxT3rr0r Aug 20 '11 at 23:14
  • note that some languages, like Google's Go if I'm not mistaken, do specify in the spec (!) UTF-8 as being the mandatory file encoding. In that case, and in that case only, non-ASCII is fine. But you don't get to "choose" between ANSI or UTF-8 or EBCDIC: you use what the language spec specifies. – SyntaxT3rr0r Aug 20 '11 at 23:16
  • See also: http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences – Kev Aug 21 '11 at 11:51
  • What the heck is this ANSI thing you’re muttering about? No such thing. As for @SyntaxT3rr0r, yes, this is yet another of Java’s million horrible misdesign errors, that you must rely on some external metadata to describe the internal file encoding. More modern approaches can be found in XML, Perl, Ruby, and Python, all of which have a default but let you override that with something internal to the file so that the metadata describing the file’s encoding can never be lost. – tchrist Aug 26 '11 at 18:00

3 Answers

3

Well, if you can encode everything in ANSI (whatever ANSI happens to mean on your computer; it's a horribly ambiguous term), then it may be shorter than in UTF-8. For non-ASCII characters, ANSI encodings can still encode each character in a single byte, whereas they'll take more bytes in UTF-8.

It's a tiny advantage though, and the disadvantages are significant IMO - I would definitely go with UTF-8.

Jon Skeet
1

Strictly speaking, a "benefit" might be that fewer bytes are taken up by using it, since there are many characters that are encoded in one byte in ANSI and two to three in UTF-8. For example, the florin, mdash, ndash, the times symbol, and some accented roman letters.
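To make the size difference concrete, here is a minimal sketch. It assumes "ANSI" means Windows-1252 (Python's `cp1252` codec), which is only one of the code pages the term can refer to; the `samples` dictionary and the `byte_sizes` helper are illustrative names, not part of any API.

```python
# Assumption: "ANSI" is taken to mean Windows-1252 (cp1252).
# Other Windows code pages would give different results.
samples = {
    "florin": "\u0192",
    "em dash": "\u2014",
    "en dash": "\u2013",
    "times": "\u00d7",
    "e acute": "\u00e9",
}


def byte_sizes(ch):
    """Return (bytes in cp1252, bytes in UTF-8) for a single character."""
    return len(ch.encode("cp1252")), len(ch.encode("utf-8"))


for name, ch in samples.items():
    ansi, utf8 = byte_sizes(ch)
    print(f"{name}: cp1252={ansi} byte(s), utf-8={utf8} byte(s)")
```

Each of these characters takes one byte in cp1252 and two or three in UTF-8, so the saving only matters for text that is dominated by such characters.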

There are native operations in the Windows API that might be a hair faster.

You give up a lot though, in restricting yourself to 256 characters as opposed to UTF-8's one million plus.

Ray Toal
1

Expanding on Jon's answer:

Space requirements for UTF-8 encoding, as extracted from Wikipedia's UTF-8 article and formatted/combined slightly:

  1. So the first 128 (range [0, 0x7f]) characters (US-ASCII) need one byte.
  2. The next 1,920 (range [0x80,0x07ff]) characters need two bytes to encode. ...
  3. Three bytes are needed for the rest (range [0x0800,0xffff]) of the Basic Multilingual Plane (which contains virtually all characters in common use).
  4. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
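The four ranges above can be sketched as a small function. This is an illustration of the rule rather than how any particular encoder is implemented, and `utf8_length` is a hypothetical helper name.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 needs for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:    # US-ASCII
        return 1
    if code_point <= 0x7FF:   # the next 1,920 code points
        return 2
    if code_point <= 0xFFFF:  # rest of the Basic Multilingual Plane
        return 3
    return 4                  # the other planes


# Cross-check against Python's own encoder for one code point per range.
# Surrogates (0xD800-0xDFFF) are avoided because they cannot be encoded
# on their own.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```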

Looking at an ANSI to Unicode mapping, it can be seen that half the ANSI characters (the ASCII set) align with Unicode (1-byte encoding), a number of the values over 127 fall within the [0x80,0x07ff] Unicode range (2 bytes), and there are less common values which map into Unicode above 0x07ff (requiring 3 bytes to encode in UTF-8).
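The same breakdown can be tallied programmatically. The sketch below again assumes "ANSI" means Windows-1252 (`cp1252` in Python); `utf8_size_census` is a hypothetical helper, and byte values that Python's codec leaves undefined are counted separately.

```python
from collections import Counter


def utf8_size_census(codec="cp1252"):
    """Count how many of the 256 byte values need 1, 2 or 3 UTF-8 bytes."""
    counts = Counter()
    for value in range(256):
        try:
            ch = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # Byte value with no character assigned in this code page.
            counts["unassigned"] += 1
            continue
        counts[len(ch.encode("utf-8"))] += 1
    return counts


census = utf8_size_census()
for key in (1, 2, 3, "unassigned"):
    print(key, census[key])
```

Passing a different codec name (for example `"cp1251"`) shows how the split changes from one "ANSI" code page to another.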

Now, as for why that is the default encoding -- talk to the Notepad++ author :)

Happy coding.