37

How important is it to save your source code in UTF-8 format?

Eclipse on Windows uses the CP1252 character encoding by default. With CP1252, characters that do not form valid UTF-8 byte sequences can be saved, and I have seen this happen when copying and pasting from a Word document into a comment.

The reason I ask is that, out of habit, I set up the Maven encoding to be UTF-8, and recently it has caught a few "unmappable character" errors.
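For reference, this is the kind of setting I mean (a minimal POM fragment; `project.build.sourceEncoding` is the standard Maven property, and the rest of the pom.xml is omitted):

```xml
<!-- pom.xml fragment: tells Maven's compiler and resource plugins to read
     and write sources as UTF-8 instead of the platform default encoding -->
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```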

(update) Please add any reasons for doing so and why; are there common gotchas that should be known?

(update) What is your goal? To find the best practice, so that when asked why we should use UTF-8 I have a good answer; right now I don't.

JARC
  • Non-UTF-8 characters? If CP1251 really has those then I'd rather not have them in source code. – starblue Feb 01 '10 at 18:48
  • UTF-8 can encode ALL of the characters that Java can use (Unicode). This table seems to imply that every character in CP1251 can be mapped to a Unicode character. I don't know what "non mappable errors" means, except possibly if Maven is using an internal, more restrictive character set. http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1251.TXT – AgilePro Mar 01 '13 at 15:37

5 Answers

27

What is your goal? Balance your needs against the pros and cons of this choice.

UTF-8 Pros

  • allows use of all character literals without \uHHHH escaping (demonstrated in the sketch after the note below)

UTF-8 Cons

  • using non-ASCII character literals without \uHHHH increases risk of character corruption
    • font and keyboard issues can arise
    • need to document and enforce use of UTF-8 in all tools (editors, compilers, build scripts, diff tools)
  • beware the byte order mark
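To make the byte order mark gotcha concrete: the UTF-8 BOM is just the bytes EF BB BF (U+FEFF encoded as UTF-8) at the start of a file, and some tools, javac historically among them, reject files that begin with it. A minimal detection sketch, assuming you want to check a stream by hand (the helper method is my own, not a library API):

```java
import java.io.IOException;
import java.io.InputStream;

public class BomCheck {
    // Returns true if the stream begins with the UTF-8 byte order mark,
    // i.e. the three bytes EF BB BF (U+FEFF encoded as UTF-8).
    // Note: this consumes the bytes it inspects.
    static boolean startsWithUtf8Bom(InputStream in) throws IOException {
        return in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF;
    }
}
```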

ASCII Pros

  • character/byte mappings are shared by a wide range of encodings
    • makes source files very portable
    • often obviates the need for specifying encoding meta-data (since the files would be identical if they were re-encoded as UTF-8, Windows-1252, ISO 8859-1 and most things short of UTF-16 and/or EBCDIC; also demonstrated in the sketch after the note below)

ASCII Cons

  • limited character set
  • this isn't the 1960s

Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.
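Both of those points can be seen in a few lines of Java (a minimal sketch; the class name and strings are mine, and the literal comparison only holds when the compiler reads the file with the encoding it was saved in):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // UTF-8 pro: write the character directly instead of escaping it.
        String direct  = "café";
        String escaped = "caf\u00E9";
        System.out.println(direct.equals(escaped));              // true

        // ASCII pro: pure-ASCII text encodes to identical bytes in
        // US-ASCII and UTF-8, so ASCII-only files are portable.
        byte[] a1 = "cafe".getBytes(StandardCharsets.US_ASCII);
        byte[] a2 = "cafe".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(a1, a2));               // true

        // With a non-ASCII character the encodings diverge; this is where
        // a mismatch between editor and compiler corrupts characters.
        byte[] latin1 = direct.getBytes(StandardCharsets.ISO_8859_1); // ... E9
        byte[] utf8   = direct.getBytes(StandardCharsets.UTF_8);      // ... C3 A9
        System.out.println(Arrays.equals(latin1, utf8));              // false
    }
}
```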

McDowell
  • What is your goal? To find the best practice, so that when asked why we should use UTF-8 I have a good answer - thanks for the post. – JARC Feb 01 '10 at 17:26
  • There is only one good reason to store sources as UTF-8: if you comment in a language that needs non-ASCII characters. For UI/messages the strings should be stored in some kind of resource files/message catalogs. Good internationalization practice. – Mihai Nita Feb 03 '10 at 09:18
  • UTF-8 does not use a byte order mark. While it can use multiple bytes to represent a single Unicode code point, it is not a multibyte character set. UTF-16 uses two bytes (or four with a surrogate) so byte order is relevant there. Think of it this way. UTF-8 "consumes" one byte at a time from an input stream, possibly consuming multiple bytes in succession to put together a code point. UTF-16 consumes two bytes at a time, so the order matters. – Snowman Oct 15 '14 at 21:03
  • @Snowman While it's true that UTF-8 doesn't *use* a byte order mark, it still has one: the byte sequence EF BB BF (yes, the byte order mark for UTF-8 is longer than the byte order marks for UTF-16 despite being a NOOP). All it does is mark that a file is UTF-8 and not ASCII. – Powerlord Mar 02 '15 at 17:11
  • Good point regarding the 1960s. There was nothing wrong with the 1960s except that computing kind of sucked. – diynevala May 19 '15 at 11:11
6

At the very least, it is important to be consistent in the encoding used, to avoid red herrings. So not X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set character-based FTP transfer to encoding X. Et cetera.
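For instance, on the Java side, "set code input/output to encoding X" means passing the charset explicitly instead of relying on the platform default. A sketch using the java.nio.file API from Java 7+ (the file name is made up):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("notes.txt"); // hypothetical file

        // Output in encoding X: state the charset instead of defaulting.
        try (BufferedWriter out =
                Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write("café");
        }

        // Input in the same encoding X; reading this back with a different
        // charset is exactly the "X here, Y there" mismatch to avoid.
        try (BufferedReader in =
                Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            System.out.println(in.readLine()); // café
        }
    }
}
```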

Nowadays UTF-8 is a good choice, as it covers every character the human world is aware of and is supported pretty much everywhere. So yes, I would set the workspace encoding to it as well; I use it myself too.

BalusC
  • What herrings? If source is built on Windows and executed on *nix would that be a good reason to define your encoding? – JARC Feb 01 '10 at 17:18
  • I assume these are rare but very possible. – JARC Feb 01 '10 at 17:28
  • For example, yes. The default encoding differs between the two platforms. This does not affect the technical functionality of Java code in any way, however (Java literals/keywords are already part of ASCII, which is basically the base of all other encodings, except for EBCDIC, but that's a different story), but it *may* result in erroneous input/output. – BalusC Feb 01 '10 at 17:29
  • No, Java identifiers are not necessarily ASCII-only. This is a valid int declaration (at least javac and Eclipse accept it): int é\u1212; – penpen Feb 01 '10 at 19:24
  • @penpen: I was talking about **literals/keywords** like `public`, `class`, `null`, etc, not about identifiers. – BalusC Feb 01 '10 at 20:05
  • Sorry, I should have taken my time before commenting. – penpen Feb 01 '10 at 23:10
6

Eclipse's default setting of using the platform default encoding is a poor decision IMHO. I found it necessary to change the default to UTF-8 shortly after installing it, because some of my existing source files already used it (probably from snippets copied/pasted from web pages).

The Java Language and API specs require UTF-8 support, so you're definitely okay as far as the standard tools go, and it's been a long time since I've seen a decent editor that did not support UTF-8.

Even in projects that use JNI, your C sources will normally be in US-ASCII, which is a subset of UTF-8, so having both open in the same IDE will not be a problem.

finnw
  • What about users trying to compile their old source files with special characters in them? Eclipse's decision seems to be directly linked to the behaviour of javac, which by default uses the platform's default encoding. – Adam Kurkiewicz Oct 15 '17 at 15:23
  • Eclipse nowadays uses UTF-8 by default. Don’t know in which version it changed, though. – Holger Jun 19 '23 at 08:09
2

Yes, unless your compiler/interpreter is not able to work with UTF-8 files, it is definitely the way to go.
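As a concrete example, javac lets you state the source encoding explicitly, as the comment below notes (the file name is illustrative):

```
# Compile a source file saved as UTF-8, regardless of the platform default.
javac -encoding UTF-8 Main.java
```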

poke
  • ...which in javac can be controlled with the `-encoding` argument, by the way. Good point though, +1. – BalusC Feb 01 '10 at 16:51
2

I don't think there's really a straight yes or no answer to this question. I would say that the following guidelines should be used to pick an encoding format, in order of priority (highest to lowest):

1) Pick an encoding your tool chain supports. This is a lot easier than it used to be. Even in recent memory a lot of compilers and languages basically only supported ASCII, which more or less forced developers into coding in Western European languages. These days, many of the newer languages support other encodings, and almost all decent editors and IDEs support a tremendously long list of encodings. Still... there are just enough holdouts that you need to double check before you settle on an encoding.

2) Pick an encoding that supports as many of the alphabets you wish to use as possible. I place this as a secondary priority because, frankly, if your tools don't support it, it doesn't really matter how much you like the encoding.

UTF-8 is an excellent choice in many circumstances in today's world. It's an ugly, inelegant format, but it solves a whole host of problems (namely dealing with legacy code) that break other encodings, and it seems to be becoming more and more the de facto standard of character encodings. It supports every major alphabet, darn near every editor on the planet supports it now, and a whole host of languages/compilers support it, too. But as I mentioned above, there are just enough legacy holdouts that you need to double-check your tool chain from end to end before you settle on it definitively.

Russell Newquist
  • Strongly disagree with the "ugly, inelegant format" part. UTF-8 is pretty much a masterpiece as far as I'm concerned: backwards-compatible, more space-efficient than most people think (yes, even for Asian languages), can be picked up mid-stream, easily identifiable in most cases, doesn't require a BOM, binary-sortable... – Cowan Feb 01 '10 at 22:51
  • Don't misunderstand me - given the constraints under which they were working, I'm quite impressed with the format. But the honest reality is that if we were starting from scratch today, we'd just be using a straight 32 or 64-bit character set, end of story. Pure elegance in its simplest form. – Russell Newquist Feb 02 '10 at 20:27
  • You really should NOT pick any encoding other than UTF-8 or ASCII. UTF-8 supports all the Java characters (that is important). ASCII does not, but is portable everywhere. Any other choice for encoding is likely to be a problem somewhere along the line. – AgilePro Mar 01 '13 at 15:32