
I was reading this:

Should source code be saved in UTF-8 format

I am using the Eclipse compiler library, but I need to read some Java source files in to feed them to that library. From that post, it seems source files can be stored in different encodings.

Is there one Charset I can use to read them in so it works every time? `Charset.forName("UTF-8")`, maybe?

thanks, Dean

Dean Hiller
    No. You have to read the file using the same character set that was used to save the file, whatever that may be. Usually, that is the PC's native code page, e.g. `CP1252` if you're running Windows in USA. You can standardize your own Java sources to always use UTF-8, if you want, but any files you get from elsewhere may need to be converted. – Andreas Jun 04 '16 at 07:27
  • so is there a way to detect the file encoding then? – Dean Hiller Jun 04 '16 at 14:36
  • Unfortunately no. UTF-16 files can usually be identified easily by a BOM. UTF-8 files shouldn't have a BOM, so there is no way to reliably tell the difference between UTF-8, CP1252, ISO 8859-1, CP1251, or any other code page. – Andreas Jun 04 '16 at 14:41
  • @DeanHiller No. There are hundreds of encodings, and most of them could be used to misinterpret a particular text file without throwing an exception during decoding or causing a compiler error. CP437 would work for any file. UTF-8 would not. – Tom Blodget Jun 04 '16 at 19:48

1 Answer


Character encodings vary

Any tool can write Java source code in any encoding. Even the idea of a `.java` file is not defined by the Java Language Specification. Any IDE can persist Java source code any way it wants, in any encoding.

The tools are responsible for ultimately providing a Unicode-compliant stream of characters into the compiler toolchain. How they collect and persist the source code is up to the particular tools.
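To see why the encoding matters, here is a minimal sketch showing that the very same source text yields different bytes on disk depending on the charset chosen to persist it (the class name and charsets are illustrative, not taken from the question):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingsVary {
    public static void main(String[] args) {
        // Source text containing a non-ASCII character: é
        String source = "class Café {}";

        byte[] utf8   = source.getBytes(StandardCharsets.UTF_8);      // é -> 0xC3 0xA9
        byte[] latin1 = source.getBytes(StandardCharsets.ISO_8859_1); // é -> 0xE9

        // Same characters, different bytes on disk.
        System.out.println(Arrays.toString(utf8));
        System.out.println(Arrays.toString(latin1));
    }
}
```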

The Java Language Specification states in Chapter 3 Lexical Structure:

Programs are written using the Unicode character set. Information about this character set and its associated character encodings may be found at http://www.unicode.org/.

So presumably a Java source code file would use one of the character encodings commonly used with Unicode, such as UTF-8, UTF-16, or UCS-2.
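So when feeding sources to something like the Eclipse compiler library, you must read each file with the charset it was actually saved in. A minimal sketch, assuming the file happens to be UTF-8 (the file path is hypothetical):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadSource {
    public static void main(String[] args) throws IOException {
        // This charset must match however the file was actually saved;
        // UTF-8 here is a guess, not a guarantee.
        Charset charset = StandardCharsets.UTF_8;
        String source = new String(
                Files.readAllBytes(Paths.get("Example.java")), charset); // hypothetical path
        System.out.println(source);
    }
}
```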

Section 3.2 Lexical Translations notes that a Java program can be written in an encoding such as ASCII by embedding Unicode escapes:

A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx.

While UTF-8 is common in my experience, that is not the only possible encoding. You must know or guess the encoding of any particular source file, and you must account for expanding any Unicode escapes.
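As a rough illustration of that second point, here is a simplified sketch of expanding \uxxxx escapes. The real translation rules in JLS 3.3 (counting preceding backslashes, repeated u characters as in \uuxxxx) are more involved than this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A simplified sketch of expanding Unicode escapes (a backslash, one or
// more 'u' characters, then four hex digits). The real rules in JLS 3.3,
// such as counting preceding backslashes, are more involved.
public class UnicodeEscapes {

    private static final Pattern ESCAPE = Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    public static String expand(String source) {
        Matcher m = ESCAPE.matcher(source);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16); // hex code unit -> char
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape below denotes the UTF-16 code unit 0x0041, i.e. 'A'.
        System.out.println(expand("String s = \\u0041;")); // prints: String s = A;
    }
}
```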

Other issues

By the way, note that at least in the Oracle JDK, the byte order mark (BOM), optional in UTF-8 files, is not allowed in Java source due to a bug (JDK-4508058) that will never be fixed (because of backward-compatibility concerns).
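If you cannot control your inputs, one workaround is to strip any leading UTF-8 BOM from the bytes before decoding. A minimal sketch (this is not part of any standard API, and the path is hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripBom {
    // The UTF-8 encoding of U+FEFF, the optional byte order mark.
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    public static byte[] withoutBom(byte[] bytes) {
        if (bytes.length >= 3
                && bytes[0] == UTF8_BOM[0]
                && bytes[1] == UTF8_BOM[1]
                && bytes[2] == UTF8_BOM[2]) {
            byte[] stripped = new byte[bytes.length - 3];
            System.arraycopy(bytes, 3, stripped, 0, stripped.length);
            return stripped;
        }
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("Example.java")); // hypothetical path
        byte[] clean = withoutBom(raw);
        System.out.println("Stripped " + (raw.length - clean.length) + " BOM bytes");
    }
}
```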

Also note that line terminators may vary: the ASCII characters CR (CARRIAGE RETURN), LF (LINE FEED), or the sequence CR LF.
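Conveniently, `BufferedReader.readLine` treats all three terminators alike, so reading line by line works regardless of which one the file uses. A small sketch, again with a hypothetical path and a guessed charset:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineTerminators {
    public static void main(String[] args) throws IOException {
        // readLine() accepts CR, LF, and CR LF as line terminators,
        // so the terminator style of the file does not matter here.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("Example.java"), StandardCharsets.UTF_8)) { // hypothetical path
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```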

White space varies: SPACE (SP), CHARACTER TABULATION (HT) (horizontal tab), FORM FEED (FF), and line terminators.

Read the spec for additional details. For example, regarding the SUBSTITUTE character:

As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.
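If you are normalizing source text yourself, that concession is easy to mirror. A tiny sketch that drops a single trailing SUB character:

```java
public class StripSub {
    /** Removes a single trailing SUB (control-Z, 0x1A) character, if present. */
    public static String withoutTrailingSub(String source) {
        int len = source.length();
        if (len > 0 && source.charAt(len - 1) == 0x1A) {
            return source.substring(0, len - 1);
        }
        return source;
    }

    public static void main(String[] args) {
        String withSub = "class A {}" + (char) 0x1A;
        System.out.println(withoutTrailingSub(withSub)); // prints: class A {}
    }
}
```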

About character encoding

Be sure you understand the basics of Unicode and of character encoding. The best place to start is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.


Even supposed rules such as “one public class per .java file” may be defined by particular tools rather than by Java itself. The CodeWarrior tools for Java way-back-when supported multiple classes per file.

Basil Bourque
    Nice write-up, but you don't fully cover the point of the question, i.e. the encoding of the `.java` source files. [`javac`](https://docs.oracle.com/javase/8/docs/technotes/tools/windows/javac.html) will default to the OS code page: *If the `-encoding` option is not specified, then the **platform default** converter is used.* Eclipse (mentioned in comment) can handle that every `.java` source file is using a different code page, but if you ever want to compile outside of Eclipse, you better use only one code page for all your source files. If not the default, it has to be explicitly given. – Andreas Jun 04 '16 at 14:46
  • @Andreas Seems you are focused on the *output*, on the supposed fact that Eclipse tolerates a mixture of source files in various character encodings. If true for Eclipse in general and for the “eclipse compiler lib” in particular, (I don't know such facts), that does seem important enough to warrant posting as another Answer here. But my Answer addresses the title (“when reading in a java source file”) and the last sentence (“one Charset I can use to read”), about *input*, about what character encoding to expect with Java source files. As my opening header says, “Character encodings vary”. – Basil Bourque Jun 07 '16 at 00:00
  • Don't know where you got *output* from. I'm talking about the encoding of `.java` source files, i.e. the *input* to the Java compiler. My point was that although Eclipse supports mixed character encodings, `javac`, Ant, Maven, Gradle, etc. all do mass compilation using a single charset, so it's a good idea to use a single charset for all source files. Sure, if you only ever build using Eclipse, you can use the mixed-charset feature, but not having a build tool in addition to the IDE is rare, at least outside schools. – Andreas Jun 07 '16 at 00:59