Handling non-English characters using Eclipse

Question

Below is the text that I would like to paste into a bundle.properties file using Eclipse.

Честит рожден ден

Instead, Eclipse displays these characters in Unicode escape notation, as shown below: \u0427\u0435\u0441\u0442\u0438\u0442 \u0440\u043E\u0436\u0434\u0435\u043D \u0434\u0435\u043D

How do I resolve this problem?

What is the default encoding of your eclipse installation and of the file? — aw-think, Jun 07 '15 at 08:15
Properties files are supposed to be encoded in ISO-8859-1, which doesn't support your characters. That's why Unicode escape sequences are necessary. Hovering over a property, or pressing F2, shows the "original" text. But I don't think you can view the original text in the editor in Eclipse. IntelliJ can do that. — JB Nizet, Jun 07 '15 at 08:31
This is wrong. Since you use Cyrillic characters, switch everything to UTF-8. This can be done as described here: http://stackoverflow.com/questions/3751791/how-to-change-default-text-file-encoding-in-eclipse or here: http://stackoverflow.com/questions/9180981/how-to-support-utf-8-in-eclipse — aw-think, Jun 07 '15 at 10:57
@NwDx But with the current setting, if I call System.out.println on that Unicode notation, it gets printed as the actual Bulgarian characters. How do I understand this? — overexchange, Jun 07 '15 at 11:01
Because the correct encoding is set for your console. Eclipse only uses your system console, so this is correct. — aw-think, Jun 07 '15 at 11:04
You do not have to set it; it is set by your operating system, such as Windows or Linux. So if your Cyrillic characters show up correctly in other applications, everything is fine. Only the file encoding in Eclipse is wrong. — aw-think, Jun 07 '15 at 11:10
@NwDx What about Run -> Run Configurations -> Common tab -> Console encoding in Eclipse? — overexchange, Jun 07 '15 at 11:17
@NwDx When we write Java code, we must use the UTF-16 format, like `char ch = '\u0041'` or `int \u00A5 = 200;`. So how can you recommend that a Java file be UTF-8? — overexchange, Jun 08 '15 at 08:12
UTF-8 is what you need to write and show Cyrillic characters. Other encodings that cover these characters are possible too, but as far as I know, all developers work with UTF-8 because you can use it on Mac/Linux/Windows without trouble. Eclipse uses cp1252 as a default. In environments with heterogeneous operating systems, everyone switches to UTF-8. — aw-think, Jun 08 '15 at 08:13
@NwDx I did not get you. [JLS-section3.1](https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.1) says `The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding` section 3.3 gives syntax as `\ UnicodeMarker HexDigit HexDigit HexDigit HexDigit`. So, the complete java code can be written in UTF-16 format like `int \u00A5 = 200;`. — overexchange, Jun 08 '15 at 08:18

score 1 · Answer 1 · answered Oct 29 '18 at 10:16

This is intended behavior. The PropertyResourceBundle relies on the Properties class, whose load(InputStream) method always assumes the file is encoded in ISO-8859-1 (Latin-1)¹:

The input stream is in a simple line-oriented format as specified in load(Reader) and is assumed to use the ISO 8859-1 character encoding; that is each byte is one Latin1 character. Characters not in Latin1, and certain special characters, are represented in keys and elements using Unicode escapes as defined in section 3.3 of The Java™ Language Specification.

So converting your copied characters to Unicode escape sequences is the right thing to do to ensure that they will be loaded properly. At runtime, the ResourceBundle will contain the right character content.
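To illustrate, here is a minimal sketch that loads the escaped line from the question through Properties.load(InputStream). The class name and the key name "greeting" are made up for this example; the question does not show the actual key.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class EscapedPropertiesDemo {

    // File content as Eclipse stores it: pure ASCII, with every Cyrillic
    // character written as an escape. The doubled backslashes keep the
    // escapes literal in the string. "greeting" is a hypothetical key.
    static final String FILE_CONTENT =
        "greeting=\\u0427\\u0435\\u0441\\u0442\\u0438\\u0442 "
        + "\\u0440\\u043E\\u0436\\u0434\\u0435\\u043D \\u0434\\u0435\\u043D\n";

    static String loadGreeting() {
        Properties props = new Properties();
        // Only ASCII bytes are present, so ISO-8859-1 encoding is lossless here.
        byte[] bytes = FILE_CONTENT.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // load(InputStream) decodes as ISO-8859-1 and resolves the escapes.
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        // Shows the Cyrillic text, provided the console encoding supports it.
        System.out.println(loadGreeting());
    }
}
```

The returned string holds the real Cyrillic characters, not the escape notation, which is also why printing the value shows the original text.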

In Eclipse, source files usually inherit their charset setting from their parent, ending up at the project or even system-wide setting. Eclipse also supports setting the charset encoding for single files, and it conveniently sets it to ISO-8859-1 automatically for .properties files.

Note that starting with Java 9, you can use UTF-8 for properties resource bundles. This does not require any additional configuration, as the charset encoding is determined by probing. As the documentation of the PropertyResourceBundle(InputStream) constructor states:

This constructor reads the property file in UTF-8 by default. If a MalformedInputException or an UnmappableCharacterException occurs on reading the input stream, then the PropertyResourceBundle instance resets to the state before the exception, re-reads the input stream in ISO-8859-1 and continues reading. If the system property java.util.PropertyResourceBundle.encoding is set to either "ISO-8859-1" or "UTF-8", the input stream is solely read in that encoding, and throws the exception if it encounters an invalid sequence.

This works because both encodings are identical for ASCII characters, while for non-ASCII content, it practically never happens in real-life text that an ISO-8859-1 byte sequence also forms a valid UTF-8 sequence. The probing is done by PropertyResourceBundle only, not by the Properties class, whose load(InputStream) method still uses ISO-8859-1 exclusively.
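The difference can be sketched as follows, assuming Java 9 or later and that the system property java.util.PropertyResourceBundle.encoding is not set. The class name and the key "greeting" are again made up for illustration. The last method uses Properties.load(Reader) with an explicit charset, which is an alternative for plain Properties rather than something the probing provides.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.PropertyResourceBundle;

public class Utf8PropertiesDemo {

    // The greeting from the question, written with escapes so this source
    // compiles identically regardless of the compiler's source encoding.
    static final String TEXT =
        "\u0427\u0435\u0441\u0442\u0438\u0442 \u0440\u043E\u0436\u0434\u0435\u043D \u0434\u0435\u043D";

    // Simulates a bundle.properties file saved as UTF-8, without escapes.
    static byte[] utf8File() {
        return ("greeting=" + TEXT + "\n").getBytes(StandardCharsets.UTF_8);
    }

    // Java 9+: reads UTF-8 first, falls back to ISO-8859-1 on invalid input.
    static String viaBundle() {
        try {
            return new PropertyResourceBundle(new ByteArrayInputStream(utf8File()))
                .getString("greeting");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Always decodes as ISO-8859-1, so the UTF-8 bytes come out garbled.
    static String viaPropertiesStream() {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(utf8File()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    // Plain Properties can read UTF-8 when given a Reader with that charset.
    static String viaPropertiesReader() {
        Properties props = new Properties();
        try {
            props.load(new InputStreamReader(
                new ByteArrayInputStream(utf8File()), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        System.out.println(TEXT.equals(viaBundle()));
        System.out.println(TEXT.equals(viaPropertiesStream()));
        System.out.println(TEXT.equals(viaPropertiesReader()));
    }
}
```

Under the stated assumptions, only the second comparison should fail, since load(InputStream) treats each UTF-8 byte as one Latin-1 character.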

¹ I kept the statement in this absolute form for simplicity, even though, as elaborated at the end of this answer, Java 9 has lifted this restriction.
