Transforming unicode characters to a string containing their u+[hexa] representation ("\u2030")

Question

I am working with Java 8 and I18N. As I understand it, .properties files (and the I18N code that reads them) assume the files are encoded in ISO-8859-1. As a result, I'm having trouble with characters that cannot be represented in that encoding.

Changing from a file writer to an OutputStreamWriter won't help since the other end of the code won't be able to read these characters anyway.

I did come up with a solution that works, but it is highly inelegant.

StringBuilder utfRepresentation = new StringBuilder();
for (int index = 0; index < input.length(); index++) {
    if (!Charset.forName("ISO-8859-1").newEncoder().canEncode(input.charAt(index))) {
        utfRepresentation.append("\\u");
        utfRepresentation.append(Integer.toHexString(input.codePointAt(index)));
    } else {
        utfRepresentation.append(input.charAt(index));
    }
}

Now I do need to do other things like extract the encoder instead of making a new one every time, but my question is something else entirely:

1) Is there a cleaner way of transforming ‰ into \u2030?
2) What even is this U+2030? UTF-8/16?
3) Is there a better way of creating that charset / encoder? Something that isn't static? Can I extract it from the file, or from a file reader / writer?

I assume you're using `ResourceBundle.getBundle("yourfile.properties")` or the variant accepting a locale as well, correct? Unfortunately that method will assume the files to be ISO-8859-1 encoded. However, if you use the variant that accepts a `Control` instance you can provide an instance that actually loads the .properties file with UTF-8 encoding. More information can be found here (second part of BalusC's answer): https://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-resource-properties-with-resourcebundle/4660195 — Thomas, Mar 11 '19 at 08:42
Like I said, that is not a solution, check [this](https://docs.oracle.com/javase/9/intl/internationalization-enhancements-jdk-9.htm#JSINT-GUID-974CF488-23E8-4963-A322-82006A7A14C7) — Kalec, Mar 11 '19 at 08:45
I'm not quite sure why you say that is not a solution. I assume you are referring to "since the other end of the code won't be able to read these characters anyway", but in that case you should provide information about what that other end of the code is and whether you can change it or not. Also note that the link you've provided states that as of Java 9 the default for .properties is UTF-8, so you'd "just" need to make sure all characters are encoded correctly. — Thomas, Mar 11 '19 at 08:49
The I18N classes extend eclipse NLS, it is the framework that does the magic, I cannot touch anything related to the framework, I have no intention of even trying. Check [this](https://github.com/eclipse/smarthome/issues/2639) also. — Kalec, Mar 11 '19 at 09:02
"The I18N classes extend eclipse NLS" - ok that's another matter. It would have helped to state so in your question. You _did_ mention "I18N" but there are many possibilities for what you are referring to: i18n in general (as in "internationalization" like the tag you've used), I18N classes from various frameworks etc. - Note that I don't intend to bitch about your question, it's just meant as feedback for future questions because having more details will enable others to help in more specific ways (often questions like this look like an [xy-problem](https://mywiki.wooledge.org/XyProblem)). — Thomas, Mar 11 '19 at 09:33

Joop Eggen · Accepted Answer · 2019-03-11T09:20:23.353

As a historical anomaly, .properties files are in ISO-8859-1, for which you can use StandardCharsets.ISO_8859_1 (if not on Android).

However, for other characters you may use u-escaping, e.g. \u2030. U+2030 itself is the Unicode code point, independent of any encoding; the escape \u2030 represents one UTF-16 code unit as stored in a single char (two bytes). Some Unicode symbols exceed the two-byte limit and are encoded as a "surrogate" pair, which takes two consecutive escapes.
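
To make the surrogate point concrete, here is a minimal sketch. The class name `SurrogateDemo` and the helper `escapeAll` are made up for illustration, and U+1F600 is just an arbitrary character outside the two-byte range. The helper escapes every char of a string, so a character above U+FFFF should come out as two escapes:

```java
class SurrogateDemo {
    // Hypothetical helper: writes one backslash-u escape per UTF-16 code unit (char).
    static String escapeAll(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            // The cast matters: %X accepts integral types, not Character.
            sb.append(String.format("\\u%04X", (int) input.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String perMille = "\u2030";                            // fits in a single char
        String emoji = new String(Character.toChars(0x1F600)); // needs a surrogate pair

        System.out.println(perMille.length());                       // chars used
        System.out.println(emoji.length());                          // chars used
        System.out.println(emoji.codePointCount(0, emoji.length())); // code points
        System.out.println(escapeAll(perMille));
        System.out.println(escapeAll(emoji));
    }
}
```

Running `main` should print 1, 2 and 1, followed by `\u2030` and `\uD83D\uDE00`: one code point, but two chars and therefore two escapes.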

When reading from a PropertyResourceBundle, every \uXXXX will be automatically decoded.
You could have the build convert a UTF-8 template file into u-escaped .properties, for instance in Maven.
Sometimes a ListResourceBundle is a better fit. It holds its strings in an array in Java source, and all Java sources could be set to UTF-8 for an international project. Its behavior is different: all strings are loaded immediately.
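
As a rough illustration of the automatic decoding on read, the sketch below feeds an escaped line to a `PropertyResourceBundle`. The class name `EscapedBundleDemo`, the helper `read` and the key `ratio` are assumptions for the example, and a `StringReader` stands in for the real .properties file:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;

class EscapedBundleDemo {
    // Hypothetical helper: parses .properties content and returns the value for a key.
    static String read(String fileContent, String key) {
        try {
            PropertyResourceBundle bundle =
                    new PropertyResourceBundle(new StringReader(fileContent));
            return bundle.getString(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The file content holds six plain ASCII characters for the escape.
        String content = "ratio=12\\u2030\n";
        String value = read(content, "ratio");

        System.out.println(value.length());                 // 3: '1', '2' and the decoded char
        System.out.println((int) value.charAt(2) == 0x2030); // true
    }
}
```

The file itself stays pure ASCII, which is why this works regardless of whether the reader assumes ISO-8859-1 or UTF-8.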

However, you evidently also want to write to .properties in code, so the file is not on the class path.

For that, the Properties class is ideal. It has an XML variant (instead of key-value lines), which uses UTF-8 by default. Traditional .properties files can also be read and written in another encoding such as UTF-8, by passing a Reader or Writer.
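
A small sketch of both routes follows, with in-memory streams standing in for real files. The class `PropertiesStoreDemo`, its two helpers and the key `ratio` are invented for the example. `store(OutputStream, ...)` should produce the escaped form that an ISO-8859-1 reader can load, while `store(Writer, ...)` keeps the characters as-is in whatever encoding the writer uses:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesStoreDemo {
    // Stores via an OutputStream: ISO-8859-1 output with non-ASCII characters escaped.
    static String storeEscaped(Properties props) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            props.store(out, null);
            return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stores and reloads via a UTF-8 Writer/Reader: no escaping involved.
    static Properties roundTripUtf8(Properties props) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                props.store(writer, null);
            }
            Properties loaded = new Properties();
            try (Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(out.toByteArray()), StandardCharsets.UTF_8)) {
                loaded.load(reader);
            }
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("ratio", "12\u2030");
        System.out.println(storeEscaped(props));  // a date comment line, then the escaped entry
        System.out.println(roundTripUtf8(props).getProperty("ratio").length()); // 3
    }
}
```

If the reading side is fixed to ISO-8859-1, the `OutputStream` route is probably the one to use, since it does the escaping without a hand-written loop.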

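StringBuilder utfRepresentation = new StringBuilder();
for (int index = 0; index < input.length(); index++) {
    char ch = input.charAt(index);
    if (ch < 128) {
        // Plain ASCII is copied unchanged.
        utfRepresentation.append(ch);
    } else {
        // Everything else becomes a four-digit escape; the int cast is
        // required because %X does not accept a char argument.
        utfRepresentation.append(String.format("\\u%04X", (int) ch));
    }
}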

edited Mar 11 '19 at 09:20

answered Mar 11 '19 at 09:06

Joop Eggen


I hope you understood that the `Properties` class is the same as used by the `PropertyResourceBundle` to load the `.properties` file. When you invoke the [`Properties.store(OutputStream,String)`](https://docs.oracle.com/javase/10/docs/api/java/util/Properties.html#store(java.io.OutputStream,java.lang.String)) method, you’re writing a `.properties` file with the non-latin characters encoded, suitable to be loaded on the other side, whether using `PropertyResourceBundle` or `Properties` directly. For a manual loop, `ch < 256` would be more natural, as we’re talking about iso-latin-1, not ascii. – Holger Mar 11 '19 at 10:29
@Holger ***1.*** Indeed there is even commented-out javadoc of XML usage (XMLResourceBundle) in ResourceBundle. Of course in the question one would like to write to properties, which is not part of ResourceBundle. ***2.*** Thanks, `ch < 256` would be fine and more consistent, though for non-ISO-8859-1 IDE locales editing must still be done in ISO-8859-1 (Latin-1), i.e. not in Windows-1252 (Windows Latin-1) which differs in U+0080 to U+009F. My personal opinion only: either use extended/XML properties with UTF-8 (my preference), or use ASCII. ISO-8859-1 does not even work for French: `œ`. – Joop Eggen Mar 12 '19 at 08:30
You could say it isn’t even enough for American English if you consider punctuation like “quotes”, unless you restrict yourself to typewriter style. Anyway, the simplest solution is to switch to Java 9 or newer, where you can just write your `.properties` file using UTF-8 and `PropertyResourceBundle` will load it correctly. – Holger Mar 12 '19 at 08:37
