Is it safe to temporarily store UTF-8 strings as ISO-8859-1 in Java?

Question

I have a properties file that is encoded as UTF-8 called theProperties.properties:

property1=Some Chinese Characters: 会意字會意字
property2=More chinese Char - 假借
property3=<any other valid UTF-8 characters>

I use a resource bundle to pull in the localized strings:

ResourceBundle localizedStrings = ResourceBundle.getBundle(
    "theProperties", // base name only; getBundle adds the .properties suffix itself
    locale
);

ResourceBundle assumes that all strings are in ISO-8859-1, but my resource files are encoded as UTF-8, so I need to convert the strings to UTF-8.

Is it safe to wrap the resource bundle and pull strings out of it like this:

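public String getLocalizedString(String key) {
    // The bundle decoded the UTF-8 file as ISO-8859-1, so each char holds one raw byte.
    String localizedString_ISO_8859_1 = localizedStrings.getString(key);
    // Recover the original bytes, then decode them as UTF-8.
    // StandardCharsets (Java 7+) avoids the checked UnsupportedEncodingException.
    byte[] rawBytes = localizedString_ISO_8859_1.getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
    return new String(rawBytes, java.nio.charset.StandardCharsets.UTF_8);
}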

Are there any cases where this code is unsafe? It feels like it might be, but strings are immutable. Does that mean that the bytes underneath are also immutable?

There are other ways to do this, but this method is shorter, so if it is safe I would prefer to go with it.

There is an alternate way of solving this issue, but it is a bit longer, and from an ease-of-reading perspective I like the above better, since that solution only changes a single line in the Control class.

storing utf8 in 8859 is going to mangle the chars. there is no chinese support in 8859, so you're going to end up with garbage. — Marc B, Dec 02 '13 at 18:41
Interesting. For the above usecase, it works fine. I think that ISO-8859-1 can't render the characters correctly but it isn't changing the underlying byte array. Was wondering if I could find a counter example where it would change the underlying byte array. — sixtyfootersdude, Dec 02 '13 at 18:44
why do you say that ResourceBundle assumes all strings are ISO-8859-1? — jtahlborn, Dec 02 '13 at 18:45
You store bytes as bytes, the encoding is an interpretation. If you store a four-byte UTF-8 sequence in a string which is interpreted as ISO 8859-1 and print it, you will get four characters which look nothing like what you put there, but if you pull them back out into a context where something displays them as UTF-8, they're still the same four bytes. — tripleee, Dec 02 '13 at 18:46
@jtahlborn see: http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html — sixtyfootersdude, Dec 02 '13 at 18:51
@jtahlborn - Is there a straight forward way to do that in the build process? Are there unicode escapes for every UTF-8 char? -- Will look into it. — sixtyfootersdude, Dec 02 '13 at 19:15

Joop Eggen · Answer 1 · 2013-12-02T18:59:47.647

That should work, though it is utterly ugly: it bends the API enough to need a large explanatory comment.

It works as follows:

Every byte of the UTF-8 multi-byte sequence is taken as one char by Java.
Converting that string to ISO-8859-1 bytes turns every char back into the same byte.
Then interpreting those bytes as UTF-8 yields the correct string.
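The three steps above can be sketched in isolation, without a resource bundle. This is only an illustration: the class name Latin1RoundTrip and its helper methods are made up for the example, and the sample text is the first three characters from the question's property1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answer.
class Latin1RoundTrip {

    // Step 1: what an ISO-8859-1 reader produces from UTF-8 bytes (one char per byte).
    static String misreadAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Steps 2 and 3: turn every char back into its byte, then decode as UTF-8.
    static String repair(String misread) {
        byte[] originalBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4f1a\u610f\u5b57"; // 会意字, written as escapes to stay source-encoding neutral
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String misread = misreadAsLatin1(utf8);
        System.out.println(misread.length() == utf8.length);   // one char per byte
        System.out.println(repair(misread).equals(original));  // round trip restores the text
    }
}
```

The intermediate string looks like garbage when printed, but it carries exactly the original bytes, which is why the final decode succeeds.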

If you have a build infrastructure like Maven, there are plugins to convert the encoding when copying from the src to the build directory.

There are also .properties editors with WYSIWYG editing.

The cleanest option might be to write your own ListResourceBundle subclass or similar, simply not (ab)using .properties. See the JRE for example usage.
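A minimal sketch of the ListResourceBundle idea, assuming a Chinese bundle for the question's two sample keys. The class name TheProperties_zh is hypothetical, chosen to follow the usual baseName_locale naming convention; the strings live in Java source, so no .properties decoding is involved at all.

```java
import java.util.ListResourceBundle;

// Hypothetical bundle class. For lookup through ResourceBundle.getBundle it
// would normally be declared public in its own source file.
class TheProperties_zh extends ListResourceBundle {
    @Override
    protected Object[][] getContents() {
        // Each row is a {key, value} pair; the values are ordinary Java strings.
        return new Object[][] {
            {"property1", "Some Chinese Characters: \u4f1a\u610f\u5b57\u6703\u610f\u5b57"},
            {"property2", "More chinese Char - \u5047\u501f"},
        };
    }
}
```

The unicode escapes are used here only to keep the example independent of the compiler's source encoding; with a UTF-8 source file the characters could be typed directly.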

score 0 · Answer 2 · answered Dec 02 '13 at 18:51

It should work the way you do it, here is why:

When Java reads and interprets the bytes of the properties file, it simply uses the unsigned byte values as char values. This works because the 256 byte values of ISO-8859-1 coincide with the first 256 Unicode code points, and since Strings are stored internally as UTF-16, each of those code points fits in a single char with no surrogate pairs or other complications. Hence translating to and from bytes while pretending the data is ISO-8859-1 works without loss.
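The reading behaviour described here can be observed with java.util.Properties directly, whose load(InputStream) method is documented to decode ISO-8859-1. The sketch below feeds it UTF-8 bytes from memory; the class PropertiesLatin1Demo and its method names are illustrative assumptions, not code from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class.
class PropertiesLatin1Demo {

    // Loads properties from raw file bytes the way the InputStream overload does:
    // every byte becomes one char in the range U+0000..U+00FF.
    static String loadRaw(byte[] fileBytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    // Undoes the ISO-8859-1 decoding and decodes the same bytes as UTF-8.
    static String repair(String raw) {
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = "property2=More chinese Char - \u5047\u501f\n".getBytes(StandardCharsets.UTF_8);
        String raw = loadRaw(file, "property2");
        System.out.println(raw.length());                  // longer than the real text: one char per byte
        System.out.println(repair(raw).endsWith("\u5047\u501f"));
    }
}
```

One caveat worth checking against your runtime: since Java 9, ResourceBundle reads .properties files as UTF-8 by default, so on those versions getString already returns the correct text and re-decoding it this way would damage it. Properties.load(InputStream) itself still uses ISO-8859-1.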

score 0 · Answer 3 · answered Dec 02 '13 at 19:26

This is fine, because ISO-8859-1 has a one-to-one mapping between bytes and its character set.

Any time you need a byte[] but are forced to use a String, you should use ISO-8859-1 as the mapping, which is the fastest since it is essentially the identity mapping.
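As a rough check of the identity-mapping claim, the following sketch packs all 256 byte values into a String and unpacks them again. The class BytesInString and its methods are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class.
class BytesInString {

    // Stores arbitrary bytes in a String, one char (U+0000..U+00FF) per byte.
    static String pack(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Recovers the bytes; only lossless if every char is <= U+00FF.
    static byte[] unpack(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Builds an array holding every possible byte value exactly once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(unpack(pack(all)), all)); // every byte survives

        // The reverse direction is not safe: a char above U+00FF has no
        // ISO-8859-1 byte and is replaced by '?'.
        System.out.println((char) unpack("\u4f1a")[0]);
    }
}
```

The second print shows where the trick can go wrong: it is safe for bytes stored in a String, but a String that already contains characters above U+00FF (for example from a \uXXXX escape in the properties file) cannot be converted to ISO-8859-1 bytes without loss.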

ZhongYu

So every byte will map cleanly to a char in ISO-8859-1 and will map to the same byte when converted back? – sixtyfootersdude Dec 02 '13 at 19:52
yes. conversions between byte and char: `b=(byte)c` and `c=(char)(b&0xff)` – ZhongYu Dec 02 '13 at 19:58