As far as I understand, you are trying to store a string (s1) that contains non-Latin-1 characters into a DB that only supports ISO-8859-1.
First, I agree with the others: it is a dirty idea.
Note that CP1252 is close to ISO-8859-1 (1 byte per character) and, unlike it, does include ™.
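To make that difference concrete, here is a small sketch (the class name is mine) showing that CP1252 encodes ™ as a single byte, while ISO-8859-1 has no mapping for it and getBytes falls back to '?':

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    public static void main(String[] args) {
        String tm = "\u2122"; // ™ (TRADE MARK SIGN)

        // CP1252 maps ™ to the single byte 0x99 (153)
        byte[] cp1252 = tm.getBytes(Charset.forName("windows-1252"));
        System.out.println(cp1252.length + " byte: " + (cp1252[0] & 0xFF)); // 1 byte: 153

        // ISO-8859-1 cannot represent ™, so getBytes substitutes '?'
        byte[] latin1 = tm.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println((char) latin1[0]); // ?
    }
}
```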
Now, to answer your question: I think you did it the wrong way around.
You want to wrap UTF-8 bytes in an ISO-8859-1 string:
String s2 = new String(s1.getBytes("UTF-8"), "ISO-8859-1");
This way, s2
is a character String that, once encoded in ISO-8859-1, will return a byte array that holds the valid UTF-8 bytes of s1.
To retrieve the original string, you would do
String s1 = new String(s2.getBytes("ISO-8859-1"),"UTF-8");
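Putting both lines together, here is a self-contained round-trip sketch (the class name and sample string are mine, assuming the trick works as described):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String s1 = "caf\u00e9 \u2122"; // contains ™, which ISO-8859-1 cannot represent

        // Wrap the UTF-8 bytes of s1 in a String, one ISO-8859-1 char per byte
        String s2 = new String(s1.getBytes(StandardCharsets.UTF_8),
                               StandardCharsets.ISO_8859_1);

        // Reverse: take the bytes back out and decode them as UTF-8
        String restored = new String(s2.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);

        System.out.println(s1.equals(restored)); // true
    }
}
```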
BUT WAIT! When doing this, you are hoping that every byte can be decoded with ISO-8859-1, and that your DB will accept such data, etc.
In fact, this is not guaranteed, because officially ISO-8859-1 does not assign characters to every byte value: the range 0x80 to 0x9F, for instance, is undefined.
So,
byte[] b = { -97, -100, -128 }; // 0x9F, 0x9C, 0x80
System.out.println( new String(b,"ISO-8859-1") );
would display unprintable control characters rather than real letters.
However, in Java, s.getBytes("ISO-8859-1")
does restore the initial array, because Java's ISO-8859-1 codec simply maps each byte 0xNN to the char U+00NN and back, for all 256 values.
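You can check that lossless mapping for every byte value at once (the class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        // Every possible byte value, including the "undefined" 0x80-0x9F range
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }

        // Java decodes byte 0xNN to char U+00NN, so nothing is lost...
        String s = new String(all, StandardCharsets.ISO_8859_1);

        // ...and encoding restores the exact original bytes
        System.out.println(Arrays.equals(all, s.getBytes(StandardCharsets.ISO_8859_1))); // true
    }
}
```

This is why the trick happens to work in Java, even though it relies on codec behavior rather than on what the ISO-8859-1 standard guarantees.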