I am trying to convert a string encoded in UTF-8 to ISO-8859-1 in Java. For example, in the string 'âabcd', 'â' is represented in ISO-8859-1 as E2. In UTF-8 it is represented as two bytes, C3 A2 I believe. When I do a getBytes(encoding) and then create a new string from the bytes in ISO-8859-1 encoding, I get two different chars: Ã¢. Is there any other way to do this so as to keep the character the same, i.e. âabcd?
8 Answers
If you're dealing with character encodings other than UTF-16, you shouldn't be using java.lang.String or the char primitive -- you should only be using byte[] arrays or ByteBuffer objects. Then, you can use java.nio.charset.Charset to convert between encodings:
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte)0xC3, (byte)0xA2});
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-1
ByteBuffer outputBuffer = iso88591charset.encode(data);
// caveat: array() returns the whole backing array, which may be longer than the encoded data
byte[] outputData = outputBuffer.array();

-
Good point, although I would suggest that use of buffers may not always be the most convenient way. Basic `InputStream` and `OutputStream` (with appropriate wrapping Readers, Writers) are sometimes more useful, and do not require the whole content to be kept in memory. But which is more convenient depends on use case of course. – StaxMan Jun 04 '15 at 21:04
byte[] iso88591Data = theString.getBytes("ISO-8859-1");
Will do the trick. From your description it seems as if you're trying to "store an ISO-8859-1 String". String objects in Java are always implicitly encoded in UTF-16. There's no way to change that encoding.
What you can do, though, is get the bytes that constitute some other encoding of it (using the .getBytes() method as shown above).
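As a small illustration of the point above, here is a sketch (the class and helper names are made up for this example) that re-encodes one String into two different byte sequences. It also shows what getBytes does with a character that ISO-8859-1 cannot represent: String.getBytes(Charset) substitutes the charset's replacement byte, which for ISO-8859-1 is '?' (0x3F).

```java
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Hypothetical helper: formats bytes as lowercase hex separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u00E2" is 'â'; the escape keeps the example independent of the source file's encoding.
        String text = "\u00E2abcd";
        System.out.println(hex(text.getBytes(StandardCharsets.ISO_8859_1))); // e2 61 62 63 64
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_8)));      // c3 a2 61 62 63 64

        // The euro sign has no ISO-8859-1 code point, so it becomes '?'.
        String euro = "\u20AC";
        System.out.println(hex(euro.getBytes(StandardCharsets.ISO_8859_1))); // 3f
    }
}
```

The String itself never changes; only the byte arrays derived from it differ.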

Starting with a set of bytes which encode a string using UTF-8, create a string from that data, then get the bytes that encode the string in a different encoding:
byte[] utf8bytes = { (byte)0xc3, (byte)0xa2, 0x61, 0x62, 0x63, 0x64 };
Charset utf8charset = Charset.forName("UTF-8");
Charset iso88591charset = Charset.forName("ISO-8859-1");
String string = new String ( utf8bytes, utf8charset );
System.out.println(string);
// "When I do a getbytes(encoding) and "
byte[] iso88591bytes = string.getBytes(iso88591charset);
for ( byte b : iso88591bytes )
    System.out.printf("%02x ", b);
System.out.println();
// "then create a new string with the bytes in ISO-8859-1 encoding"
String string2 = new String ( iso88591bytes, iso88591charset );
// "I get a two different chars"
System.out.println(string2);
This outputs the strings and the ISO-8859-1 bytes correctly:
âabcd
e2 61 62 63 64
âabcd
So your byte array wasn't paired with the correct encoding:
String failString = new String ( utf8bytes, iso88591charset );
System.out.println(failString);
Outputs
Ã¢abcd
(either that, or you just wrote the UTF-8 bytes to a file and read them elsewhere as ISO-8859-1)
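The mismatch described above can be reproduced and, as long as the wrongly decoded string has not been modified in between, undone. This is only a sketch with made-up class and method names: because ISO-8859-1 maps every byte value to exactly one character, encoding the garbled string back to ISO-8859-1 recovers the original UTF-8 bytes, which can then be decoded with the right charset.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes UTF-8 bytes with the wrong charset, producing the garbled text.
    static String misdecode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mistake: get the original bytes back, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = { (byte) 0xC3, (byte) 0xA2, 0x61, 0x62, 0x63, 0x64 };
        String garbled = misdecode(utf8Bytes);

        // Six chars instead of five: 0xC3 and 0xA2 each became a separate character.
        System.out.println(garbled.length());                      // 6
        System.out.println(repair(garbled).equals("\u00E2abcd"));  // true
    }
}
```

If the garbled text has already been saved or transformed by something that dropped or replaced characters, this round trip may no longer work.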

This is what I needed:
public static byte[] encode(byte[] arr, String fromCharsetName) {
return encode(arr, Charset.forName(fromCharsetName), Charset.forName("UTF-8"));
}
public static byte[] encode(byte[] arr, String fromCharsetName, String targetCharsetName) {
return encode(arr, Charset.forName(fromCharsetName), Charset.forName(targetCharsetName));
}
public static byte[] encode(byte[] arr, Charset sourceCharset, Charset targetCharset) {
ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
CharBuffer data = sourceCharset.decode(inputBuffer);
ByteBuffer outputBuffer = targetCharset.encode(data);
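// copy only the encoded bytes; array() may return a larger backing array
byte[] outputData = new byte[outputBuffer.remaining()];
outputBuffer.get(outputData);
return outputData;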
}

If you have the correct encoding in the string, you need not do more to get the bytes for another encoding.
public static void main(String[] args) throws Exception {
printBytes("â");
System.out.println(
new String(new byte[] { (byte) 0xE2 }, "ISO-8859-1"));
System.out.println(
new String(new byte[] { (byte) 0xC3, (byte) 0xA2 }, "UTF-8"));
}
private static void printBytes(String str) {
System.out.println("Bytes in " + str + " with ISO-8859-1");
for (byte b : str.getBytes(StandardCharsets.ISO_8859_1)) {
System.out.printf("%3X", b);
}
System.out.println();
System.out.println("Bytes in " + str + " with UTF-8");
for (byte b : str.getBytes(StandardCharsets.UTF_8)) {
System.out.printf("%3X", b);
}
System.out.println();
}
Output:
Bytes in â with ISO-8859-1
E2
Bytes in â with UTF-8
C3 A2
â
â

For converting the encoding of a file:
public class FRomUtf8ToIso {
static File input = new File("C:/Users/admin/Desktop/pippo.txt");
static File output = new File("C:/Users/admin/Desktop/ciccio.txt");
public static void main(String[] args) throws IOException {
BufferedReader br = null;
FileWriter fileWriter = new FileWriter(output);
try {
String sCurrentLine;
br = new BufferedReader(new FileReader( input ));
int i= 0;
while ((sCurrentLine = br.readLine()) != null) {
byte[] isoB = encode( sCurrentLine.getBytes() );
fileWriter.write(new String(isoB, Charset.forName("ISO-8859-15") ) );
fileWriter.write("\n");
System.out.println( i++ );
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
fileWriter.flush();
fileWriter.close();
if (br != null)br.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
static byte[] encode(byte[] arr){
Charset utf8charset = Charset.forName("UTF-8");
Charset iso885915charset = Charset.forName("ISO-8859-15");
ByteBuffer inputBuffer = ByteBuffer.wrap( arr );
// decode UTF-8
CharBuffer data = utf8charset.decode(inputBuffer);
// encode ISO-8859-15
ByteBuffer outputBuffer = iso885915charset.encode(data);
// copy only the encoded bytes; array() may return a larger backing array
byte[] outputData = new byte[outputBuffer.remaining()];
outputBuffer.get(outputData);
return outputData;
}
}
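Note that FileReader(File), FileWriter(File), and the no-argument getBytes() all use the platform default charset, so the result of the code above depends on the machine it runs on. Below is a minimal alternative sketch, not this answer's method, that names both charsets explicitly and streams the text instead of converting line by line. The class name, method name, and command-line paths are placeholders.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FileTranscoder {
    // Reads `input` as `from` and writes the same text to `output` as `to`.
    // As far as I know, the reader and writer created by Files report malformed
    // input and unencodable characters as an IOException rather than writing '?'.
    static void convert(Path input, Charset from, Path output, Charset to) throws IOException {
        try (Reader reader = Files.newBufferedReader(input, from);
             Writer writer = Files.newBufferedWriter(output, to)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder usage: java FileTranscoder <utf8-input> <latin1-output>
        convert(Paths.get(args[0]), StandardCharsets.UTF_8,
                Paths.get(args[1]), StandardCharsets.ISO_8859_1);
    }
}
```

Copying characters rather than lines also preserves the original line endings, which readLine() discards.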

In addition to Adam Rosenfield's answer, I would like to add that ByteBuffer.array() returns the buffer's underlying byte array, which is not necessarily "trimmed" to the last encoded byte. Extra manipulation is needed, such as the approaches mentioned in this answer; in particular:
byte[] b = new byte[bb.remaining()];
bb.get(b);
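To make the difference concrete, here is a small sketch (class and method names are hypothetical) that converts between two charsets and copies only the bytes between the buffer's position and limit. How much spare capacity the encoder allocates is an implementation detail, so the example prints the backing array length without assuming a particular value.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TrimmedEncode {
    // Copies exactly the remaining bytes; duplicate() leaves the caller's buffer position untouched.
    static byte[] toByteArray(ByteBuffer buffer) {
        byte[] result = new byte[buffer.remaining()];
        buffer.duplicate().get(result);
        return result;
    }

    // Decodes `input` from `source`, then encodes it to `target`, returning a trimmed array.
    static byte[] convert(byte[] input, Charset source, Charset target) {
        CharBuffer chars = source.decode(ByteBuffer.wrap(input));
        return toByteArray(target.encode(chars));
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE2, 0x61, 0x62, 0x63, 0x64 }; // "âabcd" in ISO-8859-1
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(
                StandardCharsets.ISO_8859_1.decode(ByteBuffer.wrap(latin1)));

        if (encoded.hasArray()) {
            // May be larger than the encoded data, depending on the JDK.
            System.out.println("backing array length: " + encoded.array().length);
        }
        System.out.println("encoded length: " + encoded.remaining());                // 6
        System.out.println("trimmed copy length: " + toByteArray(encoded).length);   // 6
    }
}
```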

To replace characters that ISO-8859-1 cannot represent with '?' (for example, before sending the string to an ISO-8859-1 database):
utf8String = new String ( utf8String.getBytes("ISO-8859-1"), "ISO-8859-1" );

-
Replacing all the non-ASCII characters with `?` seems like a terrible solution when it's possible to convert the string without losing them. – s4y Mar 25 '11 at 16:30
-
@s4y you are right, that it seems like a terrible solution, but think about ASCII. You simply cannot have umlauts in ASCII. You will have to do _something_ with the characters that cannot be encoded. For the problem at hand, this is the simplest and a correct solution. One might consider using StandardCharsets.ISO_8859_1. – fahrradfahrer Jul 07 '20 at 11:30
-
@fahrradfahrer for what it's worth, if I were writing that comment today, I wouldn't have used the word "terrible"! But for that case, I'd probably go with something like https://stackoverflow.com/a/14121678/84745, which essentially gives you an approximation of the string in ASCII. – s4y Jul 08 '20 at 04:13
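
One way to decide what to do with characters that cannot be encoded, as discussed in the comments above, is to detect them before converting. The sketch below (hypothetical class and method names) uses CharsetEncoder.canEncode for a quick check, plus an encoder configured to report errors so that it throws instead of silently substituting '?'.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    // True if every character of `text` has an ISO-8859-1 representation.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Encodes `text` as ISO-8859-1, throwing rather than replacing unencodable characters.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        ByteBuffer encoded = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("\u00E2abcd")); // true
        System.out.println(fitsLatin1("\u20AC"));     // false: the euro sign is not in ISO-8859-1
        try {
            encodeStrict("\u20AC");
        } catch (CharacterCodingException e) {
            System.out.println("cannot encode");
        }
    }
}
```

With the check in place, the caller can choose between rejecting the input, substituting '?', or producing an approximation, rather than having that choice made implicitly.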