Java convert Windows-1252 to UTF-8, some letters are wrong

Question

I receive data from an external Microsoft SQL Server 2008 database (I run the queries with MyBatis). The data is encoded as "Windows-1252".

I have tried to re-encode to UTF-8:

String textoFormado = ...value from MyBatis... ; 
String s = new String(textoFormado.getBytes("Windows-1252"), "UTF-8");

Almost the whole string is correctly decoded, but some letters with accents are not.

For example:

I received this: Ã�vila
The code above makes: �?vila
I expected: Ávila
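The symptom can be reproduced without a database. The following is a minimal sketch, assuming the stored bytes were really UTF-8 and that some layer decoded them as Windows-1252 before the String reached the application. The class and method names are hypothetical and do not come from the post. It suggests why most accented letters survive the round trip while "Á" does not: the second UTF-8 byte of "Á" is 0x81, which Windows-1252 does not define, so Java's decoder replaces it with U+FFFD and the original byte is lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the original post.
class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates a layer that wrongly decodes UTF-8 bytes as Windows-1252.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // The repair attempted in the question:
    // re-encode as Windows-1252, then decode those bytes as UTF-8.
    static String attemptRepair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the demo independent of the source file encoding.
        String upperA = "\u00C1vila"; // "Ávila": UTF-8 bytes C3 81 ...
        String lowerE = "\u00E9";     // "é":     UTF-8 bytes C3 A9

        // false: byte 0x81 became U+FFFD, which cannot be encoded back to 0x81.
        System.out.println(attemptRepair(misdecode(upperA)).equals(upperA));

        // true: both C3 and A9 are defined in Windows-1252, so nothing was lost.
        System.out.println(attemptRepair(misdecode(lowerE)).equals(lowerE));
    }
}
```

Once a byte has been replaced during decoding, no later re-encoding of the String can recover it, so the fix has to happen where the bytes are first read.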

Break your line into two statements, so you can get a look at the intermediate string. That will help you see what the source of the problem might be. — rossum, Apr 15 '14 at 11:41
Thanks. But I tried `String s = new String(mistring.getBytes("Windows-1252"));` and the result is the same. — Ramon, Apr 15 '14 at 13:16
Your variable _textoFormado_ is already a String that you can simply use in your program. Why do you think you must encode and decode it again? — Seelenvirtuose, Apr 15 '14 at 13:44
Because this String contains the text "Ã�vila" (as received from the database through MyBatis) and I need "Ávila". — Ramon, Apr 15 '14 at 14:09
How are you retrieving the String from MyBatis? That is where you need to deal with a charset conversion from Windows-1252 to UTF-16 (Java's native String encoding). Even if you have to use `getBytes()`, you should be specifying `Windows-1252` instead of `UTF-8` in the `String` constructor since you are not dealing with UTF-8 bytes at all. — Remy Lebeau, Apr 16 '14 at 18:55

Seelenvirtuose · Answer 1 · 2014-04-15T11:50:18.167

11

Obviously, textoFormado is a variable of type String. This means that the bytes have already been decoded; internally, Java represents strings as UTF-16. What you did was encode your string as Windows-1252 and then read the resulting bytes as UTF-8. That does not work.

What you need is the correct encoding when reading the bytes:

byte[] sourceBytes = getRawBytes();
String data = new String(sourceBytes, "Windows-1252");

To use this string inside your program, you do not need to do anything else. Simply use it. If, however, you want to write the data back out (to a file, for example), you need to encode it again:

byte[] destinationBytes = data.getBytes("UTF-8");
// write bytes to destination file here
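The two steps above (decode with the source charset, encode with the target charset) can be put together as follows. This is only a sketch under the assumption that the raw bytes really are Windows-1252; `TranscodeSketch` and `cp1252ToUtf8` are illustrative names, not part of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; the names are illustrative only.
class TranscodeSketch {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode Windows-1252 bytes into a String, then encode that String as UTF-8.
    static byte[] cp1252ToUtf8(byte[] sourceBytes) {
        String data = new String(sourceBytes, CP1252);
        return data.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Ávila" in Windows-1252: 0xC1 is the single byte for "Á".
        byte[] source = {(byte) 0xC1, 'v', 'i', 'l', 'a'};
        byte[] utf8 = cp1252ToUtf8(source);

        // In UTF-8 the "Á" takes two bytes, so the array grows by one.
        System.out.println(Arrays.toString(utf8));
    }
}
```

Passing a `Charset` object instead of a charset name avoids the checked `UnsupportedEncodingException` that the String-based overloads declare.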

edited Apr 15 '14 at 11:50

answered Apr 15 '14 at 11:44

Seelenvirtuose


Thanks for answering; it makes sense and it gives me some ideas. But I use MyBatis to run the queries against the database, and it returns the text to me as a String. I tried the following code to convert it back, but it is still not decoded correctly: `byte[] textBytes = textoFormado.getBytes("UTF-8"); String value = new String(textBytes, "Windows-1252");` – Ramon Apr 15 '14 at 13:09
3

Use `textoFormado.getBytes("Windows-1252")` instead. Forget about UTF-8, it does not apply in this situation, and you are not using it correctly anyway. – Remy Lebeau Apr 16 '14 at 18:57

score 1 · Accepted Answer · answered Apr 21 '14 at 09:53

I solved it, thanks to everyone.

I have the following project structure:

MyBatisQueries: contains a query with a "select" that returns the String
Pojo: stores the String (this is where I received the String with the conversion problems)
The class that uses the query and the Pojo object holding the data (this is where the text showed up badly decoded)

At first I had this (MyBatis and Spring inject the dependencies and parameters):

public class Pojo {
    private String params;

    // MyBatis passes the column value as an already decoded String.
    public void setParams(String params) {
        this.params = params;
    }
}

The solution:

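import java.io.UnsupportedEncodingException;

public class Pojo {
    private String params;

    // Accept the raw column bytes and decode them explicitly as UTF-8.
    public void setParams(byte[] params) {
        try {
            this.params = new String(params, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            this.params = null;
        }
    }
}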
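On Java 7 and later, the same idea can be written without the checked exception by passing a `Charset` constant. This is a sketch rather than the code from the answer: the class name `PojoSketch`, the getter and the null check are additions made for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical variant of the Pojo above; the getter and null check are additions.
class PojoSketch {
    private String params;

    // Decode the raw column bytes as UTF-8; a null column stays null.
    public void setParams(byte[] params) {
        this.params = (params == null)
                ? null
                : new String(params, StandardCharsets.UTF_8);
    }

    public String getParams() {
        return params;
    }

    public static void main(String[] args) {
        PojoSketch pojo = new PojoSketch();
        // UTF-8 bytes of "Ávila": C3 81 encodes "Á".
        pojo.setParams(new byte[] {(byte) 0xC3, (byte) 0x81, 'v', 'i', 'l', 'a'});
        System.out.println(pojo.getParams().equals("\u00C1vila")); // true
    }
}
```

Because `StandardCharsets.UTF_8` is guaranteed to exist on every Java platform, there is no failure path to handle and the field is never silently set to null by a catch block.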

score 1 · Answer 3 · answered Jul 14 '21 at 18:02

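Why not tackle the issue at a lower level, by reading the String in the proper encoding from your database?

Some JDBC drivers accept an encoding property such as characterEncoding in the connection string or URI.

So in your Microsoft SQL Server case you could try, for example, jdbc:sqlserver://localhost:52865;databaseName=myDb?characterEncoding=utf8. Check your driver's documentation first: property names and URL syntax differ between drivers, and the Microsoft driver separates its properties with semicolons.

If the driver honors the property, each String column is read in the specified encoding, with no need to convert it manually afterwards.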
