I receive data from an external Microsoft SQL Server 2008 database (I run the queries with MyBatis). The data is encoded as "Windows-1252".

I have tried to re-encode to UTF-8:

String textoFormado = ...value from MyBatis... ; 
String s = new String(textoFormado.getBytes("Windows-1252"), "UTF-8");

Almost the whole string is correctly decoded, but some letters with accents are not.

For example:

  1. I received this: Ã�vila
  2. The code above makes: �?vila
  3. I expected: Ávila
ifly6
Ramon
  • Break your line into two statements, so you can get a look at the intermediate string. That will help you see what the source of the problem might be. – rossum Apr 15 '14 at 11:41
  • Thanks. But I tried String s = new String(mistring.getBytes("Windows-1252")); and the result is the same. – Ramon Apr 15 '14 at 13:16
  • Your variable _textoFormado_ is already a string that you can simply use in your program. Why do you think you must encode and decode it again? – Seelenvirtuose Apr 15 '14 at 13:44
  • Because this String has the text "Ã�vila" (it is received from the database through MyBatis) and I need "Ávila". – Ramon Apr 15 '14 at 14:09
  • How are you retrieving the String from MyBatis? That is where you need to deal with a charset conversion from Windows-1252 to UTF-16 (Java's native String encoding). Even if you have to use `getBytes()`, you should be specifying `Windows-1252` instead of `UTF-8` in the `String` constructor since you are not dealing with UTF-8 bytes at all. – Remy Lebeau Apr 16 '14 at 18:55

3 Answers

textoFormado is a variable of type String, which means the bytes have already been decoded; internally, Java represents strings as UTF-16. What your code does is encode the string as Windows-1252 and then read the resulting bytes as UTF-8. That cannot work.
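
To see why that round trip loses data, here is a minimal sketch that reproduces the symptom from the question. It assumes the database really delivered UTF-8 bytes that were decoded as Windows-1252 somewhere along the way; the class and method names are made up for illustration. The UTF-8 encoding of Á is the two bytes C3 81, and 0x81 has no character assigned in Windows-1252, so the decoder substitutes a replacement character. From that point on the original byte is gone, and no re-encoding can bring it back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original code.
class MojibakeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Simulates the faulty read: UTF-8 bytes decoded as Windows-1252.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // The attempted repair from the question.
    static String attemptRepair(String garbled) {
        return new String(garbled.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String avila = "\u00C1vila"; // "Ávila", written with an escape to be source-encoding safe
        String garbled = misdecode(avila);

        // Has a byte already been replaced during the faulty decode?
        System.out.println("replacement char present: " + (garbled.indexOf('\uFFFD') >= 0));
        // Does the repair from the question restore the original?
        System.out.println("repair succeeded: " + attemptRepair(garbled).equals(avila));
        // "é" (UTF-8 bytes C3 A9) only uses bytes that Windows-1252 defines.
        System.out.println("repair of e-acute succeeded: "
                + attemptRepair(misdecode("\u00E9")).equals("\u00E9"));
    }
}
```

This would also explain why most accented letters in the question came out fine while a few did not: only characters whose UTF-8 bytes fall on an unassigned Windows-1252 position are damaged.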

What you need is the correct encoding when reading the bytes:

byte[] sourceBytes = getRawBytes();
String data = new String(sourceBytes, "Windows-1252");

To use this string inside your program, you do not need to do anything else; simply use it. If, however, you want to write the data back out, for example to a file, you need to encode it again:

byte[] destinationBytes = data.getBytes("UTF-8");
// write bytes to destination file here
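
Put together, the decode-then-encode steps might look like the following sketch. The helper name `transcode`, the sample bytes, and the output file name are assumptions for illustration; the sample bytes stand in for whatever `getRawBytes()` returns.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch class; names are illustrative only.
class TranscodeSketch {
    // Decode the raw bytes with their real encoding, then encode for the destination.
    static byte[] transcode(byte[] sourceBytes) {
        String data = new String(sourceBytes, Charset.forName("windows-1252"));
        return data.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "Ávila" as Windows-1252 bytes: Á is the single byte 0xC1.
        byte[] sourceBytes = {(byte) 0xC1, 'v', 'i', 'l', 'a'};
        Path out = Paths.get("avila-utf8.txt"); // assumed destination file
        Files.write(out, transcode(sourceBytes));
    }
}
```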
Seelenvirtuose
  • Thanks for answering, it makes sense and gives me some ideas. But I use MyBatis to run queries against the database, and it returns the text as a String. I tried the following code to convert it back, but it is not decoded correctly: byte[] textBytes = textoFormado.getBytes("UTF-8"); String value = new String(textBytes , "Windows-1252"); – Ramon Apr 15 '14 at 13:09
  • Use `textoFormado.getBytes("Windows-1252")` instead. Forget about UTF-8, it does not apply in this situation, and you are not using it correctly anyway. – Remy Lebeau Apr 16 '14 at 18:57

I solved it, thanks to all.

I have the following project structure:

  • MyBatisQueries: a query with a "select" that returns the String
  • A Pojo that stores the String (this is where the String arrived with conversion problems)
  • The class that uses the query and the Pojo object with the data (this is where the text showed up badly decoded)

At first I had (MyBatis and Spring inject the dependencies and params):

public class Pojo {
    private String params;

    // Receives the value as an already-decoded String
    public void setParams(String params) {
        this.params = params;
    }
}

The solution:

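import java.io.UnsupportedEncodingException;

public class Pojo {
    private String params;

    // Receives the raw bytes and decodes them explicitly as UTF-8
    public void setParams(byte[] params) {
        try {
            this.params = new String(params, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            this.params = null;
        }
    }
}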
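
A possible simplification of the setter above, offered as a sketch rather than what the project actually used: passing a `Charset` constant instead of a charset name removes the checked exception, so the try/catch is no longer needed. The class name `PojoSketch`, the null check, and the `getParams` accessor are additions for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical variant of the Pojo above; not the original class.
class PojoSketch {
    private String params;

    // Receives the raw bytes and decodes them explicitly as UTF-8.
    public void setParams(byte[] params) {
        this.params = (params == null) ? null : new String(params, StandardCharsets.UTF_8);
    }

    // Accessor added only so the decoded value can be inspected.
    public String getParams() {
        return params;
    }
}
```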
Ramon

Why not tackle the issue at a lower level, by reading the String in the proper encoding from your database?

Several JDBC drivers accept an encoding property in the connection string or URI, for example characterEncoding. The exact property name and URL syntax depend on the driver, so check its documentation.

So in your Microsoft SQL Server case you could try, for example, jdbc:sqlserver://localhost:52865;databaseName=myDb?characterEncoding=utf8.

If the driver honors the property, each String column is read in the specified encoding and there is no need to convert it manually afterwards.

hc_dev