
Related to this question: "Fix" String encoding in Java

My project encoding is UTF-8.

I need to make a query to a DB that uses a particular varchar encoding (apparently EUC-KR).

I take the input as UTF-8, and I want to make the DB query with the EUC-KR encoded version of that string.

First of all, I can select and display the encoded strings using the following:

ResultSet rs = stmt.executeQuery("SELECT name FROM mytable");
while(rs.next())
    System.out.println(new String(rs.getBytes(1), "EUC-KR"));

I want to do something like:

PreparedStatement ps = conn.prepareStatement("SELECT * FROM MYTABLE WHERE NAME=?");
ps.setString(1,input);
ResultSet rs = ps.executeQuery();

Which obviously won't work, because my Java program is not using the same encoding as the DB. So, I've tried replacing the middle line with each of the following, to no avail:

ps.setString(1,new String(input.getBytes("EUC-KR")));
ps.setString(1,new String(input.getBytes("EUC-KR"), "EUC-KR"));
ps.setString(1,new String(input.getBytes("UTF-8"), "EUC-KR"));
ps.setString(1,new String(input.getBytes("EUC-KR"), "UTF-8"));
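For reference, the second of these is an exact no-op: encoding to EUC-KR and decoding the same bytes straight back reproduces an equal String for any text that EUC-KR can represent. A minimal check (the sample input is an assumption for illustration):

```java
import java.nio.charset.Charset;

public class EucKrRoundTrip {
    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        String input = "고산"; // hypothetical sample input

        // Encode to EUC-KR bytes, then decode those same bytes as EUC-KR again.
        String roundTripped = new String(input.getBytes(eucKr), eucKr);

        // The round trip is the identity, so setString sees the same characters.
        System.out.println(input.equals(roundTripped)); // true
    }
}
```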

I am using Oracle 10g (10.1.0).

More details of my attempts follow:

What does seem to work is saving the name from the first query into a string without any other manipulation, and passing that back as a parameter. It matches itself.

That is,

ResultSet rs = stmt.executeQuery("SELECT name FROM mytable");
rs.next();
String myString = rs.getString(1);
PreparedStatement ps = conn.prepareStatement("SELECT * FROM mytable WHERE name=?");
ps.setString(1, myString);
rs = ps.executeQuery();

... will return the single correct entry in rs. Great, so now I just need to convert my input to whatever format that thing seems to be in.

However, nothing I have tried seems to match the "correct" string when I try reading their bytes using

byte[] mybytearray = myString.getBytes();
for(byte b : mybytearray)
    System.out.print(b+" ");

In other words, I can turn °í»ê into 고산 but I can't seem to turn 고산 into °í»ê.
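Both directions are reproducible outside the database, assuming the garbled form comes from the stored EUC-KR bytes being decoded as ISO-8859-1 somewhere along the way (an assumption; the actual intermediate charset depends on the driver and database settings). Both strings are just re-decodings of the same four bytes:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");

        // "고산" in EUC-KR is the four bytes B0 ED BB EA.
        byte[] raw = "고산".getBytes(eucKr);

        // Misread those bytes as ISO-8859-1 and you get the garbled form.
        String garbled = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // °í»ê

        // Reversing it: re-encode the garbled string as ISO-8859-1
        // (recovering the raw bytes), then decode those bytes as EUC-KR.
        String fixed = new String(garbled.getBytes(StandardCharsets.ISO_8859_1), eucKr);
        System.out.println(fixed); // 고산
    }
}
```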

The byte array given by

rs.getBytes(1)

is different from the byte array given by any of the following:

rs.getString(1).getBytes()
rs.getString(1).getBytes("UTF8")
rs.getString(1).getBytes("EUC-KR")
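Comparing those arrays is easier in hex than in Java's signed-decimal default; a small helper sketch:

```java
import java.nio.charset.Charset;

public class HexDump {
    // Render a byte array as space-separated hex, e.g. "B0 ED BB EA".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            // Mask to an unsigned int so negative bytes print as two digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("고산".getBytes(Charset.forName("EUC-KR")))); // B0 ED BB EA
    }
}
```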

Unhappiness: it turns out that for my DB, NLS_CHARACTERSET = US7ASCII

Which means that what I'm trying to do is unsupported. Thanks for playing everyone :(


3 Answers


You can't accomplish anything with a String constructor. String is always UTF-16 inside. Converting UTF-16 chars to EUC-KR and back again won't help you.

Putting invalid Unicode into String values in the hopes that they will then be converted to EUC-KR is a really bad idea.
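A minimal illustration of the UTF-16 point: whatever bytes a String was built from, its chars are UTF-16 code units.

```java
public class Utf16Inside {
    public static void main(String[] args) {
        String s = "고산";
        // 고 is U+ACE0 and 산 is U+C0B0: pure UTF-16 code units,
        // with no trace of the EUC-KR byte values B0 ED BB EA.
        System.out.printf("U+%04X U+%04X%n", (int) s.charAt(0), (int) s.charAt(1));
    }
}
```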

What you are doing is supposed to 'just work'. The Oracle driver is supposed to talk to the server, find out the desired charset, and go from there.

What, however, is the database charset? If someone is storing EUC-KR without having set the charset to EUC-KR, you are more or less up a creek.

What you need to do is tell your JDBC driver what charset to use to communicate with the server. You haven't mentioned whether you are using Thin or OCI; the answer might be different.

Judging from http://download.oracle.com/docs/cd/E14072_01/appdev.112/e13995/oracle/jdbc/OracleDriver.html, you might want to try turning on defaultNChar.

In general, it's the job of the JDBC driver to transcode String to what the Oracle server wants. You may need tnsnames.ora options if you are using 'OCI'.

edit

OP reports that the nls_charset of the database is US7ASCII. That means that all JDBC drivers will think it is their job to convert Unicode String values to ASCII. Korean characters will be reduced to ? at best. Officially, then, you are up a creek.

There are some possible tricks to try. One is the very dangerous trick of

 new String(string.getBytes("EUC-KR"), "ascii")

that will try to make a string of Unicode chars that just so happens to have the values of EUC-KR in their low bytes. My belief is that this will corrupt data, but you could experiment.
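The corruption is easy to confirm: Java's decoder substitutes U+FFFD for malformed input, and every byte of an EUC-KR Hangul character is above 0x7F, so nothing survives an ASCII decode. A minimal check:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiTrickCheck {
    public static void main(String[] args) {
        byte[] eucKrBytes = "고산".getBytes(Charset.forName("EUC-KR"));

        // Every one of the four bytes is >= 0x80, so the US-ASCII decoder
        // replaces each with U+FFFD: the data is destroyed, not smuggled.
        String smuggled = new String(eucKrBytes, StandardCharsets.US_ASCII);
        System.out.println(smuggled.equals("\uFFFD\uFFFD\uFFFD\uFFFD")); // true
    }
}
```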

Or, perhaps, ps.setBytes(n, string.getBytes("EUC-KR")), but I myself do not know if Oracle defines the conversion of bytes to chars as a binary copy. It might. Or, perhaps, adding a stored proc that takes a blob as an argument.

Really, what's called for here is to repair the database to use an nls_charset of UTF-8 or EUC-KR, but that's a whole other job.

bmargulies
  • Okay, so doing conversion on the Java side won't help me? Right now, I'm using Thin. I can try using the OCI driver. –  May 02 '11 at 02:01
  • Creating the connection with the `java.util.Properties` configuration (with `"defaultNChar"` mapped to `true`) does not seem to change any behavior, in both reading results from and passing parameters to the DB. Is there some way of getting the driver to tell me what the correct encoding is, or telling the driver to use a particular encoding? –  May 02 '11 at 02:13
  • The important question is the database charset (not the nchar set). What is it? – bmargulies May 02 '11 at 02:15
  • Run "select value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';" and report what it says. If it doesn't say EUC-KR, then life is difficult. – bmargulies May 02 '11 at 02:15
  • Um, I guess I'm up a creek. `NLS_CHARACTERSET = US7ASCII` and `NLS_NCHAR_CHARACTERSET = AL16UTF16` –  May 02 '11 at 02:16
  • Someone has been cheating here. You've got a database with EUC-KR stuffed into US7ASCII. Yes, a creek is what you are up. Have you tried ps.setBytes(x, string.getBytes("EUC-KR"))? – bmargulies May 02 '11 at 02:18
  • Is there still a way to do this using OCI? –  May 02 '11 at 02:19
  • I don't suppose you could add a stored proc that accepted a blob of bytes and cheated them into the table for you? – bmargulies May 02 '11 at 02:19
  • @bmargulies yes, I have... it's just that I can't seem to create the correct set of bytes starting from a non-garbled string, regardless of what kind of `nio.charset.encoder` shenanigans I try... –  May 02 '11 at 02:20
  • Perhaps post another question about that? – bmargulies May 02 '11 at 10:45
  • Yes, yes, I have... and the problem's been solved. Thank you all :D (I'll give you your bounty later.. still has 30 minutes left before I can) –  May 03 '11 at 01:05

Have you looked at the correct name for the charset? Maybe you should be using UTF8 and EUC_KR:

http://download.oracle.com/javase/1.4.2/docs/guide/intl/encoding.doc.html

Kal
  • I've been able to correctly decode EUC-KR characters for use from another query using the charset identifier "EUC-KR". Now that I try it, the name "EUC_KR" gives the same results. Nice guess, though... –  Apr 28 '11 at 01:32
  • So you tried ps.setString(1, new String(input.getBytes("UTF8") , "EUC_KR")) ? – Kal Apr 28 '11 at 01:39
  • That's an awful hack that is likely to corrupt data. – bmargulies May 02 '11 at 01:59

Hopefully this is not a stupid answer, but have you made sure that charsets.jar is in your classpath? It is NOT there by default; see this page for more...

The charsets.jar file is an optional feature of the JRE. To install it, you must choose the "custom installation" and select the "Support for additional locales" feature.

Andrew White
  • :D I wondered about that as well and tried using the nio.charset libraries, making encoders and decoders using all of the alternative spellings for the charsets I am attempting to use. None of them threw a `UnsupportedCharsetException`, or any exception at all actually... –  May 02 '11 at 02:03
  • Only needed in ancient versions of Java. – bmargulies May 02 '11 at 02:16
  • I am in fact using jdk 1.5 (1.5.0.22 to be exact... the bug report linked to is 1.5.0.05 so I don't think I'm affected anyhow.) –  May 02 '11 at 02:27
  • @bdares: just for the record, and future readers, that wasn't actually a "bug" but rather a symptom of a particular installation. I am glad you found your answer and I'll leave mine posted so that others can at least eliminate that path as potential solution. – Andrew White May 02 '11 at 02:49