5

I'm using Netbeans building a web application using Java, JSP that handle a database with Hebrew fields.

The DDL is as follows:

String cityTable = "CREATE TABLE IF NOT EXISTS hebrew_test.table ("
                            +"id int(11) NOT NULL AUTO_INCREMENT,"
                            +"en varchar(30) NOT NULL,"
                            +"he varchar(30) COLLATE utf8_bin NOT NULL,"
                            +"PRIMARY KEY (id)"
                            +") ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1;";
String insert = "INSERT INTO hebrew_test.table (en, he) VALUES ('A','a')";
String insert2 = "INSERT INTO hebrew_test.table (en, he) VALUES ('B','ב')";
String insert3 = "INSERT INTO hebrew_test.table (en, he) VALUES ('C','אבג')";


executeSQLCommand(cityTable);
executeSQLCommand(insert);
executeSQLCommand(insert2);
executeSQLCommand(insert3);

The output tabel I get:

1   A   a
2   B   ?
3   C   ???

Instead of:

1   A   a
2   B   ב
3   C   אבג

I tried Hebrew appears as question marks in Netbeans, but that isn't the same problem. I get the question marks in the table.

Also I defined the table to be in UTF8_bin as you can see in the above code.

Community
  • 1
  • 1
Matan Touti
  • 191
  • 2
  • 14

2 Answers2

7

You need to tell the JDBC driver to use UTF-8 encoding while decoding the characters representing the SQL query to bytes. You can do that by adding useUnicode=yes and characterEncoding=UTF-8 query parameters to the JDBC connection URL.

jdbc:mysql://localhost:3306/db_name?useUnicode=yes&characterEncoding=UTF-8

It will otherwise use the operating system platform default charset. The MySQL JDBC driver is itself well aware about the encoding used in both the client side (where the JDBC code runs) and the server side (where the DB table is). Any character which is not covered by the charset used by the DB table will be replaced by a question mark.

See also:

Community
  • 1
  • 1
BalusC
  • 1,082,665
  • 372
  • 3,610
  • 3,555
  • You are the best, my friend!!! I've been searching the solution for a long time. Changed almost all the character sets and collations :D Thanks a lot! – Medet Koilybay Jun 18 '17 at 13:01
2

You're including your values directly into the SQL. That's always a bad idea. Use a PreparedStatement, parameterized SQL, and set the values as parameters. It may not fix the problem - but it's definitely the first thing to attempt, as you should be using parameterized SQL anyway. (Parameterized SQL avoids SQL injection attacks, separates code from data, and avoids unnecessary conversions.)

Next, you should work out exactly where the problem is really occurring:

  • Make sure that the value you're trying to insert is correct.
  • Check that the value you retrieve is correct.
  • Check what's in your web response using Wireshark - check the declared encoding and what's in the actual data

When checking the values, you should iterate over each character in the string and print out the value as a UTF-16 code unit (either use toCharArray() or use charAt() in a loop). Just printing the value to the console leaves too much chance of other problems.

EDIT: For a little context of why I wrote this as an answer:

  • In my experience, including string values as parameters rather than directly into SQL can sometimes avoid such issues (and is of course better for security reasons etc).
  • In my experience, diagnosing whether the problem is at the database side or the web side is also important. This diagnosis is best done via logging the exact UTF-16 code units being used, not just strings (as otherwise further encoding issues during logging or console output can occur).
  • In my experience, problems like this can easily occur at either insert or read code paths.

All of this is important as a way of moving the OP forward, not just in a comment-like request for more information.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • thanks Jon i found that:http://www.roseindia.net/jdbc/prepared-statement-example.shtml – Matan Touti Jan 01 '13 at 18:34
  • @MatanTouti: I wouldn't say that's a great page - bad exception handling, no resource clean-up, etc... http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html is somewhat better, although still not perfect... – Jon Skeet Jan 01 '13 at 18:37
  • the values i trying to insert are correct, by the debugger (is it good enough verification?): allprofessions = "INSERT INTO prodb.professions (id, profession, professionEn) VALUES ( 1,'אחר','other'),( 2,'חשמלאי','electrician'),( 3,'אינסטלטור','plubmber')". The value im getting actually are the table i showed above. And how to check a web response is over my knowledge.. i think that the answer more as a declaration of UTF 8 somewhere (i'm guessing). – Matan Touti Jan 01 '13 at 18:52
  • @MatanTouti: Well, as I said - use Wireshark to see exactly what's coming in the response. If you haven't used Wireshark before, now is a good time to learn - *all* web developers should know how to see what's coming across the wire... – Jon Skeet Jan 01 '13 at 18:55
  • the server im running is localhost, with xampp. can i use wireshark to examine that although its localhost? – Matan Touti Jan 01 '13 at 19:08
  • @MatanTouti: That may make it harder, admittedly. You might be able to use Fiddler (http://fiddler2.com) instead. Or just find another machine to use to make the request. Alternatively, use `curl` or something similar - anything which will let you see the *exact* data in the response, in binary form. – Jon Skeet Jan 01 '13 at 19:09
  • You're not answering the question in any way. There are only 2 places where question marks would appear when a wrong character encoding is used. The response encoding is clearly scratched from being suspect. The request encoding as you seem to suspect is impossibly the cause. It would only have resulted in Mojibake, not in question marks. – BalusC Jan 02 '13 at 02:35
  • @BalusC: I don't think *anything* is scratched from being suspect, until it's actually proved. I've seen enough bad diagnoses of this sort of thing to be very reluctant to make assumptions. My answer is more about *how to diagnose things more clearly* than anything else, and I think that's an important part of making progress. So it's not the final solution - but it's a step along the way. – Jon Skeet Jan 02 '13 at 06:45
  • @BalusC: I think we've had similar disagreements on several questions. I believe my answer is an answer to how to move forward. I think we'll have to agree to disagree on this - I think it's fine as an answer, and I'm not going to move it to a comment. – Jon Skeet Jan 02 '13 at 13:09
  • @BalusC: This isn't the first time you've basically accused me of providing answers in an attempt to get reputation instead of helping. Please desist; it's simply untrue. We disagree about what is helpful, and what constitutes an answer, but using terms like "potentially vote-generating blah" and "clueless shooting in the dark" is not the way forward, IMO. Feel free to flag for moderator consideration if you're not happy to just leave things be. – Jon Skeet Jan 02 '13 at 13:24
  • Okay, I think this has gone far enough - further than is constructive. I don't view my answer as "randomly googled crap" nor do I regard Java/servlets/encodings as something "which I know nothing about". I'm flagging your comment for moderator attention, not to try to obtain a penalty or anything similar - just to sort this out. – Jon Skeet Jan 02 '13 at 13:31
  • As per your update, you're only describing cases where Mojibake will occur, not question marks. In other words, you're confusing Mojibake with question marks. Plus, the OP has clearly demonstrated that even inserting hardcoded Hebrew results in question marks, which thus clearly excludes HTTP from being the cause. The OP has also explicitly mentioned that the problem is visible straight in the table (thus he looked in there using some DB admin tool), so the cause is clearly in JDBC driver part. – BalusC Jan 02 '13 at 13:43
  • @BalusC: It excludes HTTP from being the cause at insert, but it wasn't clear (IMO) whether the tool used for diagnostics would have been capable of getting the right result. It seemed *likely* that the problem was JDBC rather than HTTP, but IMO it was still worth helping the OP to diagnose the problem more specifically. (Personally I'd have written a console app which just did an insert then fetch.) Aside from anything, this is "teach a man how to fish" territory. – Jon Skeet Jan 02 '13 at 13:52
  • It's very unlikely that using the "wrong" diagnostics tool would have resulted in question marks. More unlikely than HTTP would be the cause. I recommend to read the "See also" link in my answer to learn when question marks would occur instead of Mojibake. – BalusC Jan 02 '13 at 13:55
  • @BalusC: Again, I think we'll have to agree to differ on what's helpful and what's not. (I've seen plenty of bad ways of diagnosing issues which can very easily end up with question marks - using a logger with a platform-default encoding could do it for exactly the reasons you've given, for example.) – Jon Skeet Jan 02 '13 at 13:58
  • Again, you're confusing Mojibake with question marks. Using the wrong logger encoding would have resulted in Mojibake, not in question marks. Please read the "See also" link to learn more. – BalusC Jan 02 '13 at 13:58
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/22060/discussion-between-jon-skeet-and-balusc) – Jon Skeet Jan 02 '13 at 14:01