
I am using a MySQL database (Ver 14.14 Distrib 5.5.21, for Linux (x86_64)). I save strings into this database using prepared statements in a Java class.

Now I would like to make sure that all strings I save are in UTF-8 format (as defined in the database creation schema) and contain no broken characters. It has already happened that strings were broken and contained question marks instead of the characters that should be there. In my case, "R��ckenschmerzen" was shown instead of "Rückenschmerzen": the German character "ü" was broken. Is it possible to find such errors via a JUnit test?

Any help would be appreciated. Thank you in advance. Horace

Roman C
Horace
  • How do you differentiate strings that are broken or not? – Roman C Nov 08 '12 at 15:49
  • *"...and contain not broken..."* Contain broken what? Codepoint sequences? – T.J. Crowder Nov 08 '12 at 15:49
  • @T.J. Crowder: broken characters. – Horace Nov 08 '12 at 15:50
  • @Horace: So yes, codepoint sequences. To improve a question, *edit* the question. – T.J. Crowder Nov 08 '12 at 15:52
  • @Roman C: strings that are broken contain question marks instead of the characters that should be there. In my case, "R��ckenschmerzen" was shown instead of "Rückenschmerzen". The German character "ü" was broken. – Horace Nov 08 '12 at 15:53
  • @Horace This doesn't mean that they're broken, but that they have a different encoding. If you use UTF-8, your DB must support it. Does it? – Roman C Nov 08 '12 at 15:55
  • @Roman: Yes, it does support UTF-8. But despite this I could save a string with another encoding. The example I mentioned above proves it, doesn't it!? – Horace Nov 08 '12 at 16:01
  • @Roman: Hi Roman. It is because I do not want to configure the database as a whole; rather, I would like to make sure that characters that are to be saved into the database are in UTF-8 format. For that, one need not configure the whole database. One can set it when needed in the create statement for a particular table. For example: CREATE TABLE IF NOT EXISTS `newTableName` (...blabla_fielddeclarations_blabla...) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ; – Horace Nov 12 '12 at 15:44
  • @Horace Of course you can, but when I started to talk about collations and gave you information on them, you decided to kick me off. Your last statement touches on what my answer was based on. You didn't mention a table in your question. That's why I decided to give you an exhaustive answer. – Roman C Nov 12 '12 at 16:22

2 Answers


By default, a MySQL database is configured to use the latin1 character set, but you can change that in my.ini:

    # The default character set that will be used when a new schema or table is
    # created and no character set is defined
    #default-character-set=latin1

    default-character-set=utf8

The collation used by default for utf8 is utf8_general_ci, but there are other collations, with a total of "650 languages" supported; check the manual.

Roman C
  • Thank you very much for the information, Roman. But I think I asked the wrong question. The right question should be: How can I check whether a string is valid UTF-8 (using Java)? Because I think that if you set your database to UTF-8 and then erroneously write a string in another encoding to it, it will be saved nevertheless (with the replacement character U+FFFD � put in place of each unknown character). So the solution for me is to check whether strings in the database contain U+FFFD (�). – Horace Nov 08 '12 at 16:50
  • [This is a better answer for this](http://stackoverflow.com/questions/6622226/check-if-a-string-is-valid-utf-8-encoded-in-java) – Roman C Nov 08 '12 at 17:03

@Roman: Thank you very much for the information, Roman. But I think I asked the wrong question. The right question should be: How can I check whether a string is valid UTF-8 (using Java)?
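A minimal sketch of such a check, assuming the raw bytes are still available before they are turned into a `String` (a Java `String` is always UTF-16 internally, so "valid UTF-8" is really a property of bytes). The class and method names are only illustrative. It uses a strict `CharsetDecoder` that reports malformed input instead of silently replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is not from any library.
final class Utf8Check {

    /** Returns true if the given bytes are a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of inserting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For example, the UTF-8 bytes of "Rückenschmerzen" pass this check, while the ISO-8859-1 bytes of the same word do not, because a lone 0xFC byte is not valid UTF-8. Such a method can be called from an ordinary JUnit assertion.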

Because I think that if you set your database to UTF-8 and then erroneously write a string in another encoding to it, it will be saved nevertheless (with the replacement character U+FFFD � put in place of each character that could not be decoded).

So the solution for me is to check whether strings in the database contain U+FFFD (�).
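A small sketch of that check in Java, assuming the strings have already been read back from the database into `String` objects (the class and method names are made up for illustration):

```java
// Hypothetical helper class; the name is not from any library.
final class ReplacementCharCheck {

    /** U+FFFD, the Unicode replacement character. */
    static final char REPLACEMENT = '\uFFFD';

    /** Returns true if the string contains at least one U+FFFD. */
    static boolean containsReplacementChar(String s) {
        return s != null && s.indexOf(REPLACEMENT) >= 0;
    }
}
```

In a JUnit test this could be used as `assertFalse(ReplacementCharCheck.containsReplacementChar(valueFromDb))`. Note that it only detects damage that has already produced U+FFFD. A string that was mis-decoded into other legal characters (for example "Ã¼" instead of "ü") would not be caught.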

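Another, preventive solution is to make sure that the characters in my string can all be represented in UTF-8 before I save it into the database, e.g.:

    import java.nio.charset.StandardCharsets;

    String myString = "blablabla";
    // A bare getBytes() uses the platform default charset, so name UTF-8 on both sides.
    String finalStringToBeInserted = new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    saveToDatabase(finalStringToBeInserted);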

Regards, Horace

Horace