I have created a database that stores information about Central European countries. I am having a hard time getting the character set to work correctly, because these languages use so many special characters. I know that ISO-8859-2 (Latin-2) is the character set for Central European languages, but when I change the character set of my database and tables (via phpMyAdmin), the garbled characters do not completely go away. I have been trying to follow this link's guide to fix the problem.

Here is an example from the first row of the database:

�esk� republik', 'CZ', '100 00', 'Praha 10-Stra�nice (?�st) x)', 
'Hlavn� m?sto Praha', 1)

Here is the whole MySQL dump file

I am very thankful for your time.

  • Your dump looks corrupted with invalid utf8 symbols. – Raymond Nijland Aug 15 '18 at 19:30
  • Yes, that is what I am talking about. Do you have any suggestions to fix that, @RaymondNijland? The corrupted symbols are probably the result of not being able to read the special characters present in Central European languages. – Jordan Merrill Aug 15 '18 at 20:23
  • The dump file is no longer posted. See "black diamond" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Aug 22 '18 at 20:55
  • What encoding is in your _client_? That is key to solving your problem. – Rick James Aug 22 '18 at 20:55

1 Answer

The GitHub link is really talking about Mojibake, not "double encoding".

You are using Czech characters, correct? I see two failings in that snippet of output: "black diamonds with question marks" and "ordinary question marks". They are handled separately in "Trouble with UTF-8 characters; what I see is not what I stored".

But, before trying to solve the problem, figure out what encoding is being used in the client.

Black diamonds (�esk�)

Case 1 (the client is using latin2, not utf8):

  • The bytes to be stored are not encoded as utf8. If you can change this, do so. The link assumes that utf8, not latin2, is the goal. It will take more research to figure out what to do if the client really needs to be latin2.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not set to the client's encoding (latin2 or utf8 or utf8mb4).
  • Also, check that the column in the database is CHARACTER SET utf8mb4. (Yes, you could store as latin2, but since you need to fix stuff, let's go with the preferred encoding.)
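To see why Case 1 produces black diamonds, here is a minimal Python sketch (not MySQL itself). It assumes the stored bytes are Latin-2 and that whatever displays them decodes them as UTF-8; the helper name show_as_utf8 is made up for illustration.

```python
def show_as_utf8(text: str, actual_encoding: str = "iso8859_2") -> str:
    """Encode text as Latin-2, then decode those bytes as if they were UTF-8.

    Bytes that are not valid UTF-8 become U+FFFD, the "black diamond".
    """
    raw = text.encode(actual_encoding)            # what a latin2 client sends
    return raw.decode("utf-8", errors="replace")  # what a UTF-8 viewer shows


if __name__ == "__main__":
    print(show_as_utf8("Česká republika"))  # the accented letters turn into U+FFFD
    print(show_as_utf8("Strašnice"))
```

The plain ASCII letters survive because they are encoded identically in Latin-2 and UTF-8; only the accented letters are damaged, which matches the pattern in the question's sample row.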

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Question Marks (regular ones, not black diamonds) (m?sto):

  • Check the client encoding (as above)
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
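Plain question marks have a different cause: a character that the target character set cannot represent is replaced with "?" during conversion. The Python sketch below imitates that substitution with errors="replace". It assumes the column is effectively latin1, which MySQL documents as cp1252; store_in_narrow_column is a hypothetical helper, not a MySQL function.

```python
def store_in_narrow_column(text: str, column_charset: str = "cp1252") -> bytes:
    """Imitate storing text in a column whose charset lacks some characters.

    Characters that cannot be represented are replaced with b"?",
    and the original character is lost for good.
    """
    return text.encode(column_charset, errors="replace")


if __name__ == "__main__":
    print(store_in_narrow_column("město"))              # ě is not in cp1252
    print(store_in_narrow_column("Hlavní město Praha"))  # í survives, ě does not
```

Unlike the black diamonds, this loss is irreversible: once "?" has been stored, the original letter cannot be recovered from the table.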

Black diamonds tend to be a browser-only problem, due to the lack of <meta charset=UTF-8>. Most browsers today default to UTF-8, but they can get confused.

See the link for using SELECT col, HEX(col) ... for debugging what has been stored.
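When reading HEX(col) output, it helps to know what the same letter looks like in each encoding. This small Python sketch prints the expected hex for a few Czech letters; hex_in is an illustrative helper, and the letters chosen are just samples.

```python
def hex_in(text: str, encoding: str) -> str:
    """Return the bytes of text in the given encoding as uppercase hex."""
    return text.encode(encoding).hex().upper()


if __name__ == "__main__":
    print("char  utf8  latin2")
    for ch in "Čáěš":
        print(f"{ch}     {hex_in(ch, 'utf-8')}  {hex_in(ch, 'iso8859_2')}")
```

A correctly stored utf8/utf8mb4 value shows two bytes per accented Czech letter (for example C48C for Č), while Latin-2 uses one byte (C8 for Č). A stored 3F is a literal "?", meaning the character was already lost on the way in.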

CONVERT(CONVERT(CONVERT(BINARY('éáčďéěíňóřšťúůýž') USING utf8) USING latin1) USING utf8) 
                           --> '��??�?�?�?�?�?�� 
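The same round trip can be imitated outside MySQL. This Python sketch assumes that latin1 behaves like cp1252 and that errors="replace" stands in for MySQL's substitutions; it is an approximation of the CONVERT chain, not a reimplementation of it.

```python
def roundtrip(text: str) -> str:
    """Narrow text to cp1252 (lossy), then misread the result as UTF-8."""
    narrowed = text.encode("cp1252", errors="replace")  # unmappable chars -> "?"
    return narrowed.decode("utf-8", errors="replace")   # invalid bytes -> U+FFFD


if __name__ == "__main__":
    sample = "éáčďéěíňóřšťúůýž"
    for before, after in zip(sample, roundtrip(sample)):
        kind = "question mark" if after == "?" else "black diamond"
        print(before, "->", kind)
```

Letters that exist in cp1252 (é, á, í, ó, š, ú, ý, ž) come out as black diamonds, and the rest (č, ď, ě, ň, ř, ť, ů) come out as plain question marks, which is the same mixture seen in the question.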

So, I would guess that you are actually using latin1, not latin2. Run

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin2                     | <--
| character_set_connection | latin2                     | <--
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | latin2                     | <--
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

The three marked variables need to be set according to the encoding in the client.

Rick James