Discover the charset from foreign database

Question

I have mysql database (not mine). In this database all the encodings set to utf-8, and I connect with charset utf-8. But, when I try to read from the database I get this:

×¢×?×§ 1
×‘×™×ª ×ª×•×’× ×” ×”×¢×•×¡×§ ×‘×ž×¡×¤×¨ ×©×¤×•×ª ×ª×•×›× ×” 
× × ×œ× ×œ×¤× ×•×ª ××—×¨×™ 12 ×‘×œ×™×œ×” ..

What I supposed to get:

עסק 1
בית תוגנה העוסק במספר שפות תוכנה
נא לא לפנות אחרי 12 בלילה ..

When I look from phpmyadmin, I have the same thing(connection in pma is to utf-8). I know that the data is supposed to be in Hebrew. Someone have an idea how to fix these?

score 0 · Accepted Answer · edited May 23 '17 at 12:21

You appear to have UTF-8 data that was treated as Windows-1252 and subsequently converted to UTF-8 (sometimes referred to as "double-encoding").

The first thing that you need to determine is at what stage the conversion took place: before the data was saved in the table, or upon your attempts to retrieve it? The easiest way is often to SELECT HEX(the_column) FROM the_table WHERE ... and manually inspect the byte-encoding as it is currently stored:

If, for the data above, you see C397C2A9... then the data is stored erroneously (an incorrect connection character set at the time of data insertion is the most common culprit); it can be corrected as follows (being careful to use data types of sufficient length in place of TEXT and BLOB as appropriate):
1. Undo the conversion from Windows-1252 to UTF-8 that caused the data corruption:
```
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET latin1;
```
2. Drop the erroneous encoding metadata:
```
ALTER TABLE the_table MODIFY the_column BLOB;
```
3. Add corrected encoding metadata:
```
ALTER TABLE the_table MODIFY the_column TEXT CHARACTER SET utf8;
```
See it on sqlfiddle.

Beware to correctly insert any data in the future, or else the table will be partly encoded in one way and partly in another (which can be a nightmare to try and fix).

If you're unable to modify the database schema, the records can be transcoded to the correct encoding on-the-fly with CONVERT(BINARY CONVERT(the_column USING latin1) USING utf8) (see it on sqlfiddle), but I strongly recommended that you fix the database when possible instead of leaving it containing broken data.
However, if you see D7A2D73F... then the data is stored correctly and the corruption is taking place upon data retrieval; you will have to perform further tests to identify the exact cause. See UTF-8 all the way through for guidance.

Another qustion. How i do these in php? I tried $a = iconv('UTF-8', 'cp1252//TRANSLIT//IGNORE', $a); and it worked, but when I use htmlspecialchars, it says "Invalid multibyte sequence in argument", and I specify utf-8 charset in htmlspecialchars. — Kirill, Jul 13 '13 at 15:08
@Kirill: I think you should probably [post a new question](http://stackoverflow.com/questions/ask), detailing what you are trying to accomplish and what you are seeing instead. — eggyal, Jul 13 '13 at 15:14
I want to read from database correctly without changing it, because it isn't my database, and I don't sure I will have the permission to change it. So I want to correct the text in php, I tried to do these with iconv, but i had the problem that i described above. — Kirill, Jul 13 '13 at 15:20
@Kirill: I covered this possibility in the most recent edit to my answer above. See the paragraph beginning "*If you're unable to modify the database schema...*". Another possibility would be to deliberately set an erroneous connection character set, but that might break other things and will only exacerbate problems for the future. — eggyal, Jul 13 '13 at 15:24

Discover the charset from foreign database

1 Answers1