
This is a common question which has been asked many times before. However, I still cannot find the right answer from Google.

In my web app there is a form for collecting data, and the app handles all data in UTF-8. However, the collation of the schema and the table was mistakenly set to latin1. Moreover, "SET NAMES UTF8" is issued on every connection.

Now some of the Chinese data always shows as question marks (?), no matter what conversion method I use. Querying the problem columns as binary also shows the data is several 3F bytes, i.e. several '?' characters.

Is my data still able to be converted to UTF-8 and shown correctly, or is it already lost?
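Once the stored bytes are literal 0x3F, no later charset conversion can bring the original text back, since '?' is a perfectly valid character in every charset. A minimal Python sketch (the byte string below is a hypothetical column value fetched as binary):

```python
# Hypothetical raw column value read back as binary: three 0x3f bytes
raw = b"\x3f\x3f\x3f"

# Decoding under either charset yields only literal question marks:
print(raw.decode("latin-1"))  # ???
print(raw.decode("utf-8"))    # ???

# 0x3f is ASCII '?', legal in both latin1 and UTF-8, so no
# conversion step can distinguish it from a genuine '?' or
# reconstruct the bytes that were replaced.
```

This is why changing the column charset after the fact cannot help: the information needed for recovery is no longer in the table.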

[UPDATE]

This is not the same question as How to convert an entire MySQL database characterset and collation to UTF-8?, because I have not only converted the entire database and table to UTF-8 but have also run mysqldump and re-imported the dump into the database. Neither approach works.

[UPDATE 2]

The problem is not just about converting the table charset; it also requires understanding how the UTF-8 and Latin-1 encoding systems work.

Basic knowledge is:

Latin-1 uses exactly 1 byte (8 bits) per character.

UTF-8 is a variable-length encoding, which means a character MAY NOT fit in just 1 byte.

In UTF-8, a single-byte character needs 1 bit for identification, so only 7 bits remain for the character itself, compared with Latin-1's 8. Characters that fit in 7 bits (the ASCII range) are therefore encoded identically in both charsets and can be stored in a Latin-1 column without loss. Any character that needs more than 7 bits, however, becomes a multi-byte sequence in UTF-8 and will be broken.

So Chinese and Japanese characters, which need 2 to 3 bytes each in UTF-8, are damaged during the storing process: they have no Latin-1 equivalent, so the server cannot store them as-is in a latin1 column.

That's why, no matter how I change the charset of the database and the table, the data still shows as '?': every character outside Latin-1's range was replaced with '?' (3F in hex) at insert time, and the original bytes are gone.
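The substitution described above can be reproduced outside MySQL. Python's 'replace' error handler behaves analogously to what the server does with characters that latin1 cannot represent (this is a sketch of the mechanism, not of MySQL itself):

```python
s = "們"  # a CJK character, 3 bytes in UTF-8

utf8_bytes = s.encode("utf-8")
print(utf8_bytes)  # b'\xe5\x80\x91'

# latin-1 has no mapping for CJK characters, so encoding with the
# 'replace' handler substitutes a single '?' byte:
latin1_bytes = s.encode("latin-1", errors="replace")
print(latin1_bytes)        # b'?'
print(latin1_bytes.hex())  # 3f

# Pure ASCII survives, because those 7-bit code points are encoded
# identically in both charsets:
assert "abc".encode("latin-1") == "abc".encode("utf-8")
```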

Channa
panda
  • @HoussemBdr I have done that and nothing changed. – panda Oct 18 '16 at 09:27
  • Please give us some more details: how are you converting, using JSON or something else? – Houssam Badri Oct 18 '16 at 10:00
  • @HoussemBdr A UTF-8 Chinese character uses 3 bytes of storage, e.g. '們' is \xE5\x80\x91. After storing it into a latin1 table, it becomes '?'. So I used ALTER TABLE to change the table collation from latin1 to utf8; it still shows '?'. Using CONVERT(column USING binary), it still shows '?'. It seems the last 2 bytes were dropped during the storing procedure. – panda Oct 18 '16 at 10:30
  • It always works for me with my Arabic language. Check the newly stored Chinese words after running the ALTER DATABASE, not the already stored ones – Houssam Badri Oct 19 '16 at 04:54
  • @HoussemBdr The purpose of this post is to recover data already stored in the table, but it turned out to be impossible. The reason has been added as an update to the post. Thanks for your comments. – panda Oct 20 '16 at 07:56
  • Yang, good that you finally found the truth (even if it is not good news); indeed the data is already damaged. Good work – Houssam Badri Oct 20 '16 at 11:31

1 Answer


Just change the character set of the entire database:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;

And of course you can do the same for an individual table (ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;).

Furthermore, have a look at the documentation here.

EDIT:

Otherwise, if your data is already stored as "?" marks, the reality is that it is damaged.

Houssam Badri