
I have a website with Arabic content that has been migrated from a different server. On the old server everything displayed correctly; supposedly everything was encoded in UTF-8.

On the current server, the data started displaying incorrectly, showing نبذة عن and similar garbled characters.

The application is built on the CakePHP framework.

After many trials, I changed the 'encoding' parameter in the MySQL connection array to 'latin1'. For those who don't know CakePHP, this sets MySQL's connection encoding. Setting this value to 'utf8' did not change anything, even after the steps described below.
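
That parameter lives in the connection array in app/Config/database.php. A minimal sketch for CakePHP 2.x (adjust the path for other versions); the connection values are placeholders, and only the 'encoding' key is the setting in question:

    // app/Config/database.php (CakePHP 2.x) -- placeholder connection values
    class DATABASE_CONFIG {
        public $default = array(
            'datasource' => 'Database/Mysql',
            'host'       => 'localhost',
            'login'      => 'app_user',
            'password'   => 'secret',
            'database'   => 'app_db',
            // CakePHP passes this to the driver as the connection character
            // set, i.e. it has the same effect as SET NAMES on the connection.
            'encoding'   => 'utf8',
        );
    }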

Some of the records started showing correctly in Arabic, while others remained gibberish.

I have already gone through all the database and server checks, confirming the following (see the verification sketch after this list):

  1. The database was created as UTF-8.
  2. The table is UTF-8.
  3. The columns are not explicitly set to any encoding, so they inherit the table's UTF-8 character set.
  4. The default character set in PHP is UTF-8.
  5. The my.cnf settings default to UTF-8.
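
For reference, a quick way to double-check the database-side settings from PHP (a minimal sketch; the connection values are placeholders and the table name is the one mentioned in the comments below):

    <?php
    // Placeholder connection values; substitute your own.
    $db = new mysqli('localhost', 'app_user', 'secret', 'app_db');

    // What the server and this particular connection are really using.
    $res = $db->query("SHOW VARIABLES LIKE 'character_set%'");
    while ($row = $res->fetch_row()) {
        echo $row[0] . ' = ' . $row[1] . PHP_EOL;
    }

    // The character set/collation each column actually has (it can differ
    // from the table and database defaults).
    $res = $db->query("SHOW FULL COLUMNS FROM dynamic_page_i18n");
    while ($row = $res->fetch_assoc()) {
        echo $row['Field'] . ' => ' . $row['Collation'] . PHP_EOL;
    }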

After that, I retrieved my data and looped through it, printing the encoding of each string (from each row) using mb_detect_encoding. The rows that display correctly return UTF-8, while mb_detect_encoding returns nothing for the rows that are corrupt.
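
A sketch of such a loop (connection values are placeholders; table and column names are taken from the comments below):

    <?php
    // Placeholder connection values; substitute your own.
    $db = new mysqli('localhost', 'app_user', 'secret', 'app_db');
    $db->set_charset('utf8');

    $result = $db->query('SELECT content FROM dynamic_page_i18n');
    $i = 0;
    while ($row = $result->fetch_assoc()) {
        // Strict check against UTF-8 only: returns 'UTF-8' for valid UTF-8
        // byte sequences and false otherwise ("nothing" for corrupt rows).
        $enc = mb_detect_encoding($row['content'], array('UTF-8'), true);
        echo ++$i . ': '
            . ($enc === false ? 'not valid UTF-8' : $enc)
            . ' | ' . bin2hex(substr($row['content'], 0, 12)) . PHP_EOL;
    }

Dumping the first few bytes as hex makes it easy to compare against what the column should contain (see the HEX discussion in the answer comments further down).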

The website's data has been edited at multiple times, possibly under different encodings; this is something I cannot know for sure. What I can confirm, though, is that the only two encodings this data might have passed through are UTF-8 and latin1.

Is there any possible way to recover the data when mb_detect_encoding returns nothing and the encoding of the current data set is unknown?

UPDATE: I have found out that while the database was live on the new server, my.cnf was updated. The following directive was changed from:

character-set-server=utf8

to

default-character-set=utf8

I am not sure how much of a difference this makes, though.

Checking the modified dates, I can conclude with a reasonable degree of certainty that the data I could recover was not edited on the new server, while the data I could not recover has been edited there.

Adon
  • If you have phpMyAdmin, can you see your data in the tables? Is it formatted correctly, i.e. are the Arabic characters intact? – Mohammad Alabed Jun 17 '15 at 18:08
  • No, the data is corrupt inside phpMyAdmin. I am actually outputting the data I read and it is corrupt, so it is basically the same logic. – Adon Jun 17 '15 at 18:12
  • If the data is corrupt inside your DB, that means your problem is not from CakePHP; it is from the DB itself. Export your DB again from the old server with UTF-8. – Mohammad Alabed Jun 17 '15 at 18:16
  • No, the problem is neither in CakePHP nor in PHP, for that matter. I was just highlighting the use of the connection encoding. Unfortunately, I cannot export from the old server to the new one anymore. I am looking for a workaround to: 1) get the actual encoding of the corrupt text, and 2) try to change it back to the proper text. – Adon Jun 17 '15 at 18:21
  • @Adon do you still have the export file from the old server? Perhaps you could look at the text in that file. If it hasn't exported correctly then you probably have a big problem as it sounds like you've lost data. – drmonkeyninja Jun 17 '15 at 18:24
  • No way; as far as I know, it is impossible to fix corrupt data in the DB once it is already like that. – Mohammad Alabed Jun 17 '15 at 18:25
  • @MohammadAlabed There should be a way somehow. I remember working on a similar problem around 8-9 years ago. I was able to recover the data with only one corrupt character. So, I'm not losing hope, and I don't want to go into trial and error with reading byte data and whatnot for now. I'm checking if someone has a somewhat straightforward answer I missed. – Adon Jun 17 '15 at 19:02
  • @Adon When you exported the DB, what was the character set? Maybe there is a way to export your current DB, fix the file, and then import it again. – Mohammad Alabed Jun 17 '15 at 19:51
  • @MohammadAlabed You mean using some sort of text editor (like Notepad++) and trying to manipulate the encoding there? – Adon Jun 17 '15 at 21:13

3 Answers


Try to fix the problem from the DB side, not from PHP or the DB connection.

I advise you to go back to your old server and export your DB again with the character set UTF-8.

Then, after importing it to the new server, make sure that you can see the Arabic characters inside the tables (with phpMyAdmin).

If your tables look fine, you can then move on to check the following:

  • DB connection

  • PHP file encoding

  • the header encoding in the HTML

As far as I know, if the problem is in the DB itself, there is no way around exporting the data again from the old server.
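
For example, a minimal sketch of the DB-connection and header checks (plain mysqli, placeholder connection values; the PHP source files themselves should also be saved as UTF-8):

    <?php
    // Tell the browser the page is UTF-8 (the "header encoding" check).
    header('Content-Type: text/html; charset=utf-8');

    // Force the DB connection itself to UTF-8 (the "DB connection" check).
    $db = new mysqli('localhost', 'app_user', 'secret', 'app_db');
    $db->set_charset('utf8'); // same effect as running SET NAMES utf8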

Edit:

If you do not have access to your old DB, please check this answer; it can help you.

Mohammad Alabed
  • As I stated earlier, I do not have access to the old server anymore, so this solution will not work for me. I will post an update in the question, because I have new input. – Adon Jun 17 '15 at 18:44
  • I will check this as soon as possible and let you know. Thanks. – Adon Jun 17 '15 at 21:14
  • I read about something like converting the DB to binary and then converting it again to UTF-8; I am not sure. You can check here: http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html – Mohammad Alabed Jun 18 '15 at 09:28

You were expecting نبذة عن but got نبذة عن? That is Mojibake. See the duplicate for discussion and a solution, including how to recover the data via a pair of ALTER TABLEs.
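
A sketch of such a pair of ALTER TABLEs, with the table and column names taken from the comments below (column types/sizes and connection values are assumptions; back up the table first):

    <?php
    // Placeholder connection values; substitute your own.
    $db = new mysqli('localhost', 'app_user', 'secret', 'app_db');
    $db->set_charset('utf8');

    // Step 1: reinterpret the column as raw bytes so no character conversion happens.
    $db->query("ALTER TABLE dynamic_page_i18n MODIFY COLUMN content BLOB NOT NULL");

    // Step 2: declare those same bytes to be UTF-8 text.
    $db->query("ALTER TABLE dynamic_page_i18n MODIFY COLUMN content TEXT CHARACTER SET utf8 NOT NULL");

The point of going through a binary type is that MySQL then leaves the stored bytes untouched and only changes what it believes their encoding to be.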

Rick James
  • That is correct. I am curious about how you have achieved this. I already tried the solution and it did not work for me. I have to use the `text` data type instead of `varchar` because of size constraints, but I don't believe this would cause a problem because they basically store data the same. I am using mysql 5.5.42, but I also believe this is not the problem as this behavior should be the same in all recent 5.x versions. Any thoughts on this? – Adon Jun 25 '15 at 13:49
  • `TEXT` vs `VARCHAR` -- they work the same. Version 5.x -- The problem manifests itself the same way since 4.1. `BINARY(CONVERT('نبذة عن' USING latin1)) = نبذة عن `. The stored data _should_ be (in HEX): D986D8A8...; HEX for what you say was C383E284A2C3A2E282AC... [More discussion](http://mysql.rjweb.org/doc.php/charcoll). – Rick James Jun 26 '15 at 01:10
  • True, I am getting the correct hexadecimal string. The problem is converting back to the proper string. It is converting back to the same unrecognizable text. The queries are `alter table dynamic_page_i18n modify COLUMN content varbinary(1000) NOT NULL DEFAULT '';` and then `alter table dynamic_page_i18n modify COLUMN content text CHARACTER SET utf8 NOT NULL DEFAULT '' ;` No luck though. I know they should work, but they aren't. The database is utf8, the table is utf8 and the collation is utf8_general_ci. – Adon Jun 26 '15 at 09:27
  • Recap. You have Dxxx hex in `content`; good. You have `text ... CHARACTER SET utf8`; good. But `SELECT` gives garbage? This probably means that the connection is not 'utf8'. You could also do `SET NAMES utf8;` right after connecting. – Rick James Jun 26 '15 at 15:22

I had a similar problem when migrating database tables encoded in utf8 from a public server to localhost. The fix was to set the connection character set on localhost using PHP:

$db->set_charset("utf8"); // $db is the mysqli connection object

right after creating the mysqli connection.

Now it works properly.

aevum.v