
I have a page that loads data from several databases, which may use different character sets. The problem is that the text arrives with a broken encoding instead of clean UTF-8, and I need a way to load it properly.

My connection is:

$db = new PDO("mysql:host=".DBHOST.";dbname=".DBNAME, DBUSER, DBPASS);
$db->setAttribute(PDO::MYSQL_ATTR_INIT_COMMAND, 'SET NAMES utf8'); 
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION); 

As you can see, I use 'SET NAMES utf8'.

I have <meta charset="utf-8"> in <head>

And I have tried some conversions:

 error_log("ORIGINAL: ".$row["title"]);
 error_log("ICONV: ".iconv(mb_detect_encoding($row["title"], mb_detect_order(), true), "UTF-8", $row["title"]));
 error_log("UTF_ENCODE: ".utf8_encode ($row["title"]));

I believe all my files are saved as UTF-8 too. I re-saved every file in Notepad, switching from ANSI to UTF-8, and then verified them with this tool: https://nlp.fi.muni.cz/projects/chared/

Now, here is where the fun begins: not only do I get the wrong output, but the output also differs between the browser and the error log.

Original string stored in DB: http://screenshot.cz/F7/F7XRF/sdb.png

FIREFOX reaction:

Original:

http://screenshot.cz/TG/TG7RX/for.png

utf8_encode:

http://screenshot.cz/H9/H9IZJ/fu.png

iconv: same as utf8_encode

And this is how it appears in the PHP error log: http://screenshot.cz/FY/FYXEE/el.png

As you can see, the unconverted original gives the best result, while every conversion attempt deforms the output further. I also tried changing the error log file's charset to UTF-8 (the original is unknown, probably ANSI), but the output looks the same in both encodings.

The text is Central European (Czech). The characters I need are: á é ý í ó ú ů ž š č ř ď ť ň ě

So, any ideas where something could be going wrong?

Thanks :)

Mahmut Salman
Zorak
  • I have previously written [**an answer**](http://stackoverflow.com/questions/31897407/mysql-and-php-utf-8-with-cyrillic-characters/31899827#31899827) that contains a little checklist, that will cover *most* of the charset issues in a PHP/MySQL application. There's also a more in-depth topic, [UTF-8 All the Way Through](http://stackoverflow.com/questions/279170/utf-8-all-the-way-through). Most likely, you'll find a solution in either one or both of these topics. – Qirel May 31 '17 at 13:36
  • Have you used any other character sets than utf8? – Rick James May 31 '17 at 17:24

1 Answer


Do not use any conversion functions.

There are two causes for black diamonds; see "Trouble with utf8 characters; what I see is not what I stored".

The error file is exhibiting Mojibake, or possibly "double encoding". Those are also discussed in the link above.

Check that Firefox is interpreting the page as UTF-8. Older versions did not necessarily assume that.
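
One way to take the guesswork away from the browser is to declare the charset in the HTTP response header as well as in the meta tag. The following is only a minimal sketch: `utf8_content_type()` is a hypothetical helper name, and it assumes the script has not produced any output before this point.

```php
<?php
// Hypothetical helper: build the header line that declares UTF-8.
function utf8_content_type(string $mime = 'text/html'): string
{
    return "Content-Type: {$mime}; charset=utf-8";
}

// Headers can only be sent before any output has started.
if (!headers_sent()) {
    header(utf8_content_type());
}
```

The browser's developer tools (Network tab, response headers) should then show which charset the page was actually served with.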

Oh, I just noticed the plain question mark. (Also covered in the link.) You win the prize for the most ways to mangle UTF-8 in a single file!

This possibly means that there are multiple errors occurring. Good luck. If you provide HEX of the data at various stages (in PHP, in the database table, etc), I may be able to help in more detail.
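
As a rough illustration of what such HEX output can tell you, here is a sketch that dumps the bytes of the word "české" in three states. The helper name `hex_dump_stage()` is hypothetical, the mbstring extension is assumed to be loaded with its default substitute character, and the conversions only imitate the failure modes in PHP; they are not what MySQL does internally. On the database side, `SELECT HEX(title)` gives the comparable view.

```php
<?php
// Hypothetical helper: label a value and show its raw bytes as hex.
function hex_dump_stage(string $label, string $value): string
{
    return sprintf('%s: %s', $label, bin2hex($value));
}

// "české" written with \u{...} escapes (PHP 7+), so the result does not
// depend on how this source file itself was saved.
$clean = "\u{010D}esk\u{00E9}";

// Double encoding: the UTF-8 bytes are misread as Latin-1 and then
// encoded to UTF-8 a second time.
$double = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1');

// Lossy conversion: Latin-1 has no "č", so mbstring substitutes "?".
$lossy = mb_convert_encoding($clean, 'ISO-8859-1', 'UTF-8');

echo hex_dump_stage('clean UTF-8   ', $clean), PHP_EOL;  // c48d65736bc3a9
echo hex_dump_stage('double encoded', $double), PHP_EOL; // c384c28d65736bc383c2a9
echo hex_dump_stage('lossy Latin-1 ', $lossy), PHP_EOL;  // 3f65736be9
```

In clean UTF-8, "č" is the two bytes `c48d`. A leading `3f` means the character was already replaced by a literal question mark and cannot be recovered from those bytes, while a pattern such as `c384c28d` points at double encoding.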

An issue with the Czech character set is that some characters (those with acute accents) are found in western European subsets, hence are more likely to be rendered correctly. The other accents are mostly specific to Czech (with carons), and go down a different path. This explains why some of your samples exhibit two different failure cases. (Search for Czech on this forum; you may find more tips.)

After some experimentation...

?eské probably comes from the CHARACTER SET of the column in the table being latin1 (or other "latin"), plus establishing the connection as being latin1 when inserting the data. That can be seen on the browser when it is in Western mode, not utf8.

?esk� shows up if you do the above and also have latin1 as the connection during selecting. That is visible with the browser set to utf8.
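
To rule out the connection as the culprit when selecting, the charset can be named directly in the PDO DSN (supported since PHP 5.3.6). The sketch below rests on a few assumptions: `mysql_dsn()` and `connect_utf8()` are hypothetical helper names, and `utf8mb4` is my choice rather than the `utf8` used in the question. Note also that, as far as the PHP manual describes it, `PDO::MYSQL_ATTR_INIT_COMMAND` can only be used in the driver options array passed to the constructor, so setting it with `setAttribute()` after connecting would not run the `SET NAMES` command.

```php
<?php
// Hypothetical helper: build a MySQL DSN that names the connection charset.
function mysql_dsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return "mysql:host={$host};dbname={$dbname};charset={$charset}";
}

// Hypothetical helper: open the connection with the charset already declared
// and with exceptions enabled, instead of calling setAttribute() afterwards.
function connect_utf8(string $host, string $dbname, string $user, string $pass): PDO
{
    return new PDO(mysql_dsn($host, $dbname), $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);
}
```

With the constants from the question, the call would look like `$db = connect_utf8(DBHOST, DBNAME, DBUSER, DBPASS);`. This only fixes how the connection is declared; data that was already stored incorrectly stays as it is.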

Caveat: The analysis may not be the only way to get what you are seeing.

Rick James
  • Hi, thanks for response, I will go through and try. But for the first look, it seems like 99% of problem is how data is stored. Well, that is bad. I am creating a web app, which has to connect to many databases of different CMSs, and those CMSs are not in my reach and they might be set up differently. They have large databases of articles and they are running projects. Sadly, I simply cannot edit those databases, nor load and resave data to correct encoding. So I rather need solution, how to load any existing string with any (detected?) encoding and reencode it correctly for use within my app. – Zorak May 31 '17 at 18:28
  • There _may_ be a way to fix the data with a small number of SQL statements. It depends on how badly mangled the data is. Get the HEX (as discussed in my link), then see the "Fixes" in the comment on that page. But _simultaneous_ with fixing the data, the CMS must stop storing things incorrectly. – Rick James May 31 '17 at 21:09